WO1998016897A1 - Reconnaissance de formes isomorphes - Google Patents
Reconnaissance de formes isomorphes Download PDFInfo
- Publication number
- WO1998016897A1 WO1998016897A1 PCT/US1997/019130 US9719130W WO9816897A1 WO 1998016897 A1 WO1998016897 A1 WO 1998016897A1 US 9719130 W US9719130 W US 9719130W WO 9816897 A1 WO9816897 A1 WO 9816897A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cipher
- plaintext
- word
- labels
- words
- Prior art date
Links
- 238000000034 method Methods 0.000 claims abstract description 55
- 238000003909 pattern recognition Methods 0.000 claims abstract description 26
- 230000008569 process Effects 0.000 claims description 31
- 238000003384 imaging method Methods 0.000 claims description 7
- 238000004458 analytical method Methods 0.000 description 43
- 239000011159 matrix material Substances 0.000 description 21
- 238000013507 mapping Methods 0.000 description 17
- 238000005516 engineering process Methods 0.000 description 14
- 238000003491 array Methods 0.000 description 13
- 239000002131 composite material Substances 0.000 description 13
- 238000012805 post-processing Methods 0.000 description 12
- 238000007781 pre-processing Methods 0.000 description 9
- 230000006870 function Effects 0.000 description 8
- 238000012795 verification Methods 0.000 description 8
- 238000012545 processing Methods 0.000 description 6
- 238000013459 approach Methods 0.000 description 5
- 238000012015 optical character recognition Methods 0.000 description 5
- 230000003287 optical effect Effects 0.000 description 4
- 238000012550 audit Methods 0.000 description 3
- 238000010276 construction Methods 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000012986 modification Methods 0.000 description 3
- 238000006467 substitution reaction Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 241001137251 Corvidae Species 0.000 description 1
- BQCADISMDOOEFD-UHFFFAOYSA-N Silver Chemical compound [Ag] BQCADISMDOOEFD-UHFFFAOYSA-N 0.000 description 1
- 241000282887 Suidae Species 0.000 description 1
- 230000001133 acceleration Effects 0.000 description 1
- 230000004888 barrier function Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 210000001072 colon Anatomy 0.000 description 1
- 150000001875 compounds Chemical class 0.000 description 1
- 230000008602 contraction Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000012217 deletion Methods 0.000 description 1
- 230000037430 deletion Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 229910052709 silver Inorganic materials 0.000 description 1
- 239000004332 silver Substances 0.000 description 1
- 238000012916 structural analysis Methods 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 101150013925 whiA gene Proteins 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Definitions
- the present invention relates to systems and methods for pattern recognition, and more specifically to systems and methods for character recognition using Isomorphic Pattern Recognition. Description of Related Art
- Recognition accuracy is too sensitive to skewed, faded, and noisy raster (image) documents.
- the present invention is based upon the answer to an unintuitive question: Is it possible to determine the identity of a letter in a way that is independent of its optical features? For example, what if the letter "e” had its optical features altered within a raster document, and replaced everywhere with the symbol of a tiny rectangle having a silver dollar at its center. would existing OCR technologies be able to determine that this unusual symbol represents plaintext "e” ? The answer is clearly no, since the optical properties of this symbol look nothing like what traditional OCR technologies would expect for the letter "e”. The present invention, however, correctly identifies this randomly chosen symbol as plaintext "e”, quickly and accurately. This subtle insight, along with many others, allows a truly automated approach to recognition that is invisible to users, and acceptable to mainstream customers.
- a resource receives a plurality of representations of a plurality characters.
- a resource assigns a plurality of cipher labels to representations in the plurality of representations, and a resource identifies a plaintext identity of cipher labels in the plurality of cipher labels using pattern recognition.
- the pattern recognition includes Isomorphic Pattern Recognition.
- the resource that assigns the plurality of cipher labels to representations in the plurality of representations assigns a different cipher label to each different representation in the plurality of representations.
- the resource that assigns the plurality of cipher labels to representations in the plurality of representations produces a plurality of cipher words.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates sets of cipher labels selected from the plurality of cipher labels.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates a Document Word List which contains cipher words in the plurality of cipher words.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that selects a Selected Word from the Document Word List, the Selected Word containing at least one cipher label to be identified.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates a candidate plaintext word list of plaintext words in a plurality of plaintext words that have a same isomorphic pattern as a Selected Word.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that selects a Trial Word from the Document Word List, the Trial Word containing at least one cipher label in common with the Selected Word.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words that cannot also form a valid plaintext Trial Word.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words using a Raster- Plaintext Library wherein the Raster-Plaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that substitutes a plaintext identity for identified cipher labels in cipher words in the plurality of cipher words in the Document Word List.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that attempts to identify cipher labels in the plurality of cipher labels until all cipher labels have been identified.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that assigns a Pattern Tag to a word in a Document Word List which contains unrecognized cipher labels, wherein the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.
- Another embodiment of the invention comprises a memory, a resource for receiving a received cipher pattern, and a processor, wherein the processor identifies plaintext words in a plurality of plaintext words stored in the memory which have a same cipher pattern as the received cipher pattern.
- the plaintext words in the plurality of plaintext words identified by the processor are isomorphic.
- Still another aspect of the invention comprises an imaging device which generates an image of a document, and the processor includes a resource that translates the image into a plurality of cipher patterns, and supplies the received cipher patterns to the processor.
- the plurality of plaintext words includes words in more than one language.
- Another embodiment of the invention comprises receiving a plurality of representations representing a document, the representations containing a plurality of characters, assigning cipher labels to characters in the plurality of character in representations in the plurality of representations, and identifying a plaintext identity of cipher labels in the plurality of cipher labels using pattern recognition.
- pattern recognition includes
- assigning cipher labels includes assigning a different cipher label to each different representation in the plurality of representations.
- assigning cipher labels to characters in the plurality of characters in representations in the plurality of representations includes producing a plurality cipher words.
- identifying the plaintext identity of cipher labels in the plurality of cipher labels includes creating a Document Word List which contains cipher words in the plurality of cipher words. A Selected Word is selected from the Document Word List, the Selected Word containing at least one cipher label to be identified.
- a candidate plaintext word list is created containing plaintext words in a plurality of plaintext words that have a same isomorphic pattern as the Selected Word.
- a Trial Word is selected from the Document Word List, the Trial Word containing at least one cipher label in common with the Selected Word.
- Candidate plaintext words that cannot also form valid plaintext Trial Words are eliminated.
- identifying the plaintext identity of cipher labels in the plurality of cipher labels includes eliminating candidate plaintext words using a Raster- Plaintext Library wherein the Raster-Plaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters.
- identifying the plaintext identity of cipher labels in the plurality of cipher labels includes assigning a Pattern Tag to a cipher word in the Document Word List which contains unrecognized cipher labels, wherein the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.
- Still another embodiment of the invention comprises an imaging device which generates a representation of a document, the representation including a plurality of images of characters in the document, a processor coupled to the imaging device capable of receiving the representation of the document, the processor assigning a plurality of cipher labels to images in the plurality of images, and the processor using pattern recognition to assign plaintext characters to cipher labels in the plurality of cipher labels.
- the pattern recognition may include Isomorphic Pattern Recognition.
- the processor includes a resource which assigns a Pattern Tag to a cipher word in the Document Word List which contains unidentified cipher labels.
- the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.
- the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words using a Raster-Plaintext Library wherein the Raster-Plaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters.
- a raster document is converted into a ciphertext document wherein each raster character is related to a different ciphertext label. This converts words comprised of raster characters into cipher words comprised of ciphertext labels.
- a Document Word List is then created from the ciphertext document. The Document Word List contains one copy of each unique cipher word in the ciphertext document. A word is chosen from the Document Word List which contains at least one cipher label which is desired to be identified. This word will be referred to as the Selected Word.
- the plaintext Isomorphic Dictionary is then used to identify candidate plaintext words that have the same isomorphic pattern as the Selected Word. Trial Words are then chosen from the Document Word List to eliminate unacceptable candidates from the list of candidate plaintext words.
- the Trial Word should have at least one cipher label in common with the Selected Word in order to help eliminate candidate plaintext words. For each Selected Word, Trial Words continue to be selected in an attempt to eliminate unacceptable candidates from the list of candidate plaintext words until only one candidate plaintext word remains.
- the Selected Word is skipped and a new Selected Word is chosen. If Trial Words are able to be used to eliminate all but one candidate plaintext word, then the one remaining candidate plaintext word is substituted back into the Document Word List in the place of the Selected Word. Additionally, the plaintext identities of the cipher labels identified in the Selected Word are then substituted throughout the ciphertext document and the Document Word List. A new Selected Word is then chosen and the process is repeated until all the cipher labels that can be recognized have been recognized.
- the longer words in the Document Word List are chosen first because they usually contain the greatest number of unique cipher labels to be identified . Also, longer Selected Words generally result in fewer candidate plaintext words, requiring fewer Trial Words to be selected and resulting in more cipher labels being identified. This speeds up the character recognition process.
- words in the Document Word List which still contain unrecognized cipher labels after all of the cipher labels that can be identified have been identified, are assigned a Pattern Tag.
- the Pattern Tag presents possible plaintext answers for the word and is useful, for example, when a full text search of the recognized document is performed.
- Figure 1 is a schematic diagram representing an embodiment of the present invention depicting various possible input devices.
- Figure 2 depicts an embodiment of the system showing a document to be feed into a scanner attached to the computer system.
- Figure 3 shows an embodiment of the system in which the computer system has an external connection with another computer system.
- Figure 4 shows a sample multi-language ciphertext document and a document word list generated from the document.
- Figure 5 shows relevant excerpts from a sample English German plaintext dictionary.
- Figure 6 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #1 and Trial Word #1(1).
- Figure 7 depicts selected word #1 and Trial Word #1(1) and their respective
- Figure 8 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #11 and Trial Word #11(1).
- Figure 9 depicts selected word #11 and Trial Word #11(1) and their respective
- Figure 10 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #111 and Trial Word #111(1).
- Figure 11 depicts selected word #111 and Trial Word #111(1) and their respective Inherited isomorphic patterns along with Candidate plaintext words from the dictionary.
- Figure 12 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #IV, Trial Word #IV(1), and Trial
- Word #IV(2) Figure 13 depicts selected word #IV, Trial Word #IV(1), and Trial Word
- Figure 14 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #V, and Trial Word #IV(1).
- Figure 15 depicts selected word #V, and Trial Word #V(1) and their respective
- Figure 16 depicts the solution mappings found as a result of the recognition and the plaintext identity of cipher words in the document word list.
- Figure 17 shows the "English” and “German” plaintext solutions.
- Figure 18 depicts an example of the use of the Library function to help identify the plaintext identity of cipher labels.
- Figure 19 depicts an embodiment of the invention in which a search engine recreates and searches a list of candidate plaintext words.
- Figure 20 is a blob showing the letter "c.”
- Figure 21 shows two blobs and the results of the XOR operation.
- Figure 22 shows an image of a "t" touching an "n.”
- Figure 23 shows an image of a non-composite "n.”
- Figure 24 shows a sample Answer-Array.
- Figure 25 shows another sample Answer-Array.
- Figure 26 shows an array rearranged so that words having the most repeated letters appear first.
- Figure 27 depicts a Trial Word array.
- Figure 28 depicts an Isomorphic Array.
- Figure 31 shows the matrices used to generate the candidate plaintext answers for the Pattern Tag.
- Figure 32 shows a sample TW- Array.
- Figure 33 shows a sample Touching_Array for "managem@t.”
- Figure 34 shows the @- Array.
- Figure 35 shows an example of an array grouping of common word both non- pattern and pattern words.
- Figure 1 is a diagram illustrating recognition system 100.
- computer system 102 is connected through cable 104 to one or more input devices.
- Input devices may include, but are not limited to traditional scanner 106, multifunction product with scanner 108, storage device 110, network connection 112, digital photocopier 114, or touch sensitive device 116, or any other device which can provide input to computer system 102. Without deviating from the present invention, any of theses devices could be incorporated into computer system 102.
- Computer system 102 contains plaintext Isomorphic Dictionary 118 and Isomorphic Recognition resource 120.
- computer system 102 may include a Raster-Plaintext Library.
- recognition system 100 may include a hand-held screen input device. Such a device may create a raster document or representations directly from the images input on the screen.
- Figure 2 depicts another embodiment of the present invention, recognition system 200.
- computer system 202 is connected through cable 204 to scanner 206.
- Document 208 contains characters in a plurality of characters which it is desired to recognized.
- Document 208 is fed into scanner 206.
- Scanner 206 converts document 208 into a plurality of representations which corresponds to characters in the plurality of characters in document 208. Representations in the plurality of representations are then transferred to computer system 202 through cable 204.
- Representations in the plurality of representations are stored in memory 210 in computer system 202. Then, as discussed in detail below, processor 212 in computer system 202 then processes representations in the plurality of representations stored in memory 210 and uses plaintext Isomorphic Dictionary 214 and Isomorphic Recognition resource 216 to help identify characters in the plurality of characters represented by representations in the plurality of representations. Additionally, computer system 202 may include a Raster- Plaintext Library.
- Figure 3 depicts networked recognition system 300.
- Networked recognition system 300 includes computer system 302 connected through cable 304 to digital photocopier system 306.
- Document 308 contains characters in a plurality of characters which it is desired to recognized.
- Document 308 is fed into digital photocopier 306.
- Digital photocopier 306 converts document 308 into representations in a plurality of representations which corresponds to characters in a plurality of characters in document 308. Representations in the plurality of representations are then transferred to computer system 302 through cable 304.
- processor 312 in computer system 302 processes representations in the plurality of representations stored in memory 310 and uses plaintext Isomorphic Dictionary 314 and Isomorphic Recognition resource 316 to help identify characters in the plurality of characters represented by representations in the plurality of representations.
- computer system 202 may include a Raster- Plaintext Library, and the Raster-Plaintext Library may also be used to help identify characters in the plurality of characters represented by representations in the plurality of representation.
- the identified characters in the plurality of characters are used to construct electronic document 318 corresponding to document 308.
- Electronic document 318 can be sent over network connection 320 to other computers 322.
- Network connection 320 can include local-area networks, wide-area networks, the Internet, or any other connection to a system external to computer system 302.
- the mapping between cipher labels and plaintext letters in order to perform character recognition using the present invention proceeds as follows.
- a comprehensive plaintext Isomorphic Dictionary is constructed. This dictionary should be arranged so that all plaintext words in the dictionary having the same Root isomo ⁇ hic pattern are grouped together.
- This plaintext Isomo ⁇ hic Dictionary may also be multi- language, i.e. it may contain a superset of the words from many different languages, where each language is alphabetical in construction. This will allow multi-language ciphertext documents to be solved as easily as ciphertext documents in one language only.
- a document is scanned into a recognition system, such as that shown in Figure 2.
- the representations of the characters in the document are stored in memory 210.
- processor 212 uses the representations to construct a ciphertext document containing cipher words.
- a sample ciphertext document is shown in Figure 4.
- a Document Word List as shown in Figure 4, is created which contains one copy of each unique cipher word in the ciphertext document.
- This embodiment of the invention only has the information about the document contained in the representations, thus the cipher labels are unique, assigned to the representations based only on information contained within the representations.
- a word is chosen from the Document Word List that contains at least one cipher label that we wish to recognize. This chosen word will be referred to here as the Selected Word.
- the comprehensive plaintext Isomorphic Dictionary is used to find all plaintext words that have the same Root isomorphic pattern as the Selected Word.
- Figure 5 shows a sample of relevant excepts from an embodiment of a plaintext Isomorphic Dictionary containing English and German words. In Figure 5, Root isomorphic patterns are represented by capital letters and plaintext words are represented by lower case letters. This list of plaintext words will be referred to here as the candidate plaintext words.
- the goal is to attempt to determine which of these candidate plaintext words is the one correct answer being sought, i.e. the correct plaintext substitute for the Selected Word.
- Trial Words are used.
- the Trial Words are chosen to help eliminate unacceptable candidates from among the list of candidate plaintext words.
- Each Trial Word is chosen from the Document Word List as depicted in Figure 6.
- each Trial Word is allowed to come from a language different from the language of the Selected Word. The only requirement is that each Trial Word must have at least one cipher label in common with the Selected Word.
- the solution methodology consists of examining each candidate plaintext word, one at a time, and eliminating those candidates that cannot also form a valid plaintext Trial Word.
- a plaintext word is considered valid if that word is contained within the comprehensive plaintext Isomorphic Dictionary. This process is depicted in detail in Figure 7.
- the Selected Word #1 on the top line above is converted into its Inherited Isomorphic pattern.
- the Inherited isomo ⁇ hic pattern is equal to Root isomorphic pattern for this Selected Word.
- Using Group #1 in Figure 5 all of the candidate plaintext words are listed for the Inherited isomorphic pattern.
- Trial Word #1(1) In the Inherited isomorphic pattern for Trial Word #1(1), "K” is chosen to represent "S” from the Original pattern, to distinguish it from the inherited labels ("L,M, etc.” could also be used).
- Trial Word #1(1) we now see if its possible to eliminate any unacceptable candidate plaintext words.
- a candidate for Selected Word
- Trial Words We stop choosing Trial Words when one candidate plaintext word remains (for the Selected Word).
- Trial Words continue to be selected even after only one candidate plaintext word remains in order to verify that the remaining candidate plaintext word is the correct one.
- the Inherited isomorphic pattern for the Trial Word must first be converted into its equivalent Root isomo ⁇ hic pattern before using the dictionary.
- Trial Words continue to be chosen in an attempt to eliminate additional unacceptable candidate plaintext words until one and only one candidate plaintext word remains. If only one plaintext word remains, then it is assumed to be the one correct answer being sought, and it is substituted in place of the Selected Word. This establishes a mapping between the cipher label(s) in this Selected Word and their plaintext identities (in the candidate word), i.e., we have performed character recognition on these cipher labels.
- each word in the Document Word List containing (unrecognized) cipher labels is assigned a Pattern Tag, which present possible plaintext answers for that word.
- Pattern Tags allow these words to be found during a full-text search of the recognized document. For example if a search is conducted in the document for a plaintext word. The use of Pattern Tags allows a search engine to return the appropriate plaintext result. If any cipher word in the Document Word List still has unidentified cipher labels, but is isomorphic with the plaintext word being searched for and the plaintext letters in the cipher word are the same as the plaintext letters in the search word then the Pattern Tag will return the appropriate plaintext word.
- Raster-Plaintext Library in order to attempt to identify which of the candidate plaintext words is the correct one.
- the Raster- Plaintext Library is used in addition to the use of Trial Words.
- the Raster-Plaintext Library is used instead of Trial Words.
- the Raster-Plaintext Library contains extensive data on raster letters, and their plaintext identities, from an exhaustive collection of fonts.
- a "Library Function" L(R,P) can be defined as follows:
- the raster letter 'R' and its equivalent cipher label 'C will be used interchangeably in the "Library Function", i.e. L(C,P) implies L(R,P) and L(R,P) implies L(C,P), with the understanding that cipher label 'C is just a convenient placeholder for the actual raster letter 'R'.
- N sw total number of instances of cipher labels in the Selected Word.
- This "Library Function” is used to help eliminate unacceptable candidates from among the list of candidate plaintext words generated as discussed above.
- the plaintext identities of the cipher labels recognized are then substituted throughout the Document Word List, into all cipher words containing these cipher labels.
- This type of substitution process is equivalent to performing character recognition instantaneously, instead of sequentially (as is done in alternative recognition technologies). Selected Words continue to be chosen from the Document Word List until all cipher labels that can be recognized have been recognized.
- Figure 18 is an example showing the use of the "Library Function". In this example there are 3 cipher labels yet to be identified remaining in the Selected
- the "Library Function” will eliminate all candidate plaintext words. In this case, the "Library Function” is not able to isolate the one correct answer from among the list of candidate plaintext words. In one aspect of the invention, however, to ensure that this one correct answer can be found during a full-text search of the document, a search engine will dynamically recreate and search the list of candidate plaintext words by using the Unrecognized Selected
- This Unrecognized Selected Word's cipher pattern will be referred to here as its Pattern Tag.
- Figure 19 depicts one embodiment of the invention in which a search engine would recreate and search a list of candidate plaintext words for the
- Pattern Tag under consideration in accordance with the present invention.
- the search engine of this embodiment of the present invention will recreate and search the list of candidate plaintext words generated.
- the search engine will use the plaintext Isomorphic Dictionary to recreate the list of candidate plaintext words fitting the criteria.
- the Input Search Word is found even though cipher labels remained unidentified in the document.
- the Raster-Plaintext Library could be used to eliminate some of the spurious candidate Plaintext words recreated from Pattern Tags stored in the scanned documents output file.
- a 'blob' (or connected-component) is defined as a group of non- white pixels (assuming a white image background) that touch one another. For example, all of the black pixels in the shape of the raster letter 'A'.
- an 'image' is defined as a raster image, perhaps generated by scanning a paper document, or it may be a pre-existing raster image.
- NST National Institutes of Standards and Technology
- Each blob is returned by the NIST algorithm as a rectangular raster image.
- the dimensions of the blob image depend on the size of the printed character and the resolution at which the page was scanned.
- a blob for the letter 'c' might have the image, as shown in Figure 20 where the '0' values are white (background) pixels and the 'l 's are black pixels defining the letter.
- the image of a second blob is either: (a) another instance of the first blob, or
- Exclusive OR (written 'XOR') is a mathematical operator that is used to compare pixels in binary images (i.e., images where pixels are either black or white). By overlapping the images of two blobs, the XOR operator returns a 'result' image that shows which pixels in the two images matched and which mismatched. In the XOR image, a '1' shows where pixels in the original images did not match and a '0' where they did. For example, consider the following two blob images shown in Figure 21. By examining the XOR result of pairs of blobs extracted from the image, it is possible to determine how closely they match one another.
- a pair of blobs matches after the XOR operation within a defined tolerance, the pair is deemed lo be instances of the same blob pattern. If the blobs mismatch, then further analyses is performed (see below) to determine whether or not they match.
- the histograms are simply measures of the number of black pixels in the image rows (in the vertical histogram) and columns (in the horizontal histogram). For example, the vertical and horizontal histograms for the two sample blobs shown in i.) above would be:
- blob 2 vertical histogram: 1 1 1 3 horizontal histogram:
- the pair of blob images is considered to be instances of the same blob, otherwise a potential mismatch is declared and further tests are applied to be certain.
- a scanline count is defined to be the number of transitions from a black pixel to a white pixel while moving from left to right (vertical scanline measure) and from top to bottom (horizontal scanline measure) along a given row/column of pixels within a blob image. When reaching the end of a scanline at the edge of the image, the next pixel, were it there, is considered to be white.
- the scanline counts for the two sample images above are:
- Scanline information is very useful, for example, in distinguishing between the letters 'o' and 'c' since the vertical scanline measure will highlight the gap on the right side of the image for 'c', the feature that distinguishes it from an 'o'. Similarly, the horizontal scanline can show the differences between an 'h' and a 'b'.
- the word structure information is stored in the Metafile (output file) so that the cipher words of the document are solved the moment the plaintext identities for the individual blobs are recognized.
- any punctuation symbol exhibit distinct typesetting behavior.
- any blob with an image that looks like a black circle, and where that blob lies on or close to a baseline (determined in 4 above) is considered to be a 'period' or 'decimal point'.
- Another example is a 'hyphen' which lies roughly midway between successive baselines and has a rectangular shape with a horizontal dimension considerably greater than its vertical one. Therefore by examining the relationships between blobs and baselines, it can be determined which blobs are punctuation symbols and which are not.
- a blob image comprises more than one touching character.
- the image in Figure 22 shows a 't' touching an 'n':
- Answer- Array contains exactly 1 copy of each disguise (or cipher label). 2.) The contents of Answer- Array should remain available, even after recognition subroutine has ended, to ease analysis of future text in document. The contents are deleted only after the entire document has been processed. 3.) The moment that a "plaintext identity" is entered into the "Plaintext” column, this identity information is made available throughout the software instantly (globally): nP -Array, nNP- Array, LA- Array, etc., everywhere.
- a sample Answer- Array is shown in Figure 24.
- Plaintext is that specific plaintext identity that is either 1.) determined uniquely from TW-Array analysis, or 2.) selected from "lower case” or "upper case” possibilities, using traditional methods. This traditional method is not used, however, until after the last line of the last page of the entire document has been processed. We strategically delay using this traditional method until the end because our hope is that, by finding future words containing the disguise in question, we can use the TW-Array analysis to more "confidently” uncover the plaintext identity of this disguise.
- the sample answer array for this example is shown in Figure 25.
- a tiny array called a "Letter- Array” will contain each unique "number” of the font that will be sent to the Recognition software.
- Letter-Array [ 1 2 4 7 ]. This implies that only words having disguises in fonts # 1, 2, 4, and 7, will be sent to the Recognition software.
- the threshold 13 is chosen because digits and associated symbols will almost never exceed 13 disguises, excluding punctuation [ - , ( ) . ] and symbol [ % ] ).
- this font can contain either "digits" or ALL Capital letter disguises. We must analyze this scenario further before deciding.
- Digit- Array [ 3 5 8 ]•
- ALLCap-Array [ 6 9 10 ].
- Collection II i.e., an acceptable disguise (for a Capital letter) can only be preceded by another Capital letter or punctuation mark.
- Collection ⁇ (Final) [ k E ]
- Each disguise in the final form of Collection II "may” represent an additional disguise for a possible Capital-letter.
- the "overall” text stream can contain words from multiple fonts. Separate the "overall” text stream into “j” new text streams. There will be one text stream for each font # listed in the Letter- Array and the ALLCap-Array. Text Stream #1, Text Stream #2, ... , Text Stream #j. Here, Text Stream
- Text Stream #2 contains "only” words having all of their letters in font #2, etc.
- Each Text Stream will be analyzed separately in the Recognition software. After each Text Stream is finished being processed by the Recognition software, sequentially input the next Text Stream into the Recognition software until ALL text in the document has been processed.
- Any 2-letter pattern words must be an Acronym, since no 2-letter intelligible pattern words exits in English.
- Non-Pattern 1 -letter words are excluded, since they cannot be analyzed with pattern word software.
- Skip ALL 1 -letter non-pattern words Do not read them in! Only 1 copy of each pattern or non-pattern word goes into each of the nP-Arrays or nNP-Arrays.
- All pattern words "tagged” as containing at least 1 "segmented pair of characters" are stored as the last entries in each nP-Array.
- All non-pattern words "tagged” as containing at least 1 "segmented pair of characters" are stored as the last entries in each nNP-Array.
- TW-Array Trial Word- Array
- non-pattern words are selected only after ALL pattern words, containing at least 1 disguise that is the same as a disguise in the Selected Word, have been exhausted. See Figure 27.
- Isomo ⁇ h Isomorphic- Array, if initial letter of word is "Capital"
- Words are not, however, put in the LA- Array. At this point, we choose another Selected Word (if one exists) and continue our analysis. Whenever a word is put into the LA- Array, it must be deleted from its original nP- or nNP -Array.
- Possibility (1) can arise only if there exist at least 1 "tagged” word in the TW-Array, and All Rows contain a "0" entry.
- scenario (2) we move immediately to the Resegmentation Algorithm discussed below, for further analysis, with the pre-existing words in TW-Array intacted. If, however, no "tagged” word exist, then we are dealing with scenario (2), (3), (4), (5), or (6). In this case, we put the Selected Word and its associated Trial-Words into the LA- Array. Within the LA- Array, the 1 of 5 specific scenarios that we are presently faced with will be determined, and referred to the appropriate algorithm for further analysis.
- the incorrectly segmented disguises after they are resegmented correctly, may now have identities in "Plaintext" column, since those correct disguises may have appeared in other words.
- the disguises will always remain disguises, until their plaintext identities are uncovered with a Broken- Array analysis.
- the pattern word software can be used to analyze disguised variant words, where the variant-cipher letter may create 2 disguises for the same plaintext letter within the same word, even though the software was not designed to handle this scenario.
- Isomorphic Array ⁇ Variant word Number of variant letters (1 or 2), location of variant #1 (from left), location of variant #2 (from left) ⁇ .
- I ⁇ ZntHmHdatHonl ZntHmHdatHon AntAmAdatBon
- the Bounding Box 5 for A
- I ⁇ ZntHmHdatHon, 1 ,_ , 1 ⁇ is equivalent to the fully expanded array 1 '
- this initial letter Z must necessarily be the disguise for the plaintext Capital-letter represented by the disguise H'
- AntAmAdatAon (and others)
- Pattern Tags containing 1 cipher label is treated just like a Capital Letter.
- the search engine will put this Pattern Tag into the Pattern word software using the Isomorphic Array
- Pattern software automatically prevents different disguises, within the same word and each representing lower-case letters (Boxes not equal to 5), from having the same plaintext identities.
- Isomorphic-Array which uses pattern word software, does not prevent the Capital letter and a lower-case letter, within the same word, from having the same plaintext identities.
- a Sample TW-Array is shown in Figure 32.
- pattern word software automatically prevents different disguises representing lower-case letters, within the same word, from having the same plaintext identities.
- Trial-Word #2 C a D B i. "B" in Trial-Word #2 has the same plaintext identity as "B” in Selected-Word, for each Row separately, ii. Neither C nor D is allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext” column, for font #j. Note that lower-case "a” is already in "Plaintext” column, (pattern word software automatically enforces the restriction: B cannot equal C cannot equal D cannot equal a) iii. Neither C nor D can equal A, since C and D cannot have the same plaintext identity as a different disguise, also representing a lower-case letter (Box not equal to 5), in the Selected-Word.
- Trial-Word #3 I ⁇ C A D E e ⁇ i. "A" in Trial-Word #3 has the same plaintext identity as "A” in Selected- Word, for each Row separately. ii. C is not allowed to have the same plaintext identity as a
- Trial-Word #4 C D A E F i. "A" in Trial-Word #4 has the same plaintext identity as "A” in Selected-Word, for each Row separately, ii. Neither C_,_D, E, nor F is allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext" column, for font #j. (pattern word software automatically enforces the restriction: A cannot equal C cannot equal D cannot equal E cannot equal F) iii.
- Trial-Word #5 I ⁇ C A D E F ⁇ i. "A" in Trial-Word #5 has the same plaintext identity as "A” in Selected-Word, for each Row separately. ii. C is not allowed to have the same plaintext identity as a
- Resegmentation is required when there exist at least 1 "tagged" word in the TW- Array, and ALL Rows in contain a "0" entry.
- the Resegmentation algorithm When the Resegmentation algorithm is called, it must go to work on a TW-Array with "pre-existing" Selected Word and/or Trial-Words. There are two cases that we must consider: Note: The instant when Resegmentation of a pair of letters occurs, transmit this information immediately throughout the entire document: All Algorithms and All Arrays! All "tagged” words containing "j" incorrectly segmented pairs will have their "word-lengths" reduced by "j” letters as well, after Resegmentation has occurred.
- the Selected Word is the 1 st "tagged" word encountered in TW-Array.
- Selected Word is a "tagged” word, we eliminate immediately All preexisting Trial-Words from TW-Array, and choose new Trial-Words after Resegmentation of Selected Word is completed. This is done since all Trial-
- Trial-Word #j is a "tagged" word, we eliminate immediately All Trial-Words that follow Trial-Words #j in the TW-Array. if any exist (i.e.. we eliminate Trial-Words #j+l. Trial-Words #j+2. . . .). Now, perform Resegmentation on Trial-Words #j itself. If, after Resegmentation, Trial-Words #j still contains at least 1 disguise identical to a disguise in
- Trial-Words #j no longer contains any disguises identical to a disguise in Selected Word, then remove this word from the TW-Array (through not from nP- or nNP-Arrays). choose a new Trial-Word, and proceed as usual. We continue choosing Trial-Words until a unique plaintext answer emerges for Selected Word, or the TW-Array analysis fails again, etc. Resegmentation is performed as described in 1(a) and (b).
- a word "suspected" of containing 1 "pair” of touching letters may, in fact, just contain a "single letter", where the word containing it was tossed out of the TW-Array because there was no Trial-Word to verify its identity (perhaps because disguise is a low frequency letter). We have to allow for both possibilities.
- each plaintext word in font #j, which contains a Resegmented problem letter [m, w, d, or W] must be verified as a legitimate English word, by the pattern word software.
- the Metafile should be able to discriminate between the relative importance of words on each page of a document.
- An efficient database search engine will not allow users to search for common words. These words, when entered on the command line, will be ignored.
- the Metafile can be designed with an extremely efficient data structure for indexing words on each page of a document, with common words given the lowest priority. This will increase tremendously the speed with which database searches can be accomplished.
- the 100 most common words in English are listed below. The figures give occurrences in 242,432 words.
- Pattern Words (4) Common Non-Pattern Words.
- the words in each group are ordered alphabetically, and according to word-length. Each word has pointers to every x-y location, and line, on each page where it occurs within the document. Each word may also be expressed in many different fonts depending on its x-y location. The main benefit to this ordering is that the database engine is only required to uncompress and search the very beginning of the Metafile, and not the entire document, to know everything contained in that document.
- Figure 35 shows an example of an array grouping of common word (non-pattern and pattern).
- ALL Arrays nP-, nNP-, Answer-, etc.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Character Discrimination (AREA)
Abstract
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU49150/97A AU4915097A (en) | 1996-10-16 | 1997-10-15 | Isomorphic pattern recoginition |
Applications Claiming Priority (10)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US2864996P | 1996-10-16 | 1996-10-16 | |
US2857896P | 1996-10-16 | 1996-10-16 | |
US2857596P | 1996-10-16 | 1996-10-16 | |
US60/028,649 | 1996-10-16 | ||
US60/028,575 | 1996-10-16 | ||
US60/028,578 | 1996-10-16 | ||
US08/946,680 US6275610B1 (en) | 1996-10-16 | 1997-10-08 | File structure for scanned documents |
US08/946,680 | 1997-10-08 | ||
US08/950,287 | 1997-10-14 | ||
US08/950,287 US6094484A (en) | 1996-10-16 | 1997-10-14 | Isomorphic pattern recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1998016897A1 true WO1998016897A1 (fr) | 1998-04-23 |
Family
ID=27534239
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1997/019130 WO1998016897A1 (fr) | 1996-10-16 | 1997-10-15 | Reconnaissance de formes isomorphes |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU4915097A (fr) |
WO (1) | WO1998016897A1 (fr) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5384863A (en) * | 1991-11-19 | 1995-01-24 | Xerox Corporation | Methods and apparatus for automatic modification of semantically significant portions of a document without document image decoding |
US5410611A (en) * | 1993-12-17 | 1995-04-25 | Xerox Corporation | Method for identifying word bounding boxes in text |
US5455871A (en) * | 1991-11-19 | 1995-10-03 | Xerox Corporation | Detecting function words without converting a scanned document to character codes |
US5539841A (en) * | 1993-12-17 | 1996-07-23 | Xerox Corporation | Method for comparing image sections to determine similarity therebetween |
US5557689A (en) * | 1991-11-19 | 1996-09-17 | Xerox Corporation | Optical word recognition by examination of word shape |
US5642435A (en) * | 1995-01-25 | 1997-06-24 | Xerox Corporation | Structured document processing with lexical classes as context |
US5689620A (en) * | 1995-04-28 | 1997-11-18 | Xerox Corporation | Automatic training of character templates using a transcription and a two-dimensional image source model |
-
1997
- 1997-10-15 AU AU49150/97A patent/AU4915097A/en not_active Abandoned
- 1997-10-15 WO PCT/US1997/019130 patent/WO1998016897A1/fr active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5384863A (en) * | 1991-11-19 | 1995-01-24 | Xerox Corporation | Methods and apparatus for automatic modification of semantically significant portions of a document without document image decoding |
US5455871A (en) * | 1991-11-19 | 1995-10-03 | Xerox Corporation | Detecting function words without converting a scanned document to character codes |
US5557689A (en) * | 1991-11-19 | 1996-09-17 | Xerox Corporation | Optical word recognition by examination of word shape |
US5410611A (en) * | 1993-12-17 | 1995-04-25 | Xerox Corporation | Method for identifying word bounding boxes in text |
US5539841A (en) * | 1993-12-17 | 1996-07-23 | Xerox Corporation | Method for comparing image sections to determine similarity therebetween |
US5642435A (en) * | 1995-01-25 | 1997-06-24 | Xerox Corporation | Structured document processing with lexical classes as context |
US5689620A (en) * | 1995-04-28 | 1997-11-18 | Xerox Corporation | Automatic training of character templates using a transcription and a two-dimensional image source model |
Also Published As
Publication number | Publication date |
---|---|
AU4915097A (en) | 1998-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Parhami et al. | Automatic recognition of printed Farsi texts | |
Fujisawa et al. | Segmentation methods for character recognition: from segmentation to document structure analysis | |
US6047251A (en) | Automatic language identification system for multilingual optical character recognition | |
Ma et al. | Joint layout analysis, character detection and recognition for historical document digitization | |
Bansal et al. | Integrating knowledge sources in Devanagari text recognition system | |
US5491760A (en) | Method and apparatus for summarizing a document without document image decoding | |
US5943443A (en) | Method and apparatus for image based document processing | |
US6950533B2 (en) | Sorting images for improved data entry productivity | |
US6721451B1 (en) | Apparatus and method for reading a document image | |
US5067165A (en) | Character recognition method | |
US5745600A (en) | Word spotting in bitmap images using text line bounding boxes and hidden Markov models | |
US5438630A (en) | Word spotting in bitmap images using word bounding boxes and hidden Markov models | |
US5825919A (en) | Technique for generating bounding boxes for word spotting in bitmap images | |
US20040006467A1 (en) | Method of automatic language identification for multi-lingual text recognition | |
JPH10507025A (ja) | 走査された及びリアルタイムの手書き文字の識別を行う文字認識システム | |
WO2009070032A1 (fr) | Procédé de traitement de données de reconnaissance optique de caractères (ocr), la sortie comprenant des images de caractères visuellement altérées | |
Pal et al. | OCR in Bangla: an Indo-Bangladeshi language | |
JPH07200744A (ja) | 判読困難な文字の識別方法及び装置 | |
Kim et al. | Automated labeling in document images | |
Duygulu et al. | A hierarchical representation of form documents for identification and retrieval | |
US6094484A (en) | Isomorphic pattern recognition | |
Gopisetty et al. | Automated forms-processing software and services | |
Lladós et al. | Indexing historical documents by word shape signatures | |
Lebourgeois et al. | Document analysis in gray level and typography extraction using character pattern redundancies | |
EP0602955B1 (fr) | Reconnaissance de texte |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH HU ID IL IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW AM AZ BY KG KZ MD RU TJ TM |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: CA |