How do I convert a scanned (handwritten) article into Word format?
Scan the text; the results are saved on the computer as image files (for example, .bmp). Then use an OCR recognition system to convert the images to text, and finally use Word to revise and edit the result. Here is how OCR works:

OCR is the abbreviation of optical character recognition, that is, recognizing characters by optical means. It is an important area of research and application in automatic recognition technology. OCR is a software technology that automatically recognizes characters and inputs them into the computer, and it is the main software that accompanies a scanner. It belongs to the category of non-keyboard input and requires an image input device, mainly a scanner. At present, OCR mainly refers to character recognition software. Before 1996, when Ziguang began bundling Chinese recognition software with its scanners, scanners and OCR software were always sold separately. OCR software has been upgraded continuously since then, and scanner manufacturers now bundle professional OCR software with their scanners. The rapid development of OCR technology is closely related to the wide use of scanners. In the last two years, with the gradual popularization of scanners and the improvement of OCR technology, OCR has become an indispensable assistant for most scanner users.

I. Development of OCR technology

Since the first generation of OCR products appeared in the early 1960s, more than 30 years of continuous development and improvement have produced remarkable results in research on all kinds of OCR technology, including handwriting recognition. User requirements for OCR products have also shifted from recognition rate alone to higher demands on recognition speed, user-friendly interfaces, ease of operation, product stability, adaptability, reliability, ease of upgrading, and the quality of pre-sales and after-sales service.

The first OCR product was developed by IBM. In 1965, IBM's OCR product, the IBM 1287, was exhibited at the New York World's Fair. At that time, the product could only recognize printed digits, English letters and some symbols, and only in a designated font. In the late 1960s, Hitachi and Fujitsu also developed their own OCR products. The world's first automatic letter sorting system that recognized handwritten postal codes was developed by Toshiba Corporation of Japan, and NEC Corporation introduced a similar system two years later. By 1974, the automatic sorting rate of letters had reached about 92%, and such systems were widely and successfully used in postal services. In 1983, Toshiba released the OCRV595, an OCR system for recognizing printed Japanese kanji, with a recognition speed of 70 to 100 characters per second and a recognition rate of 99.5%. Toshiba later began research on recognizing handwritten Japanese characters.

Research on OCR technology in China started relatively late. Work on recognizing digits, English letters and symbols began in the 1970s, and research on Chinese character recognition began in the late 1970s. In 1986, the information field of the National 863 Program organized Tsinghua University, Beijing Institute of Information Technology and Shenyang Institute of Automation to jointly develop Chinese OCR software. In 1989, Tsinghua University took the lead in launching China's first Chinese OCR software, Tsinghua Wentong TH-OCR 1.0, and Chinese OCR officially moved from the laboratory to the market. Tsinghua later introduced TH-OCR 92, a high-performance, practical, multifunctional printed Chinese character recognition system supporting simplified and traditional characters and multiple fonts, which marked great progress in printed Chinese character recognition. In 1994, a high-performance Chinese-English mixed printed text recognition system was evaluated by experts as "the first Chinese-English mixed printed character recognition system at home and abroad, which is generally at the international leading level". In the middle and late 1990s, the Department of Electronic Engineering of Tsinghua University proposed and carried out comprehensive research on Chinese character recognition, with important results in printed character recognition, online handwritten Chinese character recognition, offline handwritten Chinese character recognition and offline handwritten digit and symbol recognition. The representative result is the TH-OCR 97 integrated Chinese character recognition system, which can recognize and input multilingual (Chinese, English and Japanese) printed text, online handwritten Chinese characters, offline handwritten Chinese characters and handwritten digits.
In recent years, besides Tsinghua Wentong TH-OCR, other OCR software packages with their own characteristics, such as Shangshu SH-OCR, have appeared one after another, and the Chinese OCR market has expanded steadily, with users all over the world.

It can be said that recognition of printed text has reached a very high level. OCR products have developed from early systems that could only recognize designated printed digits, English letters and some symbols into powerful tools for rapid computer input, which can automatically analyze page layout, recognize tables, and handle mixed languages, multiple fonts, multiple font sizes and mixed horizontal and vertical text. The recognition rate for printed Chinese characters is above 98%, and even for poorly printed text it is above 95%. These products can recognize simplified and traditional characters in many typefaces, such as Song, Hei (bold), Kai (regular script) and Fangsong (imitation Song), including text that mixes typefaces and font sizes. The recognition rate for handwritten Chinese characters is above 70%. In particular, after more than ten years of effort, Chinese character OCR technology in China has overcome the difficulties of a late start and a huge character set, and recognition speed (the number of characters processed per unit time, from feature extraction to output of the recognition result) can exceed 70 characters per second. Because printed Chinese character recognition is now mature, OCR products are widely used in news, printing, publishing, libraries, office automation and other industries.

Professional OCR products are mostly oriented to specific industries, that is, they are suitable for departments that need to process a lot of form information input every day, such as postal services, taxation, customs, statistics and so on. This professional OCR system for a specific industry has a relatively fixed format and a relatively small character set, and is often used in combination with special input devices, so it has the characteristics of high speed and high efficiency, such as an automatic mail sorting system.

Handwritten manuscript recognition products only appeared in 1996 and 1997, and were offered as additional functions of printed text recognition products. Because people's writing habits differ so much, recognizing unconstrained handwriting is quite difficult. Handwritten OCR technology has therefore mainly been applied to online handwriting recognition, in which the computer recognizes characters as the person writes them. This is a real-time recognition method.

II. The basic principle of OCR

Simply put, the basic principle of OCR is to input an image of the manuscript into the computer through the scanner; the computer then extracts the image of each character and converts it into a character code. In detail, the scanner converts the optical signal from the manuscript into an electrical signal through a CCD, and an A/D converter turns that into a digital signal that is transmitted to the computer. The computer receives the digital image of the manuscript, whose characters may be printed or handwritten, and then recognizes the characters in the image. For printed text, the document is first converted by optical means into a raw black-and-white dot-matrix image file, and recognition software then converts the characters in the image into text format for further processing by word processing software. Of these steps, character recognition is the key technology of OCR.

1. Two methods of OCR recognition

Like all other data, the image information a scanner captures is stored in the computer using two digits, 0 and 1: everything is just a series of dots, or sample points, recorded as 0s and 1s. An OCR program recognizes the character information on a page mainly through two methods: pattern matching (also called cell pattern matching) and feature extraction.

Pattern matching loosely compares each character in the file against bitmaps of standard typefaces and font sizes. If the application has a large database of stored characters, it selects the stored character that matches best. The software must use some processing techniques to find the most similar match, usually by repeatedly trying different versions of the same character. Some software can scan a page of text and identify each character so as to define a new font. Other software uses its own recognition techniques to recognize as many characters on the page as possible, and then lets the user manually select or directly type in the unrecognized ones.
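As a rough illustration of pattern matching, the sketch below compares a small binary glyph against stored bitmaps and picks the one with the fewest differing pixels. The 5x5 templates, the names TEMPLATES, mismatch and match_character, and the pixel-count distance are all assumptions made for this example; real OCR software uses far larger template sets and more tolerant comparisons.

```python
# Minimal sketch of bitmap pattern matching.
# The 5x5 templates below are hypothetical; '#' is a black pixel, '.' is white.
TEMPLATES = {
    "I": ["..#..",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
    "L": ["#....",
          "#....",
          "#....",
          "#....",
          "#####"],
    "T": ["#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
}


def mismatch(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def match_character(glyph, templates=TEMPLATES):
    """Return (label, mismatch_count) of the template closest to the glyph."""
    scored = ((label, mismatch(glyph, tpl)) for label, tpl in templates.items())
    return min(scored, key=lambda pair: pair[1])


# A scanned "T" with one stray noise pixel still matches the "T" template.
noisy_t = ["#####",
           "..#..",
           "..#.#",
           "..#..",
           "..#.."]
print(match_character(noisy_t))  # ('T', 1)
```

Because the comparison is pixel by pixel, this approach only works when the scanned glyph has the same size and typeface as a stored bitmap, which is why the text describes trying several versions of each character.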

Feature extraction decomposes each character into many different features, such as diagonal lines, horizontal lines and curves. These features are then matched against those of known characters. As a simple example, if an application detects two horizontal lines, it will "think" that the character may be "二" (two). The advantage of feature extraction is that it can recognize a variety of fonts. Recognition of Chinese calligraphic typefaces, for example, is done by feature extraction.
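The "two horizontal lines" example can be sketched as follows. This is only an illustration under simple assumptions: a horizontal stroke is taken to be a band of consecutive rows containing a run of black pixels at least half the glyph width, and the labels ONE, TWO and THREE stand in for the characters 一, 二 and 三. The function names and the 0.5 threshold are made up for the example.

```python
# Minimal sketch of feature extraction: count horizontal strokes in a glyph.
# '#' is a black pixel, '.' is white. Threshold and labels are hypothetical.

def count_horizontal_strokes(glyph, min_fraction=0.5):
    """Count bands of consecutive rows that contain a long horizontal run."""
    width = len(glyph[0])
    strokes = 0
    in_stroke = False
    for row in glyph:
        # Longest unbroken run of black pixels in this row.
        longest = max(len(run) for run in row.split("."))
        is_stroke_row = longest >= min_fraction * width
        if is_stroke_row and not in_stroke:
            strokes += 1  # a new stroke band starts here
        in_stroke = is_stroke_row
    return strokes


# Stand-ins for the characters 一, 二 and 三.
GUESS = {1: "ONE", 2: "TWO", 3: "THREE"}


def guess_character(glyph):
    """Guess a character from its horizontal stroke count, or None."""
    return GUESS.get(count_horizontal_strokes(glyph))


# The same character in a thin and in a bold "typeface".
thin = [".......",
        ".#####.",
        ".......",
        ".......",
        "#######",
        "......."]
bold = [".####..",
        ".####..",
        ".......",
        "#######",
        "#######"]
print(guess_character(thin), guess_character(bold))  # TWO TWO
```

Both glyphs yield the same feature (two horizontal strokes) even though their bitmaps differ, which is the font independence the paragraph above attributes to feature extraction.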

Most OCR applications now include intelligent grammar checking, which further improves the recognition rate. It corrects spelling and grammar mainly by checking context. During recognition, the OCR application performs many contextual consistency checks, testing each string of characters against the phrases and fixed word orders stored in the program. More advanced software will automatically replace a wrong word with the word it "thinks" is correct, so that the sentence makes sense.
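A very reduced form of this kind of checking is sketched below: each recognized word is compared with a word list, and a word that is not in the list is replaced by its closest entry. The word list LEXICON, the function names and the 0.8 similarity cutoff are assumptions for the example; difflib.get_close_matches is from the Python standard library. Real products also check grammar and word order, which this sketch does not attempt.

```python
# Minimal sketch of dictionary-based correction of OCR output.
import difflib

# Hypothetical word list standing in for the phrases stored in the program.
LEXICON = ["scanner", "character", "recognition", "software", "optical"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word, or the word unchanged if none is close."""
    if word in lexicon:
        return word
    close = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return close[0] if close else word


def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in text.split())


# Typical OCR confusions: the digit 1 for the letter l, 0 for o, c for e.
print(correct_text("optica1 charactcr rec0gnition software"))
# optical character recognition software
```

Words with no close entry in the list, such as proper nouns, are left unchanged rather than forced into a wrong replacement.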

2. Several steps of character recognition

Character recognition includes the following steps: image input, preprocessing, single-character recognition and post-processing.

(1) Image input

Image input means feeding a document into the computer through an input device, that is, digitizing the manuscript. The most widely used device today is the scanner. The scanning quality of the document image is a prerequisite for correct recognition by OCR software. Choosing the scan resolution and related parameters correctly is the key to keeping characters clear and their features intact. In addition, the document should be placed as straight as possible, so that the tilt angle detected during preprocessing is small and the text image is only slightly deformed by tilt correction. These simple measures improve the recognition accuracy of the system. Conversely, with improper scanning settings, strokes may break or stick together. Too many broken strokes may split a character image in two, and both broken and merged strokes lose features. When the features are then compared with the feature database, the feature distance grows and the recognition error rate rises.

(2) Preprocessing

After a printed document has been scanned into an image, each character image must be separated out and handed to the recognition module. This process is called image preprocessing. Preprocessing refers to the preparatory work before character recognition, including cleaning the image by removing obvious noise (interference) from the original. Its main tasks are to measure the tilt angle of the document, analyze the page layout, confirm the layout of the selected text regions, split horizontal or vertical text into lines, separate the individual character images in each line, and distinguish punctuation marks. The work at this stage is very important, because its quality directly affects the accuracy of character recognition.

Layout analysis is the overall analysis of the text image. It separates out all the text blocks in the document and distinguishes text paragraphs, their reading order, and the regions occupied by images and tables. The boundary of each text block (the coordinates of its start and end points in the image), its attributes (horizontal or vertical layout) and the connections between text blocks are passed to the recognition module as data structures for automatic recognition. Text regions are recognized directly, table regions are analyzed and recognized specially, and image regions are compressed or simply stored. Line and character segmentation is the process of first cutting the page image into lines and then separating individual characters from each line image.
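Line and character segmentation is often illustrated with projection profiles: count the black pixels in each row to find the text lines, then count the black pixels in each column of a line to find the characters. The sketch below assumes a clean, deskewed binary image of horizontal text; the function names and the tiny test page are made up for the example. A real system would also have to merge the separate parts of characters that contain internal gaps.

```python
# Minimal sketch of line and character segmentation by projection profiles.
# '#' is a black pixel, '.' is white; the image must already be deskewed.

def split_runs(profile):
    """Return (start, end) pairs for each run of consecutive non-zero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment(image):
    """Cut the image into lines, then cut each line into character boxes.

    Returns one list per text line of (left, top, right, bottom) boxes,
    with right and bottom exclusive.
    """
    row_profile = [row.count("#") for row in image]
    lines = []
    for top, bottom in split_runs(row_profile):
        line = image[top:bottom]
        col_profile = [sum(row[x] == "#" for row in line)
                       for x in range(len(line[0]))]
        lines.append([(left, top, right, bottom)
                      for left, right in split_runs(col_profile)])
    return lines


page = ["##.##.#",
        "##.##.#",
        ".......",
        "#..###.",
        "#..###."]
for boxes in segment(page):
    print(boxes)
# [(0, 0, 2, 2), (3, 0, 5, 2), (6, 0, 7, 2)]
# [(0, 3, 1, 5), (3, 3, 6, 5)]
```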

(3) Character recognition

Single-character recognition is the core technology of OCR. It is the key to letting the computer "read": the recognition step converts the character images detected in the scanned text into standard character codes. The human brain can read because it has stored various characteristics of characters, such as their structure and strokes. For a computer to recognize characters, such information must likewise be stored in the computer first. What information to store and how to obtain it, however, is a very complicated matter, and a very high recognition rate is needed to be useful. The usual practice is to analyze characters by their strokes, feature points, projection information and the regional distribution of their pixels.

Several thousand Chinese characters are in common use, and recognition relies on feature comparison. By comparing against the recognition feature database, the character with the most similar features is found and its standard code is output as the recognition result. Comparison is a basic way in which people come to know things, and Chinese character recognition likewise works by comparison: it finds the similarities and differences between characters, and weighs quantitative against qualitative evidence. Because the Chinese character set is large, multi-level classification with multi-feature, omnidirectional dynamic matching is generally used to find the set of similar characters, which ensures a high classification rate, strong adaptability and good stability. Fine classification then focuses on similarity matching, weighting, structural discrimination, quantitative and qualitative analysis and the relationships between adjacent characters, before the final decision is made. Chinese character recognition is essentially an application of comparative or cognitive science in artificial intelligence, and its key component is the recognition feature database. Only with such a feature database can the computer recognize characters.
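The two-stage idea (coarse classification to find a set of similar characters, then fine classification by weighted similarity) can be sketched as below. Everything in the sketch is hypothetical: the four-entry feature database, the use of stroke count as the coarse feature, the three-element feature vectors and the weights are invented for illustration and are not taken from any real OCR product.

```python
# Minimal sketch of two-level classification against a feature database.
# All entries, features and weights are invented for illustration.

FEATURE_DB = {
    # label: (stroke_count, feature_vector)
    "A": (3, (0.9, 0.1, 0.5)),
    "B": (3, (0.2, 0.8, 0.5)),
    "C": (8, (0.9, 0.1, 0.4)),
    "D": (9, (0.3, 0.7, 0.6)),
}
WEIGHTS = (1.0, 1.0, 0.5)  # hypothetical per-feature weights


def coarse_candidates(stroke_count, tolerance=1):
    """Coarse stage: keep entries whose stroke count is close enough."""
    return [label for label, (count, _) in FEATURE_DB.items()
            if abs(count - stroke_count) <= tolerance]


def weighted_distance(a, b):
    """Fine stage metric: weighted squared distance between feature vectors."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(WEIGHTS, a, b))


def classify(stroke_count, features):
    """Return the label of the closest entry among the coarse candidates."""
    candidates = coarse_candidates(stroke_count) or list(FEATURE_DB)
    return min(candidates,
               key=lambda label: weighted_distance(features, FEATURE_DB[label][1]))


print(coarse_candidates(8))               # ['C', 'D']
print(classify(8, (0.85, 0.15, 0.5)))     # C
```

In this toy database the entry "A" is slightly closer to the query than "C" in the fine metric alone, but the coarse stage has already excluded it, which shows how the first level narrows the comparison before the detailed matching runs.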

A document image may contain not only text and pictures but also tables. To digitize a recognized table, the table regions need special treatment during layout analysis, including extracting the structure of the table lines, ordering the text fields in the table, recognizing the table lines and text fields, and generating different file formats from the digitized table lines. Because tables in documents are arbitrary and varied, may be closed or open, and may contain diagonal lines in particular, table analysis is difficult.

(4) Post-processing

Post-processing matches the recognized characters, or the multiple candidate results for each character, against their context in the form of phrases. That is, the recognition results are segmented into words and compared with the phrases in a lexicon, which raises the recognition rate of the system and lowers the misrecognition rate.
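One way to picture this step: the recognizer returns several scored candidates for each character position, and post-processing picks the combination that forms a phrase found in the lexicon. The sketch below uses Latin letters in place of Chinese characters; the phrase set PHRASES, the scores and the function name best_phrase are assumptions for the example, and the exhaustive search with itertools.product is only practical for short phrases.

```python
# Minimal sketch of phrase-based post-processing over candidate lists.
from itertools import product

# Hypothetical lexicon of known phrases.
PHRASES = {"scanner", "text", "word"}


def best_phrase(candidates, phrases=PHRASES):
    """Pick the highest-scoring candidate combination that is a known phrase.

    candidates holds one list of (character, score) pairs per position.
    If no combination is in the lexicon, the top candidate at each
    position is returned unchanged.
    """
    top1 = "".join(max(pos, key=lambda c: c[1])[0] for pos in candidates)
    best, best_score = top1, None
    for combo in product(*candidates):
        phrase = "".join(ch for ch, _ in combo)
        if phrase in phrases:
            score = sum(s for _, s in combo)
            if best_score is None or score > best_score:
                best, best_score = phrase, score
    return best


candidates = [
    [("t", 0.9), ("f", 0.1)],
    [("c", 0.5), ("e", 0.4)],   # the recognizer slightly prefers the wrong "c"
    [("x", 0.8), ("k", 0.2)],
    [("t", 0.7), ("l", 0.3)],
]
print(best_phrase(candidates))  # text
```

Taking only the top candidate at each position would give "tcxt"; the lexicon lookup recovers "text" from the second-ranked candidate.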

Chinese character recognition is the most difficult problem in the field of character recognition. It is a comprehensive technology that draws on pattern recognition, image processing, digital signal processing, natural language understanding, artificial intelligence, fuzzy mathematics, information theory, computer science, Chinese information processing and other disciplines. In recent years, the correct recognition rate of printed Chinese character recognition systems has exceeded 95%. To further improve the overall recognition rate, image scanning, image preprocessing and post-recognition processing have also been studied in depth, and great progress has been made, effectively improving the overall performance of printed Chinese character recognition systems. Tsinghua University has made remarkable achievements in this field and has become one of the most authoritative institutions in the world. At present, all Ziguang scanners are supplied with Tsinghua OCR Millennium Edition software, which has reached a high level in recognition rate, table recognition and even recognition of neat handwriting.

III. OCR text recognition tips

In recent years, with the spread of scanners, OCR technology has developed rapidly, and scanning and recognition software has become steadily more capable and more intelligent. However, if you want correct scanning results quickly and efficient text input, you must study the relevant knowledge carefully and combine it with practical experience to work out a complete approach of your own. Sometimes the recognition rate we get is very low, far from the 95% or more that the software claims. Please do not blame the hardware or software first. The real cause is usually that we have not yet mastered the techniques of scanning and OCR recognition.

The following are some methods and techniques commonly used in character recognition operations.

1. The resolution setting is an important prerequisite for character recognition. Generally speaking, the more image information the scanner provides, the more easily the recognition software can produce a result. However, a higher scan resolution does not always mean higher recognition accuracy. A resolution of 300 dpi or 400 dpi is suitable for most documents. Note that when scanning originals for recognition, the scan resolution should not exceed the optical resolution of the scanner; otherwise the loss outweighs the gain. The following typical settings are for reference only.

(1) For size 1, 2 and 3 type, 200 dpi is recommended.

(2) For small size 4 and size 5 type, 300 dpi is recommended.

(3) For small size 5 and size 6 type, 400 dpi is recommended.

(4) For size 7 and 8 type, 600 dpi is recommended.
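The settings above can be written as a small lookup that also respects the rule about optical resolution. The sketch assumes that the numbers in the list are Chinese type sizes (where a larger number means smaller type); the dictionary keys, the name choose_scan_dpi and the example scanner resolutions are made up for illustration.

```python
# Minimal sketch: pick a scan resolution from the type size, capped at the
# scanner's optical resolution. Keys are assumed Chinese type sizes.
RECOMMENDED_DPI = {
    "1": 200, "2": 200, "3": 200,
    "small 4": 300, "5": 300,
    "small 5": 400, "6": 400,
    "7": 600, "8": 600,
}


def choose_scan_dpi(type_size, optical_dpi):
    """Return the recommended dpi, never exceeding the optical resolution."""
    return min(RECOMMENDED_DPI[type_size], optical_dpi)


print(choose_scan_dpi("small 5", 600))  # 400
print(choose_scan_dpi("8", 300))        # 300 (limited by the scanner)
```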

2. Adjust the brightness and contrast properly when scanning so that the scanned document shows clean black text on a white background. This is key to the recognition rate. Set the brightness and contrast so that the fine strokes of the characters in the scanned image are visible and unbroken. Before recognition, check the quality of the characters in the scanned image. If the image contains black dots or blotches, or the strokes are thick, dark and run together, the brightness value is too low; increase it and try again. If the strokes are uneven or broken, or the outlines of the characters are badly incomplete, the brightness value is too high; reduce it and try again.

3. Choose your software. Choosing good OCR software that suits you is the basis of good character recognition. Generally, avoid the OEM software bundled with the scanner: OEM OCR software has few functions and gives poor results, and some versions cannot even recognize Chinese. After comparing products, I find Ziguang OCR2003 Professional Edition and the Shangshu OCR6.0 automatic text recognition and input system stronger in recognition ability and features. Also choose an image-editing program. Doesn't OCR software have its own scanning interface, so why look for image software? First, OCR software does not support every scanner. Second, and more importantly, images scanned through the image software's scanning interface are easy to process. Photoshop is the usual choice.

4. If the text is formatted, for example with bold, italics or first-line indents, some OCR software will not recognize the formatting, and it will be lost or turned into garbled characters. If you must scan formatted text, confirm in advance that your recognition software supports formatted text. You can also turn off style recognition, so that the software concentrates on finding the correct characters and ignores typeface and formatting.