In fact, paying attention to the following points can significantly improve the OCR recognition rate.
First, choose an appropriate scanning resolution. A resolution that is too low often lowers the OCR recognition rate, while one that is too high makes the image file too large and slows recognition. In practice, the operator can judge whether a resolution is acceptable by counting the red-marked errors in the text produced by OCR (for example, fewer than 3), and decide on that basis whether to scan at this resolution.
Second, scan in black and white binary mode as much as possible. OCR usually accepts gray or black and white binary scans, but not color scans. If the print quality of the manuscript is good, gray mode can be used; otherwise black and white binary mode should be used. The black and white threshold can be adjusted manually during scanning. If the text outlines in the binary image are incomplete, raise the threshold appropriately. If the outlines are too thick, the image carries redundant information and the threshold can be lowered appropriately. A binary image adjusted in this way gives a better OCR recognition result.
Third, pay attention to tilt correction of the characters. OCR tolerates a slightly tilted document, but excessive tilt lowers the recognition rate. To correct it, click the tilt correction button in the scanning software, and the recognition software will automatically correct the image before OCR.
Fourth, preprocess the manuscript before recognition. Remove clutter and pictures from the manuscript: clutter interferes with text recognition, and pictures cannot be recognized and disturb OCR text segmentation.
For manuscripts laid out in columns, it is recommended to set the column areas manually, that is, to select the text to be recognized with multiple boxes and then run OCR.
Fifth, adopt an appropriate recognition method. Manuscripts that mix Simplified and Traditional Chinese, or Chinese and English, often have a low recognition rate. If the different scripts are distributed in blocks, image processing software can be used to gather blocks of the same script into separate files, and each file can then be recognized with OCR for that script.
(5) Scanning registration: Carefully fill in the handover registration form for the paper document digital conversion process, register the number of scanned pages, and check whether the number of pages actually scanned for each document matches the number of pages recorded when filing. Any inconsistency should be noted together with its specific cause and how it was handled.
3. Image processing
After scanning is completed, the images must be technically processed as required to correct deviations between the scanned file and the original, making the scanned file clearer and more standardized. Image processing generally includes the following:
(1) Image data quality inspection: Check the skew, clarity and distortion of each image. Images that do not meet the quality requirements should be reprocessed. If a scanned image is incomplete or cannot be clearly read because of improper operation, the page should be scanned again; if pages were missed, they should be scanned promptly and the images inserted in the correct position; if the order of the scanned images does not match the original file, it should be adjusted promptly. Fill in the relevant forms carefully and record the quality inspection results and processing opinions.
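The black and white threshold adjustment described earlier for OCR scanning can be illustrated with a small sketch. It assumes a common convention that the text does not spell out: gray values run from 0 (black) to 255 (white), and a pixel darker than the threshold becomes ink. The function names and the sample patch are hypothetical. Under this assumption, raising the threshold turns more pixels into ink and thickens strokes, while lowering it thins them.

```python
# Minimal sketch of black-and-white thresholding.
# Assumed convention: 0 = black, 255 = white; pixels darker than the
# threshold are classified as ink.

def binarize(gray, threshold):
    """Return a binary image: 1 marks ink (black), 0 marks background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def ink_ratio(binary):
    """Fraction of pixels classified as ink."""
    total = sum(len(row) for row in binary)
    return sum(sum(row) for row in binary) / total


# Hypothetical 3x5 grayscale patch: a vertical stroke with soft edges.
patch = [
    [230, 150, 60, 140, 235],
    [228, 120, 40, 130, 240],
    [225, 160, 70, 150, 238],
]

# A higher threshold captures the soft edges and thickens the stroke.
for threshold in (100, 145, 200):
    print(threshold, round(ink_ratio(binarize(patch, threshold)), 2))
```

In this patch the stroke grows from one column of ink at the lowest threshold to three columns at the highest, which mirrors the advice to raise the threshold for broken outlines and lower it for overly thick ones.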
(2) Rectification: Skewed images should be corrected so that no skew is visually perceptible. Images with the wrong orientation should be rotated to match normal reading habits.
(3) Decontamination: Impurities that affect image quality, such as black dots, black lines, black frames and black edges, should be removed. Care should be taken during processing not to destroy the original information of the file.
(4) Image stitching: Multiple images produced by scanning a large-format document in sections should be stitched and merged into one complete image to ensure the integrity of the document's digital image.
(5) Trimming: Images scanned in color mode should be trimmed to remove excess white edges, which effectively reduces the image file size and saves storage space.
The rectification, decontamination, trimming and other steps above can be carried out manually by visual inspection. Specially designed software can also be configured in advance so that the computer processes the images automatically. Computer processing is efficient but less flexible than manual processing. For example, if the stain-size setting is poorly chosen, the computer will automatically remove some punctuation marks as stains. The processing of scanned images therefore requires a combination of manual and automatic processing.
4. Image storage
(1) Storage format: Image files scanned in black and white binary mode are usually stored in TIFF (G4) format. Image files scanned in gray or color mode are usually stored in JPEG format. The compression ratio should be chosen to minimize storage capacity while keeping the scanned images legible. Scanned images provided for network query can also be stored in CEB, PDF or other formats.
(2) Naming of image files: Digital archive resources should be named with file numbers or unique identifiers.
If digital archive resources are named with file numbers and arranged by volume, the file number should be compiled according to the "Rules for Preparation of File Numbers" (DA/T 13-1994), and it is recommended to add the archive category code as a sub-item of the category number. If they are arranged by piece, the file number can adopt the structure "fonds number-archive category code year-storage period-organization (issue) code-piece number-partition number".
5. Directory database construction
(1) Data format selection: A common data format should be selected for building the directory database, and the selected format should support data exchange through XML documents, directly or indirectly. The data can be entered through a dedicated file management system or scanning processing management software, or entered into a specially designed Excel file directory table and then imported into the file management system.
(2) Archive description: Build the archive catalog database and enter the catalog data according to the requirements of the "Archives Description Rules" (DA/T 18-1999).
(3) Quality inspection of catalog data: To ensure data accuracy, either "single-machine entry-manual proofreading" or "double-machine entry-computer automatic proofreading" can be used. Whether proofreading is manual or automatic, it is necessary to check that the description items are complete and that the description content is standardized and accurate. Unqualified data should be corrected or re-entered.
6. Data hooking
(1) Summary hooking: After passing quality inspection, the catalog database and image files produced during the digital conversion of archives are loaded promptly onto the data server over the network for consolidation.
Manual hooking of directory databases and image files is slow and error-prone and should be avoided; automatic batch hooking by computer should be used wherever possible. As long as the scanned digital files are named after the file numbers of the paper documents, a hooking program or corresponding software can automatically find the relevant digital images and add the corresponding electronic address information, achieving fast batch hooking.
(2) Data association: Based on the catalog database of the paper documents, the one or more images scanned from each paper document are stored as image files. When storing image files in the corresponding folders, carefully check whether the name of each image file matches the file number in the archive directory database, whether the number of pages of the image file matches the number of pages recorded in the database, and whether the total number of image files matches the number of files recorded in the database. The file name of each image file establishes a one-to-one correspondence with the file number in the archive directory database, which makes automatic batch hooking of the directory database and image files possible.
(3) Handover registration: Carefully fill in the handover registration form for the digital conversion process of paper documents, record the number of pages after data association, and check whether the number of pages associated with each file matches the number of pages recorded during document sorting and scanning. Any inconsistency should be noted together with its specific cause and how it was handled.
7. Data acceptance
Check the overall quality of the digitized data by sampling, including the catalog database, the image files, and the data hooking.
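The automatic batch hooking described in section 6 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the behavior of any particular archive software: it assumes each image is named "<file number>_<page>.<extension>", that the catalog is a dictionary keyed by file number with a registered page count, and the names hook_images, page_mismatches and the sample records are all hypothetical.

```python
from pathlib import PurePath


def hook_images(catalog, image_paths):
    """Attach image paths to catalog records whose file number matches.

    Assumes image names of the form "<file number>_<page>.<ext>".
    Returns the linked catalog and the images that matched no record.
    """
    linked = {number: {**record, "images": []} for number, record in catalog.items()}
    unmatched = []
    for path in sorted(image_paths):
        # Split the name at the last underscore: file number, then page.
        file_number, _, _page = PurePath(path).stem.rpartition("_")
        if file_number in linked:
            linked[file_number]["images"].append(path)
        else:
            unmatched.append(path)
    return linked, unmatched


def page_mismatches(linked):
    """File numbers whose image count differs from the registered page count."""
    return [number for number, record in linked.items()
            if len(record["images"]) != record["pages"]]


# Hypothetical catalog records and scanned image names.
catalog = {
    "A-001": {"title": "Annual report", "pages": 2},
    "A-002": {"title": "Meeting minutes", "pages": 3},
}
image_paths = [
    "scans/A-001_002.tif",
    "scans/A-001_001.tif",
    "scans/A-002_001.tif",
    "scans/Z-999_001.tif",
]

linked, unmatched = hook_images(catalog, image_paths)
print(linked["A-001"]["images"])  # images hooked to record A-001
print(unmatched)                  # images with no catalog record
print(page_mismatches(linked))    # records to note on the handover form
```

Reporting unmatched images and page-count mismatches, rather than silently skipping them, corresponds to the checks listed under data association and handover registration.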
If the link between the catalog database and an image file is wrong, or if either the catalog record or the image file is incomplete, unclear, or erroneous, the sampled item is marked "unqualified". The archives of a fonds are accepted as "passed" when the pass rate of the digital conversion quality sampling inspection reaches 95% or more (inclusive), where qualification rate = number of documents passing the sampling inspection / total number of documents sampled × 100%. Carefully fill in the digital acceptance registration form for paper files. A "passed" conclusion takes effect only after it has been reviewed and signed.
8. Data backup
Complete and qualified data should be backed up promptly. To ensure data security, backup carriers should be diversified. Multiple backup sets can be made by combining online and offline methods, and attention should be paid to off-site storage. Backup data should also be checked, mainly for whether it can be opened, whether the data is complete, and whether the number of files is correct. After backup, the backup media should be labeled for easy search and management. Fill out the digital backup management registration form for paper documents.
9. Digital results management
The management of the digital results of paper archives should be strengthened to ensure their security, integrity and long-term availability.
When the digital results of paper archives are provided for online retrieval and use, they should carry an electronic mark identifying the producing unit, and a downloadable or non-downloadable data format should be used depending on the specific situation.
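As a worked illustration of the acceptance rule in section 7, the sketch below computes the qualification rate and applies the 95% threshold. It reads the denominator as the total number of sampled documents, which is an interpretation of the formula, and the function names are hypothetical. The comparison is done on integers so that a result of exactly 95% is not affected by floating-point rounding.

```python
def qualification_rate(passed, sampled):
    """Percentage of sampled documents that passed the inspection."""
    if sampled <= 0:
        raise ValueError("at least one document must be sampled")
    if not 0 <= passed <= sampled:
        raise ValueError("passed must be between 0 and sampled")
    return 100 * passed / sampled


def acceptance_result(passed, sampled, threshold_percent=95):
    """Return "passed" when the rate is at least the threshold (inclusive)."""
    qualification_rate(passed, sampled)  # reuse the input validation
    # Integer comparison: 100 * passed / sampled >= threshold_percent
    if 100 * passed >= threshold_percent * sampled:
        return "passed"
    return "not passed"


# Hypothetical sampling results: 200 documents sampled.
print(qualification_rate(190, 200), acceptance_result(190, 200))
print(qualification_rate(189, 200), acceptance_result(189, 200))
```

With 200 sampled documents, 190 passing documents meet the inclusive threshold, while 189 fall just short of it.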