OCR

Optical character recognition (OCR) is the process of taking image files (e.g., scanned from documents) and electronically converting them into searchable text.

Automatic OCR on PDFs During Upload

Documents go through automatic OCR processing if the Perform OCR on PDFs checkbox is checked, which is the default setting. If it is unchecked, OCR is not performed. Documents that have already been OCRed remain OCRed; the Lexbe eDiscovery Platform (LEP) does not remove OCR. Generally, leaving Perform OCR on PDFs checked is preferable. The original and the normalized PDF may have slightly different OCR results. There is no charge for the additional OCR.

OCR of Single Page TIFFs

When OCRing supported image files (e.g., TIFF images), LEP creates a corresponding text file containing the OCR text. For example, if a TIFF image is named SMITH 00000100.TIF, the corresponding text file containing the results of OCR will be named SMITH 00000100.TXT. Single page TIFFs are not OCRed in LEP.
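The naming convention in the example above can be sketched in a few lines of Python. This is only an illustration, not LEP's own code: the function name ocr_text_name is hypothetical, and matching the case of the original extension is an assumption based on the single uppercase example given.

```python
from pathlib import PurePath


def ocr_text_name(image_name: str) -> str:
    """Return the expected OCR text file name for an image file name."""
    path = PurePath(image_name)
    # Assumption: the text file keeps the case of the image extension
    # (.TIF -> .TXT, .tif -> .txt). The source only shows the uppercase case.
    suffix = ".TXT" if path.suffix.isupper() else ".txt"
    return str(path.with_suffix(suffix))


print(ocr_text_name("SMITH 00000100.TIF"))
```

A helper like this can be handy when checking that every image in a load file has a matching text file.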

OCR of Image PDFs

LEP uses OCR technology to convert image-only PDFs into text-under-image PDFs. The original document image (or scan) is saved and the text is added to the file in a hidden layer, so the document can be searched and its text can be copied and pasted. The appearance of the document remains unchanged. LEP OCRs all image-based PDFs (discarding prior OCR text and replacing it with new OCR).

OCR Settings

LEP settings for OCR are as follows: deskew, autorotate, and despeckle. These are general settings and not always 100% successful (as is the case for any automatic OCR software). In particular, the autorotate function sometimes fails to properly rotate a document. OCR is set to recognize English only (unless International OCR is ordered as part of an eDiscovery job).

Identification of Documents to OCR or re-OCR

LEP will OCR all image-based PDFs (discarding prior OCR text and replacing), unless No OCR is selected at the time of upload.
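The rule above, combined with the upload checkbox described earlier, can be written out as a small decision function. This is a sketch under stated assumptions, not LEP's actual logic: the UploadedPdf fields and the ocr_action function are hypothetical names, and the handling of PDFs that are not image-based is an assumption, since this page does not describe it.

```python
from dataclasses import dataclass


@dataclass
class UploadedPdf:
    name: str
    is_image_based: bool  # hypothetical flag: pages are scans, not native text
    has_prior_ocr: bool   # hypothetical flag: a hidden OCR text layer exists


def ocr_action(doc: UploadedPdf, perform_ocr: bool) -> str:
    """Describe what happens to a PDF at upload, per the rules on this page."""
    if not perform_ocr:
        # OCR is skipped, and any existing OCR text is left in place.
        return "keep existing OCR" if doc.has_prior_ocr else "no OCR"
    if doc.is_image_based:
        # Prior OCR text is discarded and replaced with new OCR.
        return "replace prior OCR" if doc.has_prior_ocr else "OCR"
    # Assumption: PDFs that are not image-based are left untouched.
    return "no OCR"


scan = UploadedPdf("scan.pdf", is_image_based=True, has_prior_ocr=True)
print(ocr_action(scan, perform_ocr=True))
print(ocr_action(scan, perform_ocr=False))
```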

OCR Limitations

OCR is an automated service. There is no manual review as part of this service.

OCR is a useful tool, but it is not perfect. The best OCR results come from clearly readable text in high-quality scans. OCR quality degrades as copy quality degrades. OCR quality can also degrade, or OCR can fail entirely, with skewed or rotated pages, unusual fonts, dirty or speckled backgrounds, low-resolution scanning, etc.

Reasons for incomplete or failed OCR include the following: file corruption (corrupt pages may prevent OCR from running successfully), file print security (producers of PDFs often place print or content extraction restrictions on PDFs), and file open passwords (which prevent PDFs from being OCRed).
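Files can be screened for these conditions before upload. The following is a rough, standard-library-only heuristic and not part of LEP: the function name preflight_pdf is hypothetical, the checks look only for standard PDF markers (the %PDF- header, the %%EOF trailer marker, and an /Encrypt entry), and they can give false positives or miss damage inside the file. An /Encrypt entry is present for both open passwords and permission restrictions, so the sketch does not distinguish between the two.

```python
from typing import List


def preflight_pdf(data: bytes) -> List[str]:
    """Return likely reasons OCR could fail for this PDF (heuristic only)."""
    problems = []
    # A PDF normally begins with a "%PDF-" header near the start of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing PDF header: file may be corrupt or not a PDF")
    # The end-of-file marker normally appears near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker: file may be truncated or corrupt")
    # An /Encrypt entry signals an open password or permission restrictions.
    if b"/Encrypt" in data:
        problems.append("encrypted: password or print/extract restrictions")
    return problems


sample = b"%PDF-1.4\ntrailer\n<< /Encrypt 5 0 R >>\n%%EOF\n"
for problem in preflight_pdf(sample):
    print(problem)
```

To check a file on disk, read it in binary mode and pass the bytes to the function, for example preflight_pdf(open(path, "rb").read()).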

OCR almost always produces some errors, and sometimes produces many. OCR is best thought of as an adjunct to actual review of the file itself in native, TIFF, or PDF format. Typical errors include, but are not limited to: missing pages or files, skipping password-protected files, skipping files with print, extract, or other limitations on the file permissions, missing text in corrupted or unrecognized formats, and failing to recognize rotated or skewed pages.

OCR of PDFs works on flattened image PDFs only and not on complex PDFs (i.e., PDFs with embedded attachments or Portfolio PDFs).

LEP attempts to generate searchable text from handwritten files. However, OCR works best with printed (typed) fonts, and OCR of handwriting is generally not very accurate. See Working with Cursive Handwriting Files for more information.
