Suggested Protocol on Scanning Documents for Loading
This technical note summarizes suggestions on scanning documents to PDF files for uploading to the Lexbe eDiscovery Platform (LEP).
Scanning Format. Documents should generally be scanned to multi-page PDF (one file per document). We can also accept single-page TIFFs (Group IV), and JPEG and PNG images for color documents. Single-page TIFF, JPEG, and PNG images require a load file for document unitization.
OCR. Optical Character Recognition (OCR) of scanned documents is not needed. LEP does this automatically as part of processing.
Scanning Resolution. Generally, scan at 300 dpi, which balances OCR accuracy against file size.
Color Documents and Photos. Color documents and photos (including highlighting) should be scanned to color or grayscale, depending on the specific needs of the case or project. Color and grayscale will increase file size.
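To give a rough sense of why color and grayscale increase file size, the sketch below computes the raw (uncompressed) image size of one page at each color depth. It assumes a letter-size page (8.5 x 11 inches) at 300 dpi; the function name is illustrative only. Actual PDF sizes will be much smaller and will vary with the compression the scanner applies, so treat the figures as relative, not as expected file sizes.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Raw raster size of one scanned page, before any compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: letter size at 300 dpi (2550 x 3300 pixels).
for label, bits in (("black and white", 1), ("grayscale", 8), ("color", 24)):
    size = uncompressed_bytes(8.5, 11, 300, bits)
    print(f"{label:>15}: {size / 1_000_000:.1f} MB per page before compression")
```

Before compression, a grayscale page carries 8 times the data of a black-and-white page, and a color page 24 times.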
Other Scan Settings. Generally, auto-rotation and deskew should be enabled. Other scan settings will depend on the quality of the originals.
Document Naming. Each document should have a unique, sequential file name. File names can be based on the scan date and time, if formatted so that they sort sequentially (e.g., YYYY-MM-DD hh-mm-ss, using 24-hour "military" time). Documents should be scanned in their original order to preserve associations between attachments, folder covers, etc.
Scan Quality. Generally, scan quality should be set to high so that the scanned images are as legible as possible.
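As a minimal sketch of the naming scheme described above, the following builds a file name from a scan timestamp. The function name and the optional custodian suffix are illustrative assumptions, not part of any LEP tooling. The check at the end shows why zero-padded, 24-hour timestamps matter: alphabetical order then matches scan order.

```python
from datetime import datetime, timedelta


def scan_file_name(scanned_at: datetime, custodian: str = "") -> str:
    """Build a name such as '2015-06-01 14-12-31 Smith, John.PDF'."""
    # %H is the zero-padded 24-hour clock, so names sort chronologically.
    stamp = scanned_at.strftime("%Y-%m-%d %H-%M-%S")
    suffix = f" {custodian}" if custodian else ""
    return f"{stamp}{suffix}.PDF"


# Three scans on the same day: morning, early afternoon, and evening.
start = datetime(2015, 6, 1, 9, 59, 58)
names = [scan_file_name(start + timedelta(hours=h), "Smith, John") for h in (0, 4, 12)]

# Sorting the names alphabetically reproduces the order of scanning.
assert names == sorted(names)
```

With a 12-hour clock, a 1:59 PM scan would be named 01-59-58 and would sort ahead of a 9:59 AM scan, breaking the sequence.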
Tabs, Folder Covers, etc. Tabs, folder covers, and similar items should be scanned.
Post-It Notes, Clipped Notes, Stapled Notes, etc. Pages with post-it notes, paper-clipped notes, stapled notes, etc., should be scanned twice, once with the attached item and once without.
Document Unitization. Documents should be scanned into separate files using Logical Document Determination (LDD), where document boundaries are determined by reviewing individual pages. An alternative is Physical Determination, where document boundaries are determined by binding elements such as clips or staples. Each PDF file should be one document. Multiple documents should not be joined into one file, and a document should not be split into single-page PDFs.
Attachments. If LDD is used, attachments need to be linked to the parent document using a load file (described below) or, alternatively, scanned and named sequentially and placed in a specific sub-folder.
Over-sized Documents. Blueprints and other over-sized documents should often be scanned at full size to preserve fidelity and detail. Reduced, paginated letter-size (8.5” x 11”) versions can also be made to assist in review.
Custodian Identification. A specific document custodian (e.g., Smith, John) or other document source (e.g., Kalamazoo Warehouse, Archives Room) should be captured and included either in the folder structure or in a load file.
Load File or Folder Structure. If data can be exported to a load file, the format should be DAT/OPT, and relevant metadata should be included in the load file. Here is a link to load file specifications for single-page TIFFs:
TIFF Image (DAT) Load File Specifications
and for Native Loads (including PDFs):
Nested Folder Structure. If unitization is to be handled with nested folders, the structure should clearly indicate Custodian and document groupings for covers, attachments, etc. For example:
2015-06-01 14-12-31 Smith, John.PDF
2015-06-01 14-13-03 Smith, John.PDF
2015-06-01 14-14-12 Smith, John (folder)
2015-06-01 14-13-46 Smith, John.PDF
2015-06-01 14-14-02 Smith, John.PDF
2015-06-01 14-15-10 Smith, John.PDF
2015-06-01 14-16-22 Smith, John.PDF
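The naming convention in the example above can be checked mechanically before upload. The sketch below is one possible approach, not an LEP requirement: the pattern, function names, and problem messages are assumptions made for illustration. It parses each file or folder name into a scan time and custodian, and reports names that do not fit the pattern or that repeat.

```python
import re
from datetime import datetime

# Assumed pattern: 'YYYY-MM-DD hh-mm-ss Custodian', with an optional .PDF extension.
NAME_PATTERN = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}-\d{2}-\d{2}) (?P<custodian>.+?)(?:\.pdf)?$",
    re.IGNORECASE,
)


def parse_scan_name(name: str):
    """Return (scan time, custodian), or None if the name does not fit the pattern."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    stamp = datetime.strptime(match.group("stamp"), "%Y-%m-%d %H-%M-%S")
    return stamp, match.group("custodian")


def find_naming_problems(names):
    """List names that are unrecognized or duplicated (ignoring letter case)."""
    problems = []
    seen = set()
    for name in names:
        if parse_scan_name(name) is None:
            problems.append(f"unrecognized name: {name}")
        elif name.lower() in seen:
            problems.append(f"duplicate name: {name}")
        seen.add(name.lower())
    return problems


print(parse_scan_name("2015-06-01 14-12-31 Smith, John.PDF"))
print(find_naming_problems(["2015-06-01 14-12-31 Smith, John.PDF", "scan001.pdf"]))
```

The same functions could be applied to the names returned by a directory listing, so that the custodian for each document is captured from the folder structure when no load file is available.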