Automated ESI Processing

This technical note describes the automated file conversion and processing that occurs when electronically stored information (ESI) is uploaded to a case in the Lexbe eDiscovery Platform (LEP).

Automated File Processing

Automated file processing refers to changing the state of ESI without direct human error-checking or proofing. Automated file processing keeps the cost of LEP services reasonable. Many eDiscovery projects involve hundreds of thousands of files, millions of page equivalents, and billions of individual processes. If manual processing were used for even a small portion of that work, it would be cost-prohibitive.

Will every native file convert during the automated process?

Not every file will convert with automated processing. In large jobs, conversion issues are not only possible but expected.

LEP identifies and attempts to convert (to TIFF or PDF) a wide variety of file formats, such as doc, eml, emlx, html, ics, ppt, rar, and xlsx. A failure to convert does not mean that a file lacks probative evidence, only that the automated process could not convert it. All files that fail to convert should receive further review. See Supported for Processing File Types for more information.

What happens when a file type does not convert properly?

If LEP attempts to convert a standard file type and cannot, a placeholder file is generated to indicate the file Failed to Convert.

Files Failed to Convert: Standard file types may fail to convert for a variety of reasons, including file corruption, file type mis-identification, print or data extraction issues, and password protection.

Files Not Converted: Non-standard files that LEP does not support, which could include media files, some container files, some email files, database files, and other file types.

Some standard file types that fail to convert can be converted manually by Professional Services (hourly or per GB charges depending on file type and issues involved).

Automated ESI Processing during Document Upload Includes:

Archive/Container Decompression. During upload and processing, LEP expands container files. Container files are compound file types that LEP can separate into parts or reassemble without losing data as part of the conversion process.

LEP expands the following container files automatically as part of selected automated conversion services.

PST
RAR
ZIP
7z
MBOX (exported from Gmail, not iCloud)
OST
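As a rough illustration of container expansion, the sketch below recursively expands a ZIP archive held in memory using only the Python standard library. It is not LEP's implementation: the function name `expand_zip` and the path-joining convention are assumptions made for this example, and formats such as PST, OST, RAR, 7z, and MBOX would need other tools.

```python
import io
import zipfile


def expand_zip(data: bytes, prefix: str = "") -> dict[str, bytes]:
    """Recursively expand an in-memory ZIP archive.

    Returns a mapping of path -> file bytes. Nested archives are expanded
    and their members are listed under the nested archive's own path.
    """
    extracted: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = prefix + info.filename
            # Assumption for this sketch: only members named *.zip are treated
            # as nested containers. Office files such as .docx and .xlsx are
            # also ZIP-based and should stay intact as single documents.
            is_nested = path.lower().endswith(".zip") and zipfile.is_zipfile(
                io.BytesIO(payload)
            )
            if is_nested:
                extracted.update(expand_zip(payload, prefix=path + "/"))
            else:
                extracted[path] = payload
    return extracted
```

Each expanded member keeps the path of the archive it came from, which is one simple way to preserve the parent/child relationship after expansion.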

Metadata extraction. The email body and attachments are extracted from each Outlook MSG file and associated with the MSG container for load file preparation. Supported Outlook MSG and Outlook Express EML files are processed recursively to extract attachments. See Master Date and Metadata Fields for more information.
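Recursive attachment extraction can be sketched for EML files with Python's standard `email` package. This is an illustration only, not LEP's code: the function name `extract_attachments` and the `(depth, filename, bytes)` result shape are assumptions, and Outlook MSG files use a different binary format that the standard library does not parse.

```python
from email import message_from_bytes, policy


def extract_attachments(raw: bytes, depth: int = 0) -> list[tuple[int, str, bytes]]:
    """Return (depth, filename, content) for every attachment in an EML message.

    Attached emails (message/rfc822) are recorded and then opened in turn,
    so attachments of attachments are found as well.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    found: list[tuple[int, str, bytes]] = []
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a nested message object.
            nested_raw = part.get_payload(0).as_bytes()
            found.append((depth, name, nested_raw))
            found.extend(extract_attachments(nested_raw, depth + 1))
        else:
            # decode=True undoes base64 / quoted-printable transfer encoding.
            found.append((depth, name, part.get_payload(decode=True) or b""))
    return found
```

The `depth` value records how far each attachment sits below the top-level email, which is the kind of parent/child information a load file needs.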

MD-5 hash code generation. MD-5 is a commonly used cryptographic hash function that produces a 128-bit (16-byte) hash value, expressed as a 32-digit hexadecimal number. MD-5 is used to check data integrity: the value changes if there is any change in the contents of a file. The hash value is independent of the file name, because the name is stored by the operating system, not in the file. LEP deduplicates loose native files using the MD-5 hash of each file.
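The following minimal sketch shows hash-based deduplication in Python with the standard `hashlib` module. The helper names `md5_hex` and `group_duplicates` are hypothetical and are not part of LEP; the point is only that files with identical contents share a hash regardless of their names.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the content as 32 hexadecimal digits."""
    # usedforsecurity=False (Python 3.9+) marks this as an integrity check,
    # not a security control.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def group_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by the MD5 hash of their contents."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, content in files.items():
        groups[md5_hex(content)].append(name)
    return dict(groups)


files = {
    "contract_final.doc": b"same bytes",
    "copy of contract.doc": b"same bytes",   # different name, same content
    "contract_v2.doc": b"same bytes!",       # one byte added
}
duplicates = [names for names in group_duplicates(files).values() if len(names) > 1]
```

Here `duplicates` contains one group holding the first two names, while the third file hashes differently because its content differs by a single byte.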

File extension repair and encoding. LEP parses files and attempts to automatically identify file types for proper recognition and conversion services. LEP does not rely on extensions alone to detect file formats, as extensions can be changed (intentionally or unintentionally) or become corrupted during encoding, decoding, and transfer.

The Lexbe file repair process identifies file types from their binary signatures. It uses a database of definitions that describe recurring patterns for over 8,000 supported file types. Our programming logic is extensible and can be trained to recognize new formats quickly and automatically.

Our file repair process would, for example, take as input a Word document named "sample.mp3", identify it as a Word document from its binary signature, and then repair the extension so that the file is named "sample.doc". See Repairing PSTs for more information.
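The "sample.mp3" example can be sketched in Python as a lookup against a few well-known leading byte signatures. This is a simplified illustration under stated assumptions, not the Lexbe definition database: the `SIGNATURES` table and the function names are hypothetical, and the OLE2 compound-file signature is shared by legacy .doc, .xls, .ppt, and .msg files, so a real detector has to inspect more than the first eight bytes.

```python
from pathlib import PurePath
from typing import Optional

# A tiny, illustrative signature table: (leading bytes, extension).
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    # OLE2 compound file. Mapped to .doc here only to mirror the example in
    # the text; the same signature is used by other legacy Office formats.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", ".doc"),
]


def detect_extension(header: bytes) -> Optional[str]:
    """Return the extension implied by the leading bytes, or None if unknown."""
    for magic, extension in SIGNATURES:
        if header.startswith(magic):
            return extension
    return None


def repair_name(filename: str, header: bytes) -> str:
    """Return the file name with its extension corrected to match the content."""
    detected = detect_extension(header)
    path = PurePath(filename)
    if detected is None or path.suffix.lower() == detected:
        return filename  # unknown type or already correct: leave unchanged
    return str(path.with_suffix(detected))


ole2_header = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8
repaired = repair_name("sample.mp3", ole2_header)
```

With the table above, `repaired` is "sample.doc", while a file whose signature is not in the table keeps its original name.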

DeNIST. See DeNIST Files for more information.

Email attachment extraction and parent email association. Emails and their attachments are kept in order in LEP. The first document is the email body, and the documents that follow it are the attachments. See Email Attachments for more information.

Native text extraction. LEP's native extraction process takes raw native files as input, flattens any files attached within container files such as an MSG, and extracts file metadata. For Outlook OST and PST files, that means opening each MSG file and extracting data from any attachments.

Optical character recognition (OCR) of image files. Optical character recognition (OCR) is the process of taking image files (e.g., scanned from documents) and electronically converting them into searchable text. See OCR for more information.

Full-text indexing. LEP indexes both extracted text in native files and the OCRed text from paginated versions of the same files, in a comprehensive, combined index for fast and easy searching.
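A combined index of this kind can be pictured as a single inverted index fed from both text sources. The sketch below is a toy illustration in Python, not LEP's search engine; the document IDs, the `native` and `ocr` keys, and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Build one inverted index (token -> document IDs) from all text sources."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, sources in docs.items():
        for text in sources.values():  # e.g. native-extracted text and OCR text
            for token in tokenize(text):
                index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    "DOC-1": {"native": "Quarterly revenue forecast", "ocr": "signed by J. Smith"},
    "DOC-2": {"native": "", "ocr": "revenue summary scanned"},
}
index = build_index(docs)
hits = search(index, "revenue smith")
```

Because both sources feed the same index, a query can match a document even when one term appears only in its native text and another only in its OCRed text, as with "DOC-1" here.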

PDF creation. All documents uploaded into LEP automatically have OCR applied and a PDF version created and associated with the original document.

Custodian/Case Participants Assignment. Custodians/Case Participants can be assigned to documents as part of the upload process or after upload using Multi-Doc Edit. They can also be assigned in the Document Viewer after upload. See Custodians for more information.

Manually Resizing Large Images or PDFs for Best Bates Display

Before uploading PDFs or native files that contain large images, resize the images to 8.5" x 11" so that any Bates stamps added to the lower right-hand corner as part of a production will be clearly visible.

Use a recent version of Acrobat Pro, or another PDF Bates-stamping utility with granular page-offset options, to resize the images, then re-upload them to LEP so that the Bates stamps are clearly visible at the page level. Alternatively, Professional Services can perform this work (billed at an hourly rate). Because the images show the Bates number in the filename, many find manual resizing an unnecessary expense.
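The resizing itself is simple arithmetic: scale the page uniformly so that it fits within 8.5" x 11" (612 x 792 points at 72 points per inch). The Python sketch below only computes the scale factor and resulting size; the function name `fit_to_letter` is hypothetical, portrait orientation is assumed, and applying the scale to an actual PDF still requires a tool such as Acrobat Pro.

```python
LETTER_WIDTH_PT = 8.5 * 72    # 612 points
LETTER_HEIGHT_PT = 11.0 * 72  # 792 points


def fit_to_letter(width_pt: float, height_pt: float) -> tuple[float, float, float]:
    """Return (scale, new_width, new_height) for a uniform fit to letter size.

    The aspect ratio is preserved, so one dimension may end up smaller than
    the letter page.
    """
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("page dimensions must be positive")
    scale = min(LETTER_WIDTH_PT / width_pt, LETTER_HEIGHT_PT / height_pt)
    return scale, width_pt * scale, height_pt * scale


# A 24" x 36" drawing is 1728 x 2592 points.
scale, new_width, new_height = fit_to_letter(24 * 72, 36 * 72)
```

For the 24" x 36" drawing, the height is the limiting dimension, so the page is scaled to roughly 31% of its original size and ends up about 528 x 792 points.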

Bates Applied to Logos and Other Small Images

LEP's automated file processing places small images on blank PDF pages (8.5" x 11" standard letter size) so that the Bates stamp does not run off the page, and the Bates number will also be applied to the Bates name. See Logos and Other Small Images for more information.

Additional Features

Deduplication

Near Duplication

Email Threading
