The Lexbe eDiscovery Platform (LEP) includes what we believe is the most comprehensive search index in the eDiscovery industry. This Technical Note describes how the Uber Index is built and what it comprises. The Uber Index joins (or concatenates) multiple versions of, and data extracted from, the same document to increase searchability. These document versions are joined into a single index for comprehensive search results.
The Uber Index can be used in a variety of eDiscovery and Litigation Document Management processes:
>To improve search-based issue coding
>To review incoming productions to find hot or highly relevant documents
>To improve accuracy of privilege and work-product document review
>To improve the accuracy of predictive coding
>To enable cross-language search terms, including cross-language concept search
Below is a more detailed description of what is included in the Lexbe Uber Index, as well as how it works and its advantages over other, less comprehensive search indexing approaches.
What is Included in the Lexbe Uber Index
The Lexbe Uber Index includes the following separate processed versions of a file:
The Uber Index is built on extracted text from the Native version of a file, when available. Our text extraction tools support most native file types in use today in commercial litigation document collections. The extracted text appears in the ‘HTML’ tab in the Lexbe eDiscovery Platform Document Viewer and is also included in the Uber Index.
Lexbe’s Extracted Text indexing includes data that is often missed by eDiscovery and Search tools that rely on imaging and OCR alone. Content missed by those approaches can include important real-world evidence, such as hidden worksheets in Excel spreadsheets, revisions and comments in Word documents, and speaker notes in PowerPoint presentations. See below ‘Uber Index Advantages Over Other Search Tools-Search Tools with an Imaging/OCR-Only Approach’ for more examples.
The example below illustrates a hidden Excel sheet that is not included in the converted PDF or on the OCR'd TEXT tab. It is, however, extracted from the native and contained on the HTML tab, fully indexed for searching:
OCRed Text from Lexbe normalized-PDF
The Lexbe eDiscovery Platform creates a normalized PDF of all native documents uploaded to the system. This normalized PDF provides an easy-to-view paginated version of the document for our document viewer, and is also used for creating redacted and annotated versions of documents in a collection. As part of the PDF process, we OCR all image-based files (such as JPG, PNG and TIFF) as well as embedded images that may be included in Word and other Microsoft Office files. OCR is applied whenever possible to maximize the text searchability of all data in a file, no matter what the original format. The OCRed text complements the native Extracted text above, is available in the ‘Text’ tab in the Lexbe eDiscovery Platform Document Viewer, and is joined for searchability in the Lexbe Uber Index. See Supported File Types for Automated PDF and TIFF Creation.
Lexbe’s OCRed Text indexing includes data that is missed by eDiscovery and Search tools that rely on extracted text from natives alone. Content missed by those approaches can include anything that would be searchable only with the application of OCR. Common real-world examples include scanned documents, email attachments that have not been OCRed, Office files (Word, Excel or PowerPoint) that have embedded images containing text, and image files with text. See below ‘Uber Index Advantages Over Other Search Tools-Search Tools with a Native-Extraction Only Approach’ for more examples.
The example below illustrates a chart that is an embedded image on a PowerPoint slide. Note that the text cannot be extracted, but rather is OCR'd and appears on the TEXT tab, fully indexed for searching:
The Lexbe Uber Index includes integrated translated versions of documents in a collection. These can be machine translations (for speed and low cost) or human translations (for accuracy). The English translated version is joined into the Uber Index, along with the native extraction and OCRed paginated versions discussed above. This allows search (including concept search) across languages using a single search interface. See below ‘Uber Index Advantages Over Other Search Tools-Search Tools without Integrated Language Translation’ for more examples.
Document metadata can include important information for search retrieval, even though it may not show up in extracted or OCRed document text. The Lexbe Uber Index includes processed file metadata in the search index to improve searchability. The indexed metadata includes, when available, the document file name, the document author, email sender and recipients, and document and email dates.
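The joining idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Lexbe's actual implementation: the names `DocumentVersions`, `build_uber_text`, `build_index` and `search`, as well as the sample documents, are hypothetical. The sketch concatenates the native-extracted text, the OCRed text, the English translation and the metadata values of each document into one searchable text, then builds a toy inverted index over it.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocumentVersions:
    """Hypothetical container for the processed versions of one file."""
    doc_id: str
    native_text: str = ""       # text extracted from the native file
    ocr_text: str = ""          # OCR text from the normalized PDF
    translated_text: str = ""   # English translation, if any
    metadata: dict = field(default_factory=dict)  # file name, author, dates


def build_uber_text(doc):
    """Concatenate every available version and metadata value into one string."""
    parts = [doc.native_text, doc.ocr_text, doc.translated_text]
    parts.extend(str(value) for value in doc.metadata.values())
    return "\n".join(part for part in parts if part)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each token to the ids of documents whose joined text contains it."""
    index = defaultdict(set)
    for doc in docs:
        for token in tokenize(build_uber_text(doc)):
            index[token].add(doc.doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Illustrative sample data (invented for this sketch).
docs = [
    DocumentVersions("DOC-1", native_text="Q3 forecast hidden worksheet",
                     metadata={"author": "J. Smith"}),
    DocumentVersions("DOC-2", ocr_text="Scanned invoice total due"),
    DocumentVersions("DOC-3", native_text="Vertrag über Lieferung",
                     translated_text="Contract for delivery"),
]
index = build_index(docs)
```

In this sketch, `search(index, "invoice")` finds DOC-2 only because its OCRed text was joined in, `search(index, "contract delivery")` finds the German-language DOC-3 through its English translation, and `search(index, "smith")` finds DOC-1 through its author metadata. An index built from any single source would miss at least one of these documents.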
Updating, Completeness and Speed of the Lexbe Uber Index
The Lexbe Uber Index updates in real time as new documents are added to the case. The Lexbe eDiscovery Platform is rare in this regard, as many common eDiscovery and Document Management tools require an administrator to update the search index manually, and many of those search indexes are unusable while re-indexing. The Uber Index instead automatically supplements the index as new documents are added. If enough documents are added to leave the index temporarily incomplete, a warning message displays on the search page while the index catches up. Lexbe has built the Uber Index to be very fast: most search results return in seconds, even for very large collections.
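The update behavior described above (supplementing the index as documents arrive, staying searchable in the meantime, and warning while the index catches up) can be illustrated with a small Python sketch. It is an assumption-laden toy, not a description of how the Uber Index is actually implemented: the class `IncrementalIndex`, its methods, the `batch_size` parameter and the wording of the warning are all hypothetical.

```python
import re
from collections import defaultdict, deque


class IncrementalIndex:
    """Toy index that accepts new documents without a full rebuild."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> ids of indexed documents
        self._pending = deque()            # documents added but not yet indexed

    def add_document(self, doc_id, text):
        """Queue a new document; already indexed documents stay searchable."""
        self._pending.append((doc_id, text))

    def process_pending(self, batch_size=100):
        """Index up to batch_size queued documents; return how many were indexed."""
        processed = 0
        while self._pending and processed < batch_size:
            doc_id, text = self._pending.popleft()
            for token in re.findall(r"\w+", text.lower()):
                self._postings[token].add(doc_id)
            processed += 1
        return processed

    def search(self, term):
        """Return (matching ids, warning); warning is None when the index is complete."""
        hits = set(self._postings.get(term.lower(), set()))
        warning = None
        if self._pending:
            warning = (f"Index is catching up: {len(self._pending)} "
                       "document(s) not yet searchable.")
        return hits, warning
```

A search issued while documents are still queued returns results from the documents indexed so far, together with the warning. Once `process_pending()` has drained the queue, the same search returns the complete result set and no warning.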
Uber Index Advantages Over Other Search Tools
Other eDiscovery and general purpose search tools on the market today typically include some of what is in the Lexbe Uber Index, but not all.
Search Tools with an Imaging/OCR-Only Approach
These programs typically process ESI by making a TIFF image of every page in a document. Some more modern programs instead create a PNG, JPG or image-PDF, as these formats can retain color, which the traditional TIFF used in document review programs does not. In creating paginated images of documents, these programs use a ‘virtual print-driver’ approach to create an image of a page that is similar to what would be sent to a physical printer.
As such, the imaged page created includes only what would be printed physically. Important document information can be lost in a virtual printing process, including the following:
>Worksheets outside of the set print area of the worksheet
>Comments and Notes
>Longer Excel files (processing programs typically cut off printing after an arbitrary page limit, such as 100 pages, to avoid runaway TIFFing/printing)
>Saved changes and revisions
>Text highlighting (can become unreadable after a TIFF conversion)
>Large page size settings (like blueprints) where text becomes too small when paginated to 8.5”x11”
>White or very small text in a document
>Documents with font sizing, type or style that impairs OCR
>Documents with text highlighting (may become unreadable and unsearchable after TIFFing)
Search Tools with a Native-Extraction Only Approach
These tools avoid the problems of the Imaging/OCR-Only approach (above) by extracting text from the native file (e.g., Word, Excel, PowerPoint, Outlook) instead of relying on a virtual printing process. Text that can be extracted from a native file will usually be superior, but this approach is limited by the quality of the raw text, and it can miss documents in a collection that would be searchable only with the application of OCR. The following are examples where this can occur.
Office Files (Word, Excel or PowerPoint)
>Images embedded in documents with text included
>Embedded files with un-OCRed text
>PDFs or other scanned documents that have not been OCRed
>Forwarded faxes (e.g., eFax) are usually un-OCRed TIFF images
Other File Types
>JPG, PNG and other image files with text in the image (must be OCRed)
>Files that can be virtually printed but for which text extraction is not available
Search Tools without Integrated Language Translation
Modern commercial cases are increasingly international in at least some aspects and include multiple languages. Most attorneys and non-specialized reviewers in the United States speak only English, which means that important non-English documents can be missed. Email strings increasingly include multiple languages, as someone may reply in a different language than the one used by the original sender. English-language search alone, without automatic identification and translation of foreign-language documents, may fail to return important responsive documents.
The following are examples of documents that may be unsearchable without integrated language translation:
>Email with multiple languages used in the email string.
>Foreign language documents attached to English-language emails.
>Foreign language documents mixed in with a mainly English-language collection.
Search Tools without Metadata Integration
File metadata is an important component of search. It is fielded data about a file and may never show up in the readable text of the document itself. Important file metadata that may benefit from integrated search includes the following:
>Document file name
>Document author/Email sender