QC'ing Files with Little or No Text
General Description
The Lexbe eDiscovery Platform (LEP) supports TIFF and PDF file formats, including their full-text indexed versions. However, some files may have little or no text indexed in the application's search engine, which becomes apparent when the user opens the text version.
Reasons Files Are Not Indexed By The Search Engine
A file may not be searchable for any of the following reasons:
Corrupted (damage, data corruption, or virus)
Password-protected
Restricted for viewing
Embedded files or attachments
Separate TIFF/text file pairs (single-page or multi-page)
Hidden or encrypted words
How To Identify Files With Little Or No Text Index
From the Browse, Search, and Add Case Documents pages, the user can choose which columns are displayed in the main table under Fields>Show Fields. We recommend displaying the following:
Title
Rank
Ext
Pages
Words: This column reflects the total number of words captured during search indexing. It should equal the number of words in the original document plus the number of words from the Normalized PDF indexed with our Multi-Index tool. Our search returns results from both the text extracted from native files and the text of the PDF-paginated version we create, including OCRed images.
Size
Doc Type
Export the current log to an Excel spreadsheet, following the steps below:
From the current page in LEP, select all documents (click on the link Select all XX documents in list).
Click Export>Export Log to Excel.
Open (or save and open) the Excel spreadsheet.
Add an extra column to the spreadsheet called Words per Page, and use it to calculate the average number of words per page (Words divided by Pages) for each file.
Exporting To Excel
Using the average words per page calculated for each file, identify which files have little or no text indexed.
Column G of the spreadsheet above shows one file with '0' words per page and others with low averages. In cases like this, open the files to determine why the averages do not match the number of words and pages actually in the document.
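For large logs, the words-per-page check can also be scripted rather than done by hand in Excel. The following is only a minimal sketch, not a Lexbe feature. It assumes the exported log has been saved as CSV and that its header row uses the names Title, Pages, and Words. The sample rows and the threshold of 50 words per page are made up for illustration; adjust them to the actual export and to what counts as "little text" in your case.

```python
import csv
import io


def words_per_page(words, pages):
    """Return the average words per page, or 0.0 if the page count is zero."""
    if pages <= 0:
        return 0.0
    return words / pages


def flag_low_text(rows, threshold=50.0):
    """Return (title, average) pairs for rows below the threshold.

    The threshold is a hypothetical cutoff, not a value defined by LEP.
    """
    flagged = []
    for row in rows:
        pages = int(row["Pages"] or 0)  # treat an empty cell as 0
        words = int(row["Words"] or 0)
        average = words_per_page(words, pages)
        if average < threshold:
            flagged.append((row["Title"], average))
    return flagged


# Made-up sample standing in for an exported log saved as CSV.
SAMPLE_CSV = """Title,Ext,Pages,Words
contract.pdf,pdf,10,3200
scan_001.tif,tif,4,0
memo.docx,docx,2,30
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
low_text = flag_low_text(rows, threshold=50.0)

for title, average in low_text:
    print(f"{title}: {average:.1f} words per page")
```

With the sample rows, the script lists scan_001.tif (0 words per page) and memo.docx (15 words per page) as candidates to open and inspect. To run it against a real export, replace the io.StringIO(SAMPLE_CSV) source with an opened CSV file.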