Supported File Types for Extracted Native Text

This Technical Note describes the file types supported in the Lexbe eDiscovery Platform (LEP) for Extracted Native Text.

General

The Lexbe eDiscovery Platform (LEP) supports a number of file formats for native review. For files that contain text, that text is indexed into the application search engine for full-featured search and retrieval.

Files that do not convert as part of the automated processing services are marked with a placeholder file, either in PDF or TIFF, depending on the service ordered.

Prior to converting files, LEP applies container file expansion, DeNIST, and extension repair procedures.  See Automated ESI Processing for more information.  

LEP uses built-in document filters for extracted-text document parsing.  LEP text extraction generally detects file formats by examining the actual file contents, not the extension or reported MIME type, to avoid being affected by misleading filenames.  For example, a Word document named "sample.exe" would still be identified as a Word document.  In some ambiguous cases, such as distinguishing XML and HTML files, the extension is used as a clue as part of the parsing algorithm.  
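
The idea of content-based detection can be illustrated with a minimal sketch. This is not LEP's implementation; the function name sniff_format and the label strings are hypothetical, and only a handful of well-known leading byte signatures are checked. The extension is consulted only for the ambiguous markup case, mirroring the XML/HTML example above.

```python
import os

# Well-known leading byte signatures for a few formats named in this note.
# The label strings are illustrative only (assumption, not LEP's naming).
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Office: doc, xls, ppt, msg
    (b"PK\x03\x04", "zip"),                          # also docx, xlsx, pptx
    (b"Rar!\x1a\x07", "rar"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),
    (b"MM\x00*", "tiff"),
]


def sniff_format(data: bytes, filename: str = "") -> str:
    """Guess a format from content first; use the extension only to break ties."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    # Markup is ambiguous by content alone, so the extension is used as a clue.
    if data.lstrip().startswith(b"<"):
        ext = os.path.splitext(filename)[1].lower()
        return "html" if ext in (".htm", ".html") else "xml"
    return "unknown"


# A legacy Word document keeps its OLE2 signature even when renamed to .exe.
word_bytes = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8
print(sniff_format(word_bytes, "sample.exe"))
```

A production filter would inspect far more than the first few bytes (for example, looking inside a ZIP to tell docx from xlsx), so this sketch only shows why a misleading filename does not change the result.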

Embedded objects and attachments are indexed as part of the document that contains them.  For example, a spreadsheet object embedded in a PowerPoint presentation would be treated as part of the PowerPoint presentation. 

See Uber Index.

Supported File Types for Native Text Extraction 

LEP identifies, and attempts to convert, the following file types for native text extraction:

 Ext       Application/Description                   Type
 bmp       Image BMP                                 Image
 class     Java programming file                     Text
 config    Application configuration file            Text
 css       Cascading style sheet, web page support   Text
 csv       Comma-separated values                    Text
 doc       Microsoft Word                            Text
 docx      Microsoft Word                            Text
 eml       Microsoft Outlook Express email           Email
 gif       Image GIF                                 Image
 htm       HTML web page                             Web Page
 html      HTML web page                             Web Page
 ics       iCalendar file                            Text
 inf       Setup information file                    Text
 ini       Text configuration file                   Text
 jpeg      JPEG                                      Image
 jpg       JPG                                       Image
 js        JavaScript programming file               Text
 json      JavaScript Object Notation file           Text
 lnk       Windows file shortcut                     Text
 log       Application log file                      Text
 manifest  Java programming file                     Text
 mht       HTML web page                             Web Page
 mht       MHT archives saved by Internet Explorer   Web Page
 msg       Microsoft Outlook email                   Email
 pdf       Adobe Acrobat, converted from text        Text
 pdf       Adobe Acrobat, image only                 Image
 pdf       Adobe Acrobat, text under image           Image
 php       PHP programming file                      Text
 png       PNG image                                 Image
 pps       Microsoft PowerPoint                      Presentation
 ppsx      Microsoft PowerPoint                      Presentation
 ppt       Microsoft PowerPoint                      Presentation
 pptx      Microsoft PowerPoint                      Presentation
 pst       Microsoft Outlook data files              Container
 py        Python programming script                 Text
 rar       RAR                                       Container
 rtf       Microsoft Rich Text Format                Text
 tif       TIF                                       Image
 tiff      TIFF                                      Image
 txt       ASCII                                     Text
 url       Uniform Resource Locator file             Text
 vcf       vCard contact information file            Text
 xls       Microsoft Excel                           Spreadsheet
 xlsm      Microsoft Excel                           Spreadsheet
 xlsx      Microsoft Excel                           Spreadsheet
 xml       XML text                                  Text
 zip       Archive                                   Container
   
AutoCAD Supported File Types for Extraction

 Ext   Application/Description           Type
 dwg   AutoCAD Native Format             Design data
 dxf   AutoCAD Drawing Exchange Format   Design data

Mac Supported File Types for Extraction

 Ext       Application/Description     Type
 pages     iWork Pages for the Mac     Text
 numbers   iWork Numbers for the Mac   Spreadsheet
 key       iWork Keynote for the Mac   Presentation

**MBX email files do not auto-convert; see Email Files below.**

Failure to Convert Standard File Types

If a standard file type fails to convert, a placeholder file is created and the database record notes that the file Failed to Convert. Standard file types may fail to convert for a variety of reasons, including file corruption, file type misidentification, print or data extraction issues, and password protection. Password-protected files are inherently not searchable (even with dual index) and require extra due diligence.

Some non-extracted standard file types can be converted manually as a professional service (billable hourly or per GB, depending on file type and issues involved).

Other Files Not Extracted

LEP does not auto-extract native text from files other than the standard file types listed above.  Instead, a placeholder file is created and it is noted in the database record that the file was Not Converted (i.e., is not supported). 
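
The difference between the two placeholder outcomes can be summarized in a short sketch. It is illustrative only and does not reflect LEP internals: the function name conversion_status, the abridged extension set, and the "Converted" label for the success case are assumptions; only "Failed to Convert" and "Not Converted" come from this note.

```python
# Abridged from the supported-type tables in this note; not the full list.
STANDARD_EXTENSIONS = {
    "doc", "docx", "xls", "xlsx", "ppt", "pptx",
    "pdf", "msg", "eml", "txt", "htm", "html",
}


def conversion_status(extension: str, extracted_ok: bool) -> str:
    """Return the status a database record would carry under the rules above."""
    ext = extension.lower().lstrip(".")
    if ext not in STANDARD_EXTENSIONS:
        # Unsupported type: extraction is never attempted, placeholder only.
        return "Not Converted"
    if not extracted_ok:
        # Supported type, but extraction failed (corruption, password, etc.).
        return "Failed to Convert"
    return "Converted"  # hypothetical label for the success case


print(conversion_status("docx", extracted_ok=False))
print(conversion_status("mdb", extracted_ok=False))
```

Keeping the two statuses distinct matters for follow-up: a Failed to Convert record points to a problem with a specific file, while a Not Converted record points to a file type that needs manual handling.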

For some non-converted, non-standard file types, text can be extracted manually as a professional service (billable hourly or per GB, depending on file type and issues involved).

Files from which native text is not extracted include media files, some container files, some email files, database files, and others, described in more detail below.

Failure to extract native text from a file does not mean the file lacks probative evidence, only that automated procedures did not extract its text. These files should be reviewed and further steps taken to extract the native text, when appropriate.

Media Files

Native text cannot be extracted from media files (video and audio). They can be uploaded to LEP and coded, and they can sometimes be viewed or played, depending on file type, connection speed, local browser, computer settings, and installed applications. The following is a list of common media file types:

 Ext   Application/Description   Type
 avi   Windows video             Video
 asf   ASF                       Video
 m4a   QuickTime                 Audio
 m4p   Apple                     Audio
 m4v   QuickTime                 Video
 mov   QuickTime                 Video
 mp3   MP3                       Audio
 swf   Flash                     Video
 wav   WAV file                  Audio
 wma   WMA                       Audio or Video
 wmf   Windows                   Metafile Format
 wmv   Windows Video             Video

Unusual Container Files

As part of automated processing, LEP extracts ZIP and RAR files. LEP does not automatically extract unusual container files. Examples include 7z, G7, Iza, Jar, Sit, and JP2. Many container files can be extracted manually as a professional service (billed hourly).
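
As a rough illustration of what expanding a container involves, the sketch below unpacks a ZIP held in memory with Python's standard zipfile module. It is not LEP's processing code, and the function name expand_zip is hypothetical. RAR and the unusual container types listed above have no standard-library support in Python, which is one reason such formats tend to need separate handling.

```python
import io
import zipfile


def expand_zip(data: bytes) -> dict:
    """Return {member name: content bytes} for each file inside a ZIP archive."""
    members = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if not info.is_dir():  # skip directory entries
                members[info.filename] = archive.read(info)
    return members


# Build a small archive in memory, then expand it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("memo.txt", "quarterly results")
    archive.writestr("exhibits/a.csv", "id,value\n1,2\n")

for name, content in expand_zip(buffer.getvalue()).items():
    print(name, len(content))
```

Each expanded member would then go through format detection and text extraction as an individual file.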

Email Files

The automated conversion process extracts native text from Outlook PST and MSG files.
The automated process does attempt to extract productivity files used on the Mac (e.g., Microsoft Office for the Mac, Apple Numbers, etc.), and these files occasionally convert, depending on version and other factors. More often, however, they will not convert, will mis-convert, or will generate internal Mac resource fork files. Other email files or stores from which native text can be extracted manually as a professional service, prior to automated processing, are listed below:

 Ext   Application/Description
 dbx   Microsoft Outlook Express 5 and 6
 mbs   Opera Email for Windows
 mbx   Eudora message files
 mbx   MBOX archives, including Google Mail, Apple Mail, and Thunderbird
 nsf   Lotus Notes

Database Files

The automated extraction process does not extract text from database files.  Database types (depending on type and version) that can be processed manually as a professional service are listed below: 

 Ext    Application/Description
 dbf    Oracle or other database
 frm    MySQL
 myd    MySQL
 myi    MySQL
 mdb    Microsoft Access Database
 mdbx   Microsoft Access Database
 iif    Intuit interchange file, QuickBooks
 ldf    SQL Server
 qba    QuickBooks
 qbb    QuickBooks
 qbm    QuickBooks
 qbw    QuickBooks
 qbx    QuickBooks
 qby    QuickBooks

Mac Files

Mac email (Mac native or Outlook for the Mac) is not supported, but Professional Services can extract native text from these files. Best practice is for Mac productivity files to be converted to MS Office for Windows version 2007 or 2010 prior to upload. See also Pre-Processing MBox Files (Gmail and Apple Mail).

Other Files From Which Native Text Is Not Automatically Extracted

 Ext    Application/Description               Type
 123    Lotus 1-2-3 (*.123, *.wk?)            Spreadsheet
 art    Bitmap image file compressed by AOL   Image
 cpp    C++ programming                       Source code
 doc    Microsoft Word for the Mac            Text
 docs   Microsoft Word for the Mac            Text
 epsf   EPSF                                  Image
 hjt    Treepad HJT files                     Other
 hpp    C++ programming file                  Header File
 mpp    Microsoft Project                     Other
 mppx   Microsoft Project                     Other
 obd    Office binder document                Container
 qpw    Quattro Pro                           Spreadsheet
 sam    Ami Pro                               Text
 tmp    Application temporary file            Other
 vdx    Visio XML files                       Image
 vcf    MS Outlook, contact info              Text
 xlk    Backup file created by MS Excel       Other
 wb1    Quattro Pro                           Spreadsheet
 wb2    Quattro Pro                           Spreadsheet
 wb3    Quattro Pro                           Spreadsheet
 wks    Microsoft Works                       Text
 wpg    WPG version 1.0 only                  Image