Deduplication

General

Extra copies of Outlook email (PST and MSG), Outlook Express email (EML), and loose native files such as Excel spreadsheets and Word files are often found in a case. The Lexbe eDiscovery Platform (LEP) identifies duplicate files, which can then be removed from the case. This streamlines the review process by reducing inconsistencies, errors, and review costs.

Deduplication Within or Between Custodian Collections

Deduplication within each custodian is called vertical deduplication. Deduplication between custodians is called horizontal deduplication. Generally, horizontal deduplication is not considered a best practice because it loses the association of other custodians with the deduplicated files. If needed, however, it can be done in LEP if custodian assignments have not been made.

When LEP looks for duplicates, it runs vertically through each custodian's documents. For example, suppose custodian John sent an email with multiple attachments to several recipients. LEP looks vertically in John's documents for exact duplicates, then marks the email and its attachments (the entire family) together as a duplicate. If John's email is deleted, the attachments associated with the email are also deleted.

Deduplication Process in LEP

Deduplication is done within identified Custodians only (using the Custodian field in LEP) and not between Custodians. If no Custodians are assigned in a case, deduplication is performed across all documents.

Deduplication is accessed from Case->Add Case Documents->Deduplication by clicking the Dedup Case Docs button. This feature marks duplicates but does not delete them. When started, duplicate identification (including any existing identifications) is run across the entire case. Only Account Admin Users can apply deduplication; all users can view and use the results. Please note that the prime document is always the first one uploaded to LEP. This ensures that any version added to LEP at a later time is marked as the duplicate and does not replace the prime version, which may have since been coded.
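
The scoping and prime-document rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LEP's implementation: the record keys (id, custodian, fingerprint, uploaded) and the function name mark_duplicates are hypothetical, and fingerprint stands for whatever value is used to decide that two documents are exact duplicates.

```python
from collections import defaultdict

def mark_duplicates(docs):
    """Return the ids of documents that would be marked as duplicates.

    docs: list of dicts with hypothetical keys 'id', 'custodian'
    (None if unassigned), 'fingerprint', and 'uploaded' (sortable order).
    """
    any_custodian = any(d["custodian"] for d in docs)
    groups = defaultdict(list)
    for d in docs:
        # With custodians assigned, compare only within a custodian (vertical).
        # With none assigned, compare across the whole case.
        scope = d["custodian"] if any_custodian else None
        groups[(scope, d["fingerprint"])].append(d)

    duplicates = set()
    for members in groups.values():
        members.sort(key=lambda d: d["uploaded"])
        # The earliest upload is the prime; later copies are marked, not deleted.
        duplicates.update(d["id"] for d in members[1:])
    return duplicates

docs = [
    {"id": 1, "custodian": "John", "fingerprint": "a", "uploaded": 1},
    {"id": 2, "custodian": "John", "fingerprint": "a", "uploaded": 2},
    {"id": 3, "custodian": "Bill", "fingerprint": "a", "uploaded": 3},
]
marked = mark_duplicates(docs)
```

In this sample only document 2 is marked: it is a later copy within John's documents, while Bill's copy is retained because it belongs to a different custodian.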

How to Identify Duplicates

LEP flags duplicate files in the IsDuplicate column, which can be accessed from the Browse and Search pages (Sort, Show Columns or Select Filters).


The user can also access duplicate files from the Document Viewer, identified under the Exact Duplicate section. From this page the user is able to find similar key documents, accelerate review by mass tagging similar documents, and check consistency on responsiveness and privilege groupings between similar documents, email threading, etc.


Filter For Duplicates 

Once deduplication is completed, go to the Browse or Search pages and apply one of the following filters:
Duplicates->Show Duplicates Only. Displays only the group of files marked as duplicates.
Duplicates->Exclude Duplicates. Hides all files marked as IsDuplicate in the case.

Filters do not change the file count within a case (there is no automatic deletion). For purposes of the LEP storage calculation, delete any unneeded files from the case.

When the filter is applied, LEP automatically saves the filter under the Filter Quick Links section and creates filter hyperlinks to open a specific set of documents. Rename the filter by clicking the Edit hyperlink.




Deleting Duplicates

To remove duplicate files, apply the Show Duplicates Only filter (checked), select All XX Documents in Case, and then delete them. Deleted duplicates cannot be recovered. An alternative is to hide duplicates from view (Select Filter->Duplicates->Exclude Duplicates), or to download them before deleting so that recovery is possible. Best practice is to create a download briefcase as a backup and save it to your desktop. No method of detecting duplicates is 100% guaranteed.


Risks Of Deleting Email Attachments

Best practice is not to delete duplicates marked as IsEmailAttachment in LEP. Users reviewing an email collection need to be able to establish how the content was distributed and who may have been sharing information. Often, emails (MSG) from different custodians contain the same attachments.

An email attachment and an identical loose file will not be marked as IsDuplicate in LEP. An attachment generally should not be deleted as a duplicate unless the email it belongs to is deleted as well. Otherwise, the email family will be broken and the Document Viewer will not associate them. LEP will not deduplicate email attachments from different email families.



Why Duplicates Exist

Duplicates occur because individual custodians of data may have duplicate copies of files or documents. This often occurs with email. When collecting from a custodian, multiple email stores may be collected to ensure thoroughness, so multiple copies of the same email may be collected. An example would be collecting a PST file from a laptop while also collecting a Google Gmail account from the cloud.

Another example would be duplicates between custodians. If John Smith sends an email to Bill Jones, the email will be duplicated in a collection from each custodian. This can lead to dozens or more copies of the same file. See Email Threading for more information.

A third example is attachments.  One Word file might be added as an attachment to multiple emails, resulting in duplication within and between custodians.



A fourth example is somewhat similar emails that are part of email chains.  While similar, the copies are distinct and the minor differences may be important (e.g., "Yes, I agree" added to a 1,000 word email).

A fifth example is the same paper file scanned at different times or with different file formats.  While the underlying document is the same, each will be a different electronic file.  If scanned with different software, the OCR text may be different or paper versions may be slightly different -- one has handwriting on it (e.g., "Yes, I agree"), while the other does not.  It is important to use caution when removing duplicates.



Types Of Duplicates

Exact (Hash) duplicates. Loose native files (not email attachments) are exact duplicates when the MD5 hash values of the entire electronic files match. Files with the same content are not necessarily exact (hash) duplicates. For example, suppose the same PDF is found twice in a case, once as an original scanned PDF without OCR and once with OCR applied (the addition of a text layer beneath the image). The two files are not exact electronic copies. The same is true for an identical document saved as a Word file versus a PDF. The hash values differ, so the files are not exact electronic copies.
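
As a hedged illustration of whole-file MD5 comparison, the sketch below uses Python's standard hashlib module. The function name md5_of_stream and the sample byte strings are made up for the example; they stand in for real files and do not reflect how LEP computes hashes internally.

```python
import hashlib
import io

def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the MD5 hex digest of everything readable from a binary stream."""
    digest = hashlib.md5()
    # Read in chunks so large files do not need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Stand-ins for file contents: a scan, a byte-for-byte copy, and an OCR'd version.
original = b"scanned page image"
exact_copy = bytes(original)
with_ocr = original + b" plus OCR text layer"

same = md5_of_stream(io.BytesIO(original)) == md5_of_stream(io.BytesIO(exact_copy))
differs = md5_of_stream(io.BytesIO(original)) != md5_of_stream(io.BytesIO(with_ocr))
```

With a real file, the same function can be called on a handle opened in binary mode, for example open(path, "rb"). Any change to the bytes, such as an added text layer, produces a different hash even when the visible content looks the same.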

Email Duplicates Based On Metadata. Emails are identified as duplicates using metadata from the email.  LEP uses the following Outlook and Outlook Express email metadata fields for deduplication:

Sender
Recipients (including Cc and Bcc)
Email Subject Matter
Email Date & Time Sent

Corrupted emails sometimes have blank entries for one or more of the above fields, so all four fields must be present in order to deduplicate an email. Exact duplicate identification still retains documents that are from different custodians (if custodians are identified). This is for custodian tracking and identity. Also, separate versions of attachments to different email families are retained for email family integrity.
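
The four-field comparison can be sketched as a key function. This is an assumption-laden illustration, not LEP's actual matching logic: the dictionary keys, the function name email_dedup_key, and the normalization choices (lowercasing addresses, ignoring recipient order) are hypothetical.

```python
def email_dedup_key(email):
    """Build a comparison key from the four metadata fields.

    Returns None when any field is blank, meaning the email cannot be
    deduplicated on metadata.
    """
    sender = (email.get("sender") or "").strip().lower()
    recipients = tuple(sorted(
        r.strip().lower() for r in (email.get("recipients") or []) if r.strip()
    ))
    subject = (email.get("subject") or "").strip()
    sent = (email.get("sent") or "").strip()
    if not (sender and recipients and subject and sent):
        return None  # a blank field (e.g. from a corrupted email) blocks deduplication
    return (sender, recipients, subject, sent)

a = {"sender": "john@example.com", "recipients": ["bill@example.com", "amy@example.com"],
     "subject": "Q3 report", "sent": "2015-03-02 09:15"}
b = {"sender": "John@example.com", "recipients": ["amy@example.com", "bill@example.com"],
     "subject": "Q3 report", "sent": "2015-03-02 09:15"}
corrupted = dict(a, subject="")
```

Here a and b produce the same key and would be candidates for duplicate marking within one custodian, while the corrupted record yields None and is left alone.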

Near Duplicates. Near duplicates are files that are very similar but do not qualify as exact (hash) duplicates or email duplicates. These are the most difficult to identify and handle in automated processes, because it is difficult or impossible to determine automatically whether the files are close enough to be treated as duplicates.
We recommend using near duplication and multi-doc editing to mass tag near-duplicate documents if desired. See Near Duplication for more information.

Still Finding Duplicates in the Case?

A file can be introduced into a document collection in multiple ways. Different custodians may attach the same document to an email and send it to different recipients, creating separate copies of the same file. When those separate emails are collected, the same attachment is captured with each email. Those emails are kept in different families, and deduplication will not eliminate copies in different email families.

Manually Identifying and Reviewing Duplicates

To expedite manual review and quality control, the user may group and review similar documents for batch coding, even though they do not qualify as exact duplicates or hash duplicates. To accomplish this, files can be displayed and sorted by title, extension, size or number of words, looking for similarities that might indicate duplicates.

Here are the steps for this approach:

From the Browse or Search pages, show Column Section->Built-in Doc Fields
Select the Original Title, Ext, Pages, Words, Size, and IsEmailAttachment columns. 

Sort on the various column heads and look for similarities that suggest duplicates.  Do not delete a file unless you are sure it is a duplicate.  Do not delete files as duplicates if they are an attachment to an email.
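
If the displayed columns are exported for offline checking, the same comparison can be approximated in a script. The sketch below is one possible heuristic, assuming each row is a dictionary keyed by the column names listed above; the function name candidate_groups and the sample rows are hypothetical, and the output only suggests documents to compare manually.

```python
from collections import defaultdict

def candidate_groups(rows):
    """Group loose files that share extension, page count, word count and size."""
    groups = defaultdict(list)
    for row in rows:
        if row["IsEmailAttachment"]:
            continue  # attachments stay with their email family
        key = (row["Ext"].lower(), row["Pages"], row["Words"], row["Size"])
        groups[key].append(row["Original Title"])
    # Only groups with more than one file are worth a manual look.
    return [titles for titles in groups.values() if len(titles) > 1]

rows = [
    {"Original Title": "Budget.xlsx", "Ext": "xlsx", "Pages": 3, "Words": 410,
     "Size": 20480, "IsEmailAttachment": False},
    {"Original Title": "Budget (1).xlsx", "Ext": "XLSX", "Pages": 3, "Words": 410,
     "Size": 20480, "IsEmailAttachment": False},
    {"Original Title": "Budget.xlsx", "Ext": "xlsx", "Pages": 3, "Words": 410,
     "Size": 20480, "IsEmailAttachment": True},
    {"Original Title": "Memo.docx", "Ext": "docx", "Pages": 1, "Words": 120,
     "Size": 10240, "IsEmailAttachment": False},
]
suspects = candidate_groups(rows)
```

Matching column values are only a hint. Each suggested group still needs to be opened and compared before anything is deleted.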



Duplicates in Loaded TIFF Productions

If an uploaded TIFF Production includes duplicates, these cannot be deduplicated. The TIFFs, separately generated, will hash differently and therefore will not be exact duplicates. Two other methods can be used to identify them:

NearDup+ service.  Documents grouped will not necessarily be exact duplicates, but will be grouped for bulk review and coding. This service is provided by Technical Services.  Contact your sales consultant for a quote.  

Incoming TIFF load files with MD5 hash values. The user (or Technical Services) can group files in a custom field to allow for bulk review and coding.
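
As a rough sketch of the load-file approach, the example below groups document IDs by the hash value supplied in a load file and assigns each group a label that could be stored in a custom field. It assumes a comma-delimited load file with columns named DocID and MD5Hash; both names, the function group_ids_by_hash, and the label format are hypothetical, and real load files often use other delimiters and field names.

```python
import csv
import io
from collections import defaultdict

def group_ids_by_hash(load_file_text, id_col="DocID", hash_col="MD5Hash"):
    """Map each document ID to a group label when its hash value is shared."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(load_file_text)):
        value = (row.get(hash_col) or "").strip().lower()
        if value:  # rows without a hash value cannot be grouped
            groups[value].append(row[id_col])

    shared = sorted((h, ids) for h, ids in groups.items() if len(ids) > 1)
    labels = {}
    for n, (_, ids) in enumerate(shared, start=1):
        for doc_id in ids:
            labels[doc_id] = f"HashGroup-{n:04d}"
    return labels

sample = "DocID,MD5Hash\nA1,aa\nA2,bb\nA3,AA\nA4,\n"
labels = group_ids_by_hash(sample)
```

In the sample, A1 and A3 share a hash value (ignoring letter case) and receive the same label, while A2 and A4 are left unlabeled.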