Near Duplication

This technical note describes how Near Duplication works in the Lexbe eDiscovery Platform (LEP) and how it relates to other LEP features.

Near Duplication identifies files and emails that are nearly identical (50% similar in text content). This includes, for example, multiple versions of a Microsoft Word document with slight modifications over time, or a repetitive chain of emails in a thread. The following are examples of documents that appear to be exact duplicates but differ under computer analysis:

Documents scanned at different times.

Email threads.  See Email Threading for more information.

Documents saved to PDF at different times.

Documents with small unobservable editing changes.

Near Duplication is independent of Deduplication.  See Deduplication for more information.

Contact Professional Services

Near Duplication is a service provided by Professional Services.  Contact your Sales Representative for pricing information.  

Types of Files

Identical.  Two or more documents with the exact same extension, subject, number of words, and content.

Exact Duplicates.  Two or more documents are exact duplicates when their textual content is the same. For example, a Microsoft Word file and the PDF version of that file are exact duplicates.

Near Duplicates.  Significantly similar versions of documents that differ by, for example, a few sentences, words or paragraphs.
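The three categories above can be illustrated with a minimal Python sketch. This is not Lexbe's implementation: the similarity measure (Jaccard overlap of three-word shingles), the 0.5 threshold as a reading of "50% similar", and the helper names normalize, shingles, jaccard, and classify are all assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, k=3):
    """Return the set of k-word shingles in the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a, text_b):
    """Jaccard similarity of the two shingle sets (assumed measure)."""
    sa, sb = shingles(text_a), shingles(text_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def classify(doc_a, doc_b, threshold=0.5):
    """Classify a pair of documents given as dicts with 'ext' and 'text'.

    The 'subject' key is optional. The threshold is a hypothetical
    stand-in for the 50% similarity figure.
    """
    same_meta = (doc_a["ext"] == doc_b["ext"]
                 and doc_a.get("subject") == doc_b.get("subject"))
    if same_meta and doc_a["text"] == doc_b["text"]:
        return "identical"
    if normalize(doc_a["text"]) == normalize(doc_b["text"]):
        return "exact duplicate"
    if jaccard(doc_a["text"], doc_b["text"]) >= threshold:
        return "near duplicate"
    return "different"


word = {"ext": "docx", "text": "The quick brown fox jumps over the lazy dog"}
pdf = {"ext": "pdf", "text": "The quick brown fox jumps over the lazy dog"}
word_v2 = {"ext": "docx",
           "text": "The quick brown fox jumps over the lazy dog today"}
other = {"ext": "docx", "text": "Quarterly sales figures for the region"}

print(classify(word, dict(word)))  # same extension and same content
print(classify(word, pdf))         # same text, different file type
print(classify(word, word_v2))     # one word added
print(classify(word, other))       # unrelated text
```

The Word/PDF pair falls into the exact duplicate category because only the text is compared, while the edited version is caught by the similarity threshold.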

Benefits of Near Duplication

Find similar documents once a hot document has been identified.

Reduce the chance of inadvertently releasing privileged documents by identifying documents that are similar to ones marked privileged or work product.

Reduce the number of documents hosted in a review environment.

Reduce review time by allowing large batches of similar documents to be reviewed and coded at one time.

Increase consistency of review by allowing similar documents to be coded the same.

Include near duplicate documents when creating review sets. 

Increase quality control of outgoing productions.

Near Duplication and Emails

The Lexbe eDiscovery Platform (LEP) looks vertically through a custodian's emails for near duplicates and marks both the email and any attachments (the entire family) as near duplicates.  If a custodian's email is deleted, its attachments are also deleted.

Separate versions of the same document may be attached to emails from different custodians.  These are not marked as exact duplicates but are retained for email family integrity. Therefore, the same document may appear in different custodians' documents.

The user may identify related email families by content in order to identify all emails in a group, detect missing emails, and then keep only the relevant email messages for review.  

How it Works

After Professional Services performs NearDup, the database will be updated with near-duplicate groupings of documents.  The user will also receive an Excel spreadsheet report entitled NearDup Grouping Report.

This feature only marks near-duplicates.  It does not delete them (no batch title selection necessary). All existing files with similar contents will be grouped across the entire case (e.g., Group 1, Group 2, etc.).
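As a rough illustration of how pairwise similarity results could be turned into numbered groups, the Python sketch below merges similar pairs with a union-find structure. It is a sketch under stated assumptions, not LEP's actual method: the function name group_near_duplicates, the input format, and the choice to group transitively (A similar to B and B similar to C puts all three in one group) are hypothetical.

```python
def group_near_duplicates(doc_ids, similar_pairs):
    """Assign a group number to every document that has a near duplicate.

    doc_ids: iterable of document identifiers.
    similar_pairs: iterable of (id_a, id_b) pairs judged similar.
    Returns a dict mapping document id to group number (1, 2, ...).
    Documents with no near duplicate are left out, so nothing is deleted
    or altered; documents are only marked.
    """
    doc_ids = list(doc_ids)
    parent = {d: d for d in doc_ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members per root, preserving the input order of doc_ids.
    members = {}
    for d in doc_ids:
        members.setdefault(find(d), []).append(d)

    groups = {}
    next_group = 1
    for docs in members.values():
        if len(docs) > 1:
            for d in docs:
                groups[d] = next_group
            next_group += 1
    return groups


docs = ["A", "B", "C", "D", "E", "F"]
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
groups = group_near_duplicates(docs, pairs)
print(groups)
```

In this example A, B, and C share one group number, D and E share another, and F is left unmarked because it has no near duplicate.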

How to Identify Near Duplicates

LEP assigns near duplicate files to the same group, identified by a shared numerical value (e.g., Group 1). Groups can be accessed from the Browse and Search pages (Sort, Show Fields, or Select Filters).

The user may also access near duplicate files from the Document Viewer (Near Duplicates section).

Filtering Near Duplicate Groups

Use filters to narrow a search to a specific set of documents, or view only a particular Near Dup Group by entering its ID in the text box.

Show Fields for All Near Duplicate Groups. From the Search or Browse pages, click on Fields->Select Fields->Near Dup Groups. This will display all the near duplicates classified in the groups.

Filter and Display One Group. To narrow the results and show only one specific Near Dup Group, filter by Near Dup Group No., for example 3472.

Sort on Specific Groups. Sort on the groups by clicking on the field title Near Dup Group.

Export Log to Excel. Export the near duplicate log to an Excel spreadsheet to keep track of produced documents from which privileged information might not have been removed. An Excel log allows the user to filter, sort, and see whether any near duplicates are inconsistently coded (e.g., confidentiality, email threading, attachment, etc.).
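The consistency check on an exported log can also be scripted. The Python sketch below assumes the Excel log has been saved as CSV and uses hypothetical column names ("Near Dup Group", "Title", "Privileged"); the actual export columns may differ, so adjust the names to match the file.

```python
import csv
import io
from collections import defaultdict


def inconsistent_groups(rows, group_col, coding_col):
    """Return groups whose members carry more than one coding value.

    rows: iterable of dicts, e.g. from csv.DictReader.
    group_col / coding_col: column names (hypothetical; match your export).
    Rows with an empty group value are skipped.
    """
    values = defaultdict(set)
    for row in rows:
        group = row[group_col].strip()
        if group:
            values[group].add(row[coding_col].strip())
    return {g: sorted(v) for g, v in values.items() if len(v) > 1}


# Made-up sample data standing in for an exported log saved as CSV.
SAMPLE = """Near Dup Group,Title,Privileged
3472,Contract v1.docx,Yes
3472,Contract v2.docx,No
3473,Memo.pdf,No
3473,Memo scan.pdf,No
,Standalone.txt,No
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = inconsistent_groups(rows, "Near Dup Group", "Privileged")
print(result)
```

Here group 3472 is flagged because one version is coded privileged and the other is not, which is the kind of inconsistency worth reviewing before a production goes out.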

Manually Identify and Review Near Duplicates

From the Browse or Search pages, show only Near Dup Group.  Go to Field->Show Field Section->Built-in Doc Fields.

Select the Original Title, Ext, Pages, Words, Size, and IsEmailAttachment columns. 

Performing Linear Comparison of Near Duplicates

Lexbe has a built-in comparison tool located in the Doc Viewer.  For information on how to use this feature, see Document Comparison Tool.

Risks of Deleting Email Attachments

As a best practice, do not delete near duplicates marked as IsEmailAttachment, so that users reviewing an email collection can establish how the content was distributed and who may have been sharing information. Often, emails (MSGs) created by different custodians will contain the same attachments.

An email attachment and a loose file that are near duplicates will not be in the same group. Generally, email attachments should not be deleted as near duplicates unless the entire email body is also deleted. LEP will not near-deduplicate email attachments from different email families.

See Mass Tagging Near Duplicates