Deduplication
General
Extra copies of Outlook email (PST and MSG), Outlook Express email (EML) and various loose native files such as Excel spreadsheets, Word files, etc., are often found in duplicate form in a case. Deduplication in the Lexbe eDiscovery Platform (LEP) identifies duplicate files which can then be removed from a case. This optimizes the review process by reducing review inconsistencies, errors, and review costs.
Deduplication Within or Between Custodian Collections
Deduplication within each custodian is called vertical deduplication. Deduplication across custodians is called horizontal deduplication. Association of other custodians to the horizontally deduped files is maintained in a custom doc field section titled Additional Custodians.
When LEP looks for vertical duplicates, it searches within a single custodian's documents. For example, suppose custodian-sender John sent an email with multiple attachments to several recipients. LEP looks within John's documents for exact duplicates and marks the email and its attachments (the entire family) together as a duplicate. If John's email is deleted, the attachments associated with the email are also deleted.
When LEP looks for horizontal duplicates, it runs through all custodians and unassigned documents in the case. In the above example, if any of the recipients of John's email are custodians in the case and the email is in their document collection, both email copies may be marked as duplicates.
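The difference between the two scopes can be sketched in a few lines of Python. This is only an illustration of the idea, not LEP's implementation: the document records, the `hash` key, and the `find_duplicates` function are hypothetical, and documents are assumed to be listed in upload order.

```python
from collections import defaultdict

def find_duplicates(docs, by_custodian):
    """Return (duplicate ids, additional custodians per prime document).

    docs: list of dicts with hypothetical keys 'id', 'custodian', 'hash',
    assumed to be in upload order.
    by_custodian=True sketches vertical deduplication (within a custodian);
    by_custodian=False sketches horizontal deduplication (across custodians).
    """
    seen = {}                                  # comparison key -> prime document
    duplicates = []
    additional_custodians = defaultdict(list)  # prime id -> other custodians
    for doc in docs:
        key = (doc["custodian"], doc["hash"]) if by_custodian else doc["hash"]
        if key in seen:
            duplicates.append(doc["id"])
            prime = seen[key]
            if doc["custodian"] != prime["custodian"]:
                additional_custodians[prime["id"]].append(doc["custodian"])
        else:
            seen[key] = doc
    return duplicates, dict(additional_custodians)

docs = [
    {"id": 1, "custodian": "John", "hash": "a1"},
    {"id": 2, "custodian": "John", "hash": "a1"},
    {"id": 3, "custodian": "Mary", "hash": "a1"},
]
vertical = find_duplicates(docs, by_custodian=True)
horizontal = find_duplicates(docs, by_custodian=False)
```

In this sketch the vertical pass flags only John's second copy, while the horizontal pass also flags Mary's copy and records Mary as an additional custodian of the prime document.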
How to Run Deduplication
1. Navigate to the Case tab and select Add Case Documents.
2. Click the white kebab (the three vertical dots) in the upper right corner of the blue banner.
3. To perform horizontal deduplication (across all custodians), select DeDuplicate.
4. To perform vertical deduplication (within each custodian), select Deduplicate by Custodian.
5. Selecting either option will display the Run Task dialog box. Click OK to run deduplication.
NOTE: This feature marks duplicates but does not delete them. The prime document is always the first one uploaded to LEP. This ensures that any version added to LEP at a later time is marked as the duplicate and does not replace the prime version, which may already have been coded.
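As a rough illustration of the "first uploaded is prime" rule, the sketch below flags later uploads without removing anything. The field names (`hash`, `uploaded`, `coding`, `is_duplicate`) and the function are assumptions made for the example, not LEP's data model.

```python
from datetime import datetime

def mark_duplicates(docs):
    """Flag later uploads as duplicates; no document is removed."""
    prime_by_hash = {}
    # Walk the documents in upload order so the earliest copy becomes the prime.
    for doc in sorted(docs, key=lambda d: d["uploaded"]):
        prime = prime_by_hash.setdefault(doc["hash"], doc)
        doc["is_duplicate"] = prime is not doc
    return docs

docs = [
    {"id": "B", "hash": "x9", "uploaded": datetime(2024, 3, 1)},
    {"id": "A", "hash": "x9", "uploaded": datetime(2024, 1, 15),
     "coding": "Responsive"},
]
mark_duplicates(docs)
```

Here document A, uploaded first and already coded, stays the prime, and the later upload B is the one flagged.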
How to Locate/View/Delete Duplicates
Accessing the Duplicates from the Browse Page
1. From the Browse page, expand the Filter section on the left and click SELECT FILTER.
2. Click in the Add Field text box and type "duplicates" and then select Duplicates from the dropdown.
3. The default "Document is a Duplicate" filter will be selected. Click OK to see the duplicates in the case.
4. To delete the duplicates, once they are displayed on the screen, click the "Select" dropdown and then Select All.
5. Once the duplicates have been selected, scroll to the bottom of the menu on the left and expand the Delete Docs section and select DELETED SELECTED DOCS.
NOTE: Only Account Administrators, Power Users and Standard Users Plus Add/Delete are able to delete documents (including duplicates) from a case.
Exporting Duplicate Document Report
To export a log of all duplicate documents before deleting them, filter to Document is a Duplicate as discussed above. Display on the screen any fields you want included in the export. Click Select All to select the documents, then use the Export Log to Excel function from the left side menu bar. Save the Excel file to your local machine. Below is an example of a final report:
Why Duplicates Exist
Duplicates occur because individual custodians of data may have duplicate copies of files or documents. This often occurs with email. During a collection of an existing custodian, multiple email stores may be collected to ensure a thorough collection and multiple versions of the same email may be collected. An example would be collecting a PST file from a laptop, while also collecting a Google Gmail account from the cloud.
Another example would be duplicates between custodians. If John Smith sends an email to Bill Jones, the email will be duplicated in a collection from each custodian. This can lead to dozens or more copies of the same file. See Email Threading for more information.
A third example is attachments. One Word file might be added as an attachment to multiple emails, resulting in duplication within and between custodians.
A fourth example is similar emails that are part of email chains. While similar, the copies are distinct and the minor differences may be important (e.g., "Yes, I agree" added to a 1,000-word email).
A fifth example is the same paper file scanned at different times or with different file formats. While the underlying document is the same, each will be a different electronic file. If scanned with different software, the OCR text may be different or paper versions may be slightly different -- one has handwriting on it (e.g., "Yes, I agree"), while the other does not. It is important to use caution when removing duplicates.
Types Of Duplicates
Exact (Hash) duplicates. Loose native files (not email attachments) are identified as exact duplicates when the MD5 hash value of the entire electronic file matches. Files with the same content are not necessarily exact (hash) duplicates. For example, if the same PDF appears twice in a case, once as an original scanned PDF without OCR and once with OCR applied (the addition of a text layer beneath the image), the two files are not exact electronic copies. The same is true for an identical document saved as a Word file versus a PDF. The hash values differ, so the files are not exact electronic copies.
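The hash comparison can be illustrated with Python's standard `hashlib` module. This is a minimal sketch of whole-file MD5 hashing in general, not LEP's code; the helper name and the sample byte strings are made up for the example.

```python
import hashlib
import io

def md5_of_stream(stream, chunk_size=65536):
    """Return the MD5 hex digest of everything readable from a binary stream."""
    digest = hashlib.md5()
    # Read in chunks so that large files do not need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Stand-ins for file contents: a scan, a byte-for-byte copy of it,
# and the "same" document after a text layer was added.
scanned_pdf = b"%PDF-1.4 image-only scan"
copy_of_scan = bytes(scanned_pdf)
ocr_pdf = scanned_pdf + b" plus OCR text layer"

hash_scan = md5_of_stream(io.BytesIO(scanned_pdf))
hash_copy = md5_of_stream(io.BytesIO(copy_of_scan))
hash_ocr = md5_of_stream(io.BytesIO(ocr_pdf))
```

The byte-for-byte copy produces the same digest as the scan, while the version with the added text layer does not, even though a reader would consider it the same document.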
Email Duplicates Based On Metadata. Emails are identified as duplicates using metadata from the email. LEP uses the following Outlook and Outlook Express email metadata fields for deduplication:
Sender
Recipients (including Cc and Bcc)
Email Subject Matter
Email Date & Time Sent
Corrupted emails sometimes have blank entries for one or more of the above fields. Therefore, all four fields must be present in order to deduplicate an email. Exact duplicate identification still retains documents that come from different custodians (if custodians are identified), which preserves custodian tracking and identity. Also, separate copies of attachments that belong to different email families are retained for email family integrity.
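One way to picture metadata-based matching is as a comparison key built from the four fields, with no key produced when any field is blank. The sketch below is an assumption-laden illustration: the dictionary keys, the function name, and the normalization (trimming, lowercasing addresses, sorting recipients) are choices made for the example and are not documented LEP behavior.

```python
def email_dedup_key(email):
    """Build a comparison key from the four metadata fields.

    Returns None when any field is blank, so the email is left out
    of deduplication.
    """
    sender = (email.get("sender") or "").strip().lower()
    recipients = [r.strip().lower() for r in (email.get("recipients") or [])
                  if r.strip()]
    subject = (email.get("subject") or "").strip()
    sent = (email.get("sent") or "").strip()
    if not (sender and recipients and subject and sent):
        return None
    # Sorting recipients is an assumption: it makes recipient order irrelevant.
    return (sender, tuple(sorted(recipients)), subject, sent)

first = {"sender": "john@example.com",
         "recipients": ["bill@example.com", "amy@example.com"],
         "subject": "Q3 forecast", "sent": "2023-05-02T14:31:00"}
second = {"sender": "John@Example.com",
          "recipients": ["amy@example.com", "bill@example.com"],
          "subject": "Q3 forecast", "sent": "2023-05-02T14:31:00"}
corrupted = {"sender": "john@example.com",
             "recipients": ["bill@example.com"],
             "subject": "", "sent": "2023-05-02T14:31:00"}
```

With these sample records, `first` and `second` yield the same key, and `corrupted` yields no key because its subject is blank.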
Near Duplicates. Near duplicates are files that are very similar but do not qualify as exact (hash) duplicates or email duplicates. These are the most difficult to identify and handle in automated processes, because it is difficult or impossible to determine automatically whether the files are close enough to be treated as duplicates. We recommend using near duplication and multi-doc editing to mass tag near-duplicate documents if desired. See Near Duplication for more information.
Still Finding Duplicates in the Case?
A file can be introduced into a document collection in multiple ways. Different custodians may attach the same document to an email and send it to different recipients, creating separate versions of the same file. During a document collection, those separate emails can be collected and introduced into a document collection and the same attachments are captured with each email. Those emails are kept in different families and deduplication will not eliminate copies in different email families.
Manually Identifying and Reviewing Duplicates
To expedite manual review and quality control, the user may group and review similar documents for batch coding, even though they do not qualify as exact (hash) duplicates. To accomplish this, files can be displayed and sorted by title, extension, size, or word count, looking for similarities that might indicate duplicates.
Here are the steps for this approach:
1. From the Browse or Search pages, expand the Layouts menu and select SELECT FIELDS.
2. Select the Original Title, Ext, Pages, Words, Size, and Attachment columns.
3. Sort on the various column headers and look for similarities that suggest duplicates. Do not delete a file unless you are sure it is a duplicate. Do not delete files as duplicates if they are an attachment to an email.
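The same comparison can be approximated outside the platform on an exported log. The sketch below groups rows that share an extension, size, and word count. The row keys are hypothetical stand-ins for the columns named above, and a match only marks candidates for manual review, not confirmed duplicates.

```python
from collections import defaultdict

def candidate_groups(rows):
    """Group document titles that share extension, size, and word count."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["ext"].lower(), row["size"], row["words"])
        groups[key].append(row["title"])
    # Keep only the groups with more than one document.
    return [titles for titles in groups.values() if len(titles) > 1]

rows = [
    {"title": "Lease draft.docx", "ext": "DOCX", "size": 48213, "words": 3120},
    {"title": "Lease draft (1).docx", "ext": "docx", "size": 48213, "words": 3120},
    {"title": "Budget.xlsx", "ext": "xlsx", "size": 10977, "words": 540},
]
candidates = candidate_groups(rows)
```

Only the two lease drafts end up in a group; the spreadsheet has no match and is left out.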
Duplicates in Loaded TIFF Productions
If an uploaded TIFF Production includes duplicates, these cannot be deduplicated. The TIFFs, separately generated, will hash differently and therefore will not be exact duplicates. Two other methods can be used to identify them:
NearDup+ service. Documents grouped will not necessarily be exact duplicates, but will be grouped for bulk review and coding. This service is provided by Professional Services. Contact your sales consultant for a quote.
Incoming TIFF load files with MD5 hash values. The user (or Professional Services) can group files in a custom field to allow for bulk review and coding.
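As a sketch of the second method, the snippet below reads a comma-delimited load file and assigns a shared group label to documents with the same hash value. That label could then be entered in a custom field. The column names `DocID` and `MD5`, the `DUP-` label format, and the sample data are all assumptions for illustration. Real load files vary in delimiter and field naming.

```python
import csv
import io

def group_by_hash(load_file_text):
    """Map each DocID to a group label shared by documents with the same MD5."""
    reader = csv.DictReader(io.StringIO(load_file_text))
    group_ids = {}    # normalized hash -> group label
    assignments = {}  # DocID -> group label
    for row in reader:
        md5 = row["MD5"].strip().lower()
        # A new hash gets the next label; a known hash reuses its label.
        label = group_ids.setdefault(md5, f"DUP-{len(group_ids) + 1:04d}")
        assignments[row["DocID"]] = label
    return assignments

sample = "DocID,MD5\nTIFF0001,ab12\nTIFF0002,cd34\nTIFF0003,AB12\n"
assignments = group_by_hash(sample)
```

In the sample, TIFF0001 and TIFF0003 receive the same label because their hash values match once case is ignored.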