Unicode and Foreign Language Support

Unicode support matters in litigation document management because document repositories may include documents containing non-English characters. Such text is typically encoded in Unicode, and a litigation database that does not support Unicode may be unable to display, index, or search foreign-language words. This document briefly describes Unicode in a litigation support context and explains how the Lexbe eDiscovery Platform (LEP) application supports foreign languages through Unicode.

Unicode

Unicode is a standard that allows text in any language to be encoded in a consistent way. Detailed information on the Unicode standard is maintained by the Unicode Consortium. Computers initially used an encoding scheme called ASCII to represent letters, but ASCII is English-centric and defines only 128 characters, too few to represent most non-English text. Unicode was developed to remedy this deficiency and can represent well over 100,000 distinct characters. Several encodings of Unicode are in use today, including UTF-8, UTF-16 and UTF-32. UTF-8 and UTF-16 are the most widely used.
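As a rough illustration of the difference between these encodings, the following Python sketch encodes a few sample words and reports how many bytes each encoding uses. The sample words and the helper names (encoded_sizes, fits_in_ascii) are chosen only for this example and have nothing to do with how LEP stores text.

```python
# Sample words: English, French (with an accented letter), Russian, Chinese.
SAMPLES = ["apple", "caf\u00e9", "яблоко", "苹果"]


def encoded_sizes(text):
    """Return the number of bytes needed to store text in each encoding."""
    return {
        # The "-le" variants omit the byte order mark, so only the
        # characters themselves are counted.
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def fits_in_ascii(text):
    """Return True if every character in text can be represented in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for word in SAMPLES:
    print(word, encoded_sizes(word), "ASCII-only:", fits_in_ascii(word))
```

Only the plain English word fits in ASCII. The same characters take a different number of bytes in each encoding, which is why software must know which encoding a document uses before it can display or index the text correctly.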

Language Packs

Languages written in Latin-based alphabets can be viewed and created in nearly any computer application without additional fonts. However, to type or display languages that use non-Latin or certain extended Latin-based alphabets, the user may first need to download and install additional foreign language fonts on the local computer. If characters appear as small rectangles, the required font is probably missing. Some versions of Microsoft Office include the Arial Unicode MS font, which covers a very wide range of the characters in the Unicode standard.

Unicode Support in LEP

LEP is primarily an English-language tool, but it partially supports Unicode, subject to limitations such as those described below:

Some languages, such as Chinese, Japanese and Thai, are written without spaces between words. All of the text in such a document runs together, and a language-specific dictionary is needed to find word breaks. LEP does not have the ability to identify word breaks in these documents.
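The word-break problem can be seen in a small Python sketch. It assumes a simple indexer that splits text on whitespace, which is only an illustrative assumption and not a description of LEP's actual indexing. The function name whitespace_tokens is hypothetical.

```python
english = "the apple is red"
chinese = "苹果是红色的"  # "The apple is red", written without spaces


def whitespace_tokens(text):
    """Split text into words wherever whitespace appears."""
    return text.split()


print(whitespace_tokens(english))  # four separate words
print(whitespace_tokens(chinese))  # the whole sentence comes back as one "word"

# The word for "apple" is present in the text, but it is not a separate token.
print("苹果" in chinese)
print("苹果" in whitespace_tokens(chinese))
```

Because the entire sentence is treated as a single token, a search for an individual word inside it will not match unless the tool uses a language-specific dictionary to find the word boundaries.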

In some languages, such as Arabic, the surrounding context for a word (my, your, the, a, masculine/feminine, etc.) can be expressed as characters attached to the front or back of the word. For example, "the apple" or "my apple" would not be two words each, but the single word "apple" with a different prefix or suffix attached. To search for text in these languages, add a * at the front and back of the word to pick up most of the variants, like this: *apple*.
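The effect of the leading and trailing * can be sketched in Python using the standard fnmatch module. The word list below uses English stand-ins (for example "theapple" and "myapple") to imitate attached prefixes and suffixes. The list, the function name wildcard_search, and the matching logic are illustrative assumptions and do not describe LEP's search engine.

```python
from fnmatch import fnmatchcase

# Stand-in index entries imitating a language that attaches
# prefixes and suffixes directly to the word.
indexed_words = ["apple", "theapple", "myapple", "apples", "pineapple", "orange"]


def wildcard_search(pattern, words):
    """Return the words that match a pattern in which * stands for any characters."""
    return [word for word in words if fnmatchcase(word, pattern)]


exact = wildcard_search("apple", indexed_words)
broad = wildcard_search("*apple*", indexed_words)

print(exact)  # only the bare word
print(broad)  # the bare word plus every entry that contains it
```

In this sketch the wildcard search also returns "pineapple", so a broad pattern can bring back unrelated words that happen to contain the search term, and the results may need review.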

The above discussion provides examples only. There may be limitations other than the ones described above.

If you have questions, please contact Professional Services.