Unicode and Foreign Language Support

Unicode support is important in litigation document management because document repositories may include documents containing non-English characters. Such text is typically encoded in Unicode, and a litigation database that does not support Unicode may be unable to display, index or search foreign-language words. This document briefly describes Unicode in a litigation support context and then describes the foreign language support that the Lexbe eDiscovery Platform (LEP) application provides through Unicode.

Unicode

Unicode is a specification that allows text in any language to be encoded in a consistent way. Detailed information on the Unicode specification is maintained by the Unicode Consortium. Computers initially used an encoding scheme called ASCII to represent letters, but ASCII is English-centric: it defines only 128 characters, which is too few to represent most non-English text. Unicode was developed to remedy this deficiency. It defines more than a million code points, of which well over 100,000 are assigned to characters. Several encodings of Unicode are in use today, including UTF-8, UTF-16 and UTF-32. UTF-8 and UTF-16 are the most widely used.
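The difference between the encodings can be seen by encoding the same characters in each one. The following is a minimal Python sketch for illustration only; the function names and sample characters are chosen here and have nothing to do with LEP.

```python
def encoded_sizes(ch):
    """Return the number of bytes needed for ch in each encoding (no byte order mark)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


def fits_in_ascii(ch):
    """Return True if ch can be represented in ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Sample characters: Latin letter, accented Latin letter, Chinese character, emoji.
for ch in ["A", "é", "中", "😀"]:
    print(f"U+{ord(ch):04X}", fits_in_ascii(ch), encoded_sizes(ch))
```

Only "A" fits in ASCII. UTF-8 uses between one and four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four.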

Language Packs

Languages written in Latin-based alphabets can be viewed and created in nearly any computer application without any additional fonts. However, to type or display languages that use non-Latin or certain extended Latin-based alphabets, the user may first need to download and install additional fonts on the local computer. If characters appear as small rectangles, the required font is probably missing. Some versions of Microsoft Office include the Arial Unicode MS font, which covers a broad range of the characters in the Unicode standard.

Unicode Support in LEP

LEP is primarily an English-language tool, but it does partially support Unicode as described below:

  • LEP supports the UTF-8 and UTF-16 encodings of Unicode, but not UTF-32.

  • Unicode support in LEP means that LEP can index and search documents containing Unicode-encoded data. LEP can also display most Unicode text in the LEP document browser, provided the required fonts are installed on the client system.

  • LEP can automatically recognize Unicode data in Microsoft Word, Excel, and PowerPoint files.

  • An HTML or XML file can include Unicode data if the file uses the UTF-8 encoding. LEP can index and search Unicode data in UTF-8 encoded HTML files and can also recognize many other HTML encodings.

  • WordPerfect files use the WordPerfect Character Set to express non-English text. LEP converts WordPerfect Character Set data to Unicode for indexing, so non-English text in WordPerfect files is supported.

  • LEP can index and search Unicode characters in some, but not all, PDF files, depending on how the PDF file was created.

  • LEP's concept-search functionality is supported for the English language only.

  • Text in Chinese, Japanese, and Korean can be stored in, or converted to, Unicode, so LEP can search for words in these languages just as it can search for words in other languages. However, while LEP can search for literal word matches (or wildcard or fuzzy matches), there are some limitations on the support in LEP for Chinese, Japanese, and Korean text, described below.

Some documents store text in a way that does not separate the words with spaces. Instead, all of the text in a document is run together and a language-specific dictionary is needed to find word breaks. LEP does not have the ability to identify word breaks in these documents.
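The word-break problem can be illustrated with a short Python sketch. It shows the general issue with whitespace-based tokenizing and is not a description of how LEP indexes text; the function names and sample sentences are hypothetical.

```python
def whitespace_tokens(text):
    """Split text into tokens on whitespace only, with no dictionary."""
    return text.split()


def token_hit(term, text):
    """Return True if term matches a whole whitespace-delimited token."""
    return term in whitespace_tokens(text)


english = "the contract was signed"
chinese = "合同已经签署"  # roughly "the contract has been signed", written without spaces

print(whitespace_tokens(english))    # four separate tokens
print(whitespace_tokens(chinese))    # the whole sentence as a single token
print(token_hit("signed", english))  # True
print(token_hit("签署", chinese))     # False: "签署" is buried inside one long token
print("签署" in chinese)              # True: a substring or wildcard match still finds it
```

Without a dictionary to split the Chinese sentence into words, a whole-word search for an individual word fails, while a substring-style match succeeds.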

In some languages, such as Arabic, the surrounding context for a word (my, your, the, a, masculine/feminine, etc.) can be expressed as characters added in front of or behind the word. For example, "the apple" or "my apple" would not be two words but would be different prefixes or suffixes added to "apple." To search for text in these languages, adding a * in the front and back of the word will pick up most of the variants, like this: *apple*.
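The effect of the surrounding wildcards can be sketched in Python. This example emulates wildcard matching with the standard fnmatch module as a stand-in; it is an assumption that this resembles LEP's wildcard behavior, and the function name and sample words are chosen here for illustration.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, word):
    """Return True if word matches a glob-style pattern, where * matches any run of characters."""
    return fnmatchcase(word, pattern)


apple = "تفاحة"        # "apple"
the_apple = "التفاحة"  # "the apple": the definite article is attached as a prefix
my_apple = "تفاحتي"    # "my apple": the possessive is attached as a suffix

# The bare word matches only itself.
print(wildcard_match(apple, the_apple))              # False

# Wildcards on both sides pick up the prefixed form.
print(wildcard_match("*" + apple + "*", the_apple))  # True

# The suffixed form changes the last letter of the word, so this pattern misses it.
print(wildcard_match("*" + apple + "*", my_apple))   # False

# A shorter stem with wildcards matches all three forms.
stem = "تفاح"
print(all(wildcard_match("*" + stem + "*", w) for w in (apple, the_apple, my_apple)))  # True
```

The third case shows why the wildcards pick up most, but not all, of the variants: when a suffix alters the spelling of the word itself, searching on a shorter stem may be needed.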

The above discussion provides examples only. There may be limitations other than the ones described above.

If you have questions, please contact Professional Services.