Lotus Multi-Byte Character Set

The Lotus Multi-Byte Character Set (LMBCS) is a proprietary multi-byte character encoding originally conceived in 1988 at Lotus Development Corporation with input from Bob Balaban and others. [1] Created around the same time as Unicode and addressing some of the same problems, LMBCS can be viewed as a parallel development and a possible alternative to it. [1] For maximum compatibility, later versions of LMBCS incorporate UTF-16 as a subset. [2] [3]

Commercially, LMBCS was first introduced as the default character set of Lotus 1-2-3 Release 3 for DOS in March 1989 [1] [4] and of Lotus 1-2-3/G Release 1 for OS/2 [1] in 1990, replacing the 8-bit Lotus International Character Set (LICS) and ASCII used in earlier DOS-only versions of Lotus 1-2-3 and Symphony. [5] LMBCS is also used in IBM/Lotus SmartSuite, Notes and Domino, [1] as well as in a number of third-party products.

LMBCS encodes the characters required for languages using the Latin, [6] Arabic, Hebrew, Greek and Cyrillic [6] scripts, the Thai, Chinese, Japanese [6] and Korean writing systems, and technical symbols.

Encodings

Technically, LMBCS is a lead-byte encoding where code point 00hex as well as code points 20hex (32) to 7Fhex (127) are identical to ASCII [1] (as well as to LICS). [5]

Code point 00hex is always treated as the NUL character to ensure maximum compatibility with existing software libraries that handle null-terminated strings [1] in programming languages such as C. [a] This applies even to the UTF-16BE codes, where code words of the form xx00hex are mapped to private-use codes of the form F6xxhex during encoding in order to avoid NUL bytes, [7] and to escaped control characters, where 20hex is added to the C0 (but not C1) control characters following the 0Fhex lead byte. [7]
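The two NUL-avoidance rules described above can be sketched in a few lines of Python. This is only an illustration of the description, not the code of any Lotus or IBM product: the function and constant names are hypothetical, the Unicode group byte 14hex is taken from the group table in this article, and the sketch deliberately covers only code units from 0100hex upward and does not address how a genuine private-use character F6xxhex would be distinguished from an escaped one.

```python
UNICODE_GROUP = 0x14  # group byte for UTF-16 code units (per the group table)
CONTROL_GROUP = 0x0F  # lead byte for remapped control codes
ZERO_ESCAPE = 0xF6    # high byte of the private-use form F6xx


def encode_utf16_unit(unit: int) -> bytes:
    """Encode one UTF-16BE code unit (0x0100..0xFFFF) as a 3-byte sequence."""
    if not 0x0100 <= unit <= 0xFFFF:
        raise ValueError("this sketch only covers code units 0x0100..0xFFFF")
    high, low = unit >> 8, unit & 0xFF
    if low == 0x00:
        # xx00 would place a NUL byte in the string, so it is stored as F6xx.
        return bytes([UNICODE_GROUP, ZERO_ESCAPE, high])
    return bytes([UNICODE_GROUP, high, low])


def encode_control(code: int) -> bytes:
    """Escape a C0 (01..1F) or C1 (80..9F) control code after lead byte 0F."""
    if 0x01 <= code <= 0x1F:
        return bytes([CONTROL_GROUP, code + 0x20])  # C0: shifted up by 20hex
    if 0x80 <= code <= 0x9F:
        return bytes([CONTROL_GROUP, code])         # C1: stored unchanged
    raise ValueError("not a C0 or C1 control code")
```

For instance, `encode_utf16_unit(0x4E00)` yields the bytes 14 F6 4E, and `encode_control(0x1B)` yields 0F 3B; neither result contains a 00 byte, so a C-style string routine would not stop early.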

Code points 01hex to 1Fhex, which serve as control codes in ASCII, are used as lead bytes to switch the definition of code points above 7Fhex between several code groups (similar to code pages) and at the same time determine either a single- or multi-byte nature for the corresponding code group. [1]

For example, code group 1 (with group byte 01hex) [1] is almost identical to the SBCS code page 850, whereas code group 16 (with group byte 10hex) [1] is similar to the Japanese MBCS code page 932. Multi-byte characters can thus occupy two or three bytes. [7] [6]

In canonical LMBCS, each character starts with its group byte. [1] To reduce the length, optimized (or compressed) LMBCS allows a default code group, the optimization group code, to be defined on a per-application or per-process basis (ideally chosen according to the highest likelihood of occurrence). [1] This default must be communicated to the interpreting code in some way (for example, by specifying the corresponding "LMBCS-n" name). [8] The group byte can then be omitted for characters of that group. [1] Lotus 1-2-3 retrieves the optimization group code from the file header of the corresponding source file, [7] whereas for Lotus Notes the optimization group code is always fixed to 01hex. [2] [7]
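The difference between the canonical and the optimized form can be illustrated with a small sketch. It assumes characters are given as hypothetical `(group, byte)` pairs from single-byte code groups with byte values of 80hex or above; the function names are invented for this example, and handling of ASCII and multi-byte groups is left out.

```python
def _check(group: int, byte: int) -> None:
    # The sketch is limited to high-half bytes of explicit code groups.
    if not 0x01 <= group <= 0x1F or not 0x80 <= byte <= 0xFF:
        raise ValueError("expected a group in 01..1F and a byte in 80..FF")


def encode_canonical(chars) -> bytes:
    """Canonical form: every character carries its own group byte."""
    out = bytearray()
    for group, byte in chars:
        _check(group, byte)
        out += bytes([group, byte])
    return bytes(out)


def encode_optimized(chars, default_group: int) -> bytes:
    """Optimized form: the group byte is dropped for the default group."""
    out = bytearray()
    for group, byte in chars:
        _check(group, byte)
        if group != default_group:
            out.append(group)  # characters of other groups keep their prefix
        out.append(byte)
    return bytes(out)


# Two code page 850 characters (group 01: 0x82 is é, 0x8A is è) followed by
# one byte from group 02.
text = [(0x01, 0x82), (0x01, 0x8A), (0x02, 0x98)]
canonical = encode_canonical(text)          # 01 82 01 8A 02 98
optimized = encode_optimized(text, 0x01)    # 82 8A 02 98
```

With group 01hex as the default (the fixed choice in Lotus Notes), the six canonical bytes shrink to four; the saving grows with the share of characters that belong to the default group.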

Default      Group   Bytes   Description
N/A          00hex   1 [7]   NUL
LMBCS-1      01hex   2 [7]   Code page 850 (DOS Latin-1) [2] [7]
LMBCS-2      02hex   2 [7]   Code page 851 (DOS Greek) [2] [7]
LMBCS-3      03hex   2 [7]   Code page 1255 (Windows Hebrew) [2] [7]
LMBCS-4      04hex   2 [7]   Code page 1256 (Windows Arabic) [2] [7]
LMBCS-5      05hex   2 [7]   Code page 1251 (Windows Cyrillic) [2] [7]
LMBCS-6      06hex   2 [7]   Code page 852 (DOS Latin-2) [2] [7]
N/A          07hex   1 [7]   BEL [2]
LMBCS-8      08hex   2 [7]   Code page 1254 (Windows Turkish) [2] [9] [7]
N/A          09hex   1 [7]   TAB [2] [9] [7]
N/A          0Ahex   1 [7]   LF [2] [9] [7]
LMBCS-11     0Bhex   2 [7]   Code page 874 (Thai) [9] [7]
(LMBCS-12)   0Chex   2 [7]   Reserved [2]
N/A          0Dhex   1 [7]   CR [2] [9] [7]
(LMBCS-14)   0Ehex   2 [7]   Reserved [2]
(LMBCS-15)   0Fhex   2 [7]   Remapped C0/C1 control codes [7]
LMBCS-16     10hex   3 [7]   Code page 932/ [2] 943 [7] (Japanese / Shift-JIS) [2] [9]
LMBCS-17     11hex   3 [7]   Code page 949/ [2] 1261 [7] (Korean) [2] [9]
LMBCS-18     12hex   3 [7]   Code page 950 [2] [7] (Traditional Chinese / Taiwan / Big5) [2] [9]
LMBCS-19     13hex   3 [7]   Code page 936/ [2] 1386 [7] (Simplified Chinese) [2] [9]
(LMBCS-20)   14hex   3 [7]   UTF-16 (Unicode) [2] [3] [7]
N/A          15hex   3       Reserved [2]
N/A          16hex   3       Reserved [2]
N/A          17hex   3       Reserved [2]
N/A          18hex   3       Reserved [2]
N/A          19hex   1 [7]   Lotus 1-2-3 system range [9] [7]
N/A          1Ahex   3       Reserved [2]
N/A          1Bhex   3       Reserved [2]
N/A          1Chex   3       Reserved [2]
N/A          1Dhex   3       Reserved [2]
N/A          1Ehex   3       Reserved [2]
N/A          1Fhex   3       Reserved [2]
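The byte counts in the table are enough to split an LMBCS byte string into individual characters. The following Python sketch does so under simplifying assumptions that are not taken from the sources: the optimization group is a single-byte group (as in Lotus Notes), every explicit group uses exactly the length listed in the table, and the names `char_length` and `split_characters` are hypothetical. It only separates characters and does not map them to Unicode.

```python
# Lead bytes below 20hex that the table lists with a total length of 1 byte.
SINGLE_BYTE_LEADS = {0x00, 0x07, 0x09, 0x0A, 0x0D, 0x19}


def char_length(lead: int) -> int:
    """Total bytes (group byte included) for a lead byte below 20hex."""
    if lead in SINGLE_BYTE_LEADS:
        return 1
    return 2 if lead <= 0x0F else 3


def split_characters(data: bytes, default_group: int = 0x01):
    """Split LMBCS bytes into (group, payload) pairs.

    group is None for bytes that stand alone (ASCII, NUL, TAB, LF, ...).
    """
    if not 0x01 <= default_group <= 0x0F or char_length(default_group) != 2:
        raise ValueError("this sketch needs a single-byte default group")
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >= 0x80:
            # No group byte present: the optimization group is implied.
            result.append((default_group, data[i:i + 1]))
            i += 1
        elif lead >= 0x20 or lead in SINGLE_BYTE_LEADS:
            result.append((None, data[i:i + 1]))
            i += 1
        else:
            n = char_length(lead)
            chunk = data[i:i + n]
            if len(chunk) < n:
                raise ValueError("truncated character at offset %d" % i)
            result.append((lead, chunk[1:]))
            i += n
    return result


sample = b"Caf\x82\x02\x98\x14\x4E\x2D\n"
parts = split_characters(sample)
```

In the sample, the byte 82hex is attributed to the default group 01hex, the pair 02 98 to group 02hex, and the triple 14 4E 2D to the Unicode group 14hex, while the ASCII letters and the line feed stand alone.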

Character set

Without a prefix byte, the code points 32 (20hex) to 127 (7Fhex) are interpreted as follows (corresponding to LMBCS codes 32 to 127):

Single byte codes (ASCII/ISO-646-US [10] )
   0 1 2 3 4 5 6 7 8 9 A B C D E F
2x SP ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~ DEL/

Group 1

LMBCS group 1 code points 128 (80hex) to 255 (FFhex) are identical to the corresponding code points in code page 850 (DOS Latin-1), whereas code points 1 (01hex) to 127 (7Fhex) are defined according to the following exception list (corresponding to LMBCS codes 256 to 383):

LMBCS Group 1, lower half [11] [10]
0123456789ABCDEF
0x NUL
1x §
2x ¨ ~ ˚ ^ ` ´ ' - [b] [c] [c]
3x ¨ [d] ~ [d] ˚ [d] ^ [d] ` [d] ´ [d] nbsp [c] [c]
4x Œ œ Ÿ ˙ [c] ˚ [c] [d] [c] [c] [c] [c] [c]
5x
6x ij IJ ʼn ŀ Ŀ ¯ [c] ˘ [c] ˝ [c] ˛ [c] ˇ [c] ~ [c] [d] ^ [c] [d]
7x Ħ [c] ħ [c] Ŧ [c] ŧ [c] Ŋ [c] ŋ [c] ĸ [c] Kr [e]
  Mapped to a Unicode private use character
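The decimal "LMBCS codes" quoted in this article (32 to 127 for the prefix-less range, 256 to 383 for the lower half of group 1, and 0 to 511 for groups 0 and 1 together) are consistent with a simple numbering of group times 256 plus byte value. The sketch below spells out that arithmetic; the formula is inferred from those figures rather than quoted from the Lotus documentation, "group 0" stands for characters without a prefix byte, and the function names are hypothetical.

```python
def lmbcs_code(group: int, byte: int) -> int:
    """Decimal LMBCS code for a byte value within a code group."""
    if not 0 <= group <= 0x1F or not 0 <= byte <= 0xFF:
        raise ValueError("group must be 0..31 and byte 0..255")
    return group * 256 + byte


def split_code(code: int):
    """Inverse of lmbcs_code: return the (group, byte) pair."""
    return divmod(code, 256)
```

Under this assumption, byte 7Fhex in group 1 has the code 383 and byte FFhex in group 1 has the code 511, which matches the upper bounds given for the group 1 lower half and for the range supported by early Lotus 1-2-3 releases.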

Group 2

LMBCS group 2 code points 128 (80hex) to 255 (FFhex) are identical to the corresponding code points in code page 851 (DOS Greek), whereas code points 1 (01hex) to 127 (7Fhex) are defined according to the following exception list: [f]

LMBCS Group 2, lower half [11]
0123456789ABCDEF
0x NUL ͺ ΅ Ϊ Ϋ ΄ ʼ ʽ
1x
2x
3x
4x
5x
6x φ
7x
  Mapped to a Unicode private use character

Group 6

LMBCS group 6 code points 128 (80hex) to 255 (FFhex) are identical to the corresponding code points in code page 852 (DOS Latin-2), whereas code points 1 (01hex) to 127 (7Fhex) are defined according to the following exception list: [f]

LMBCS Group 6, lower half [11]
0123456789ABCDEF
0x NUL ā Ĉ ĉ Ċ ċ Ē ē Ė ė Ĝ ĝ Ġ ġ Ģ ģ
1x Ĥ ĥ Ĩ ĩ Ī ī Į į Ĵ ĵ Ķ ķ Ļ ļ Ņ ņ
2x Ō ō Ŗ ŗ Ŝ ŝ Ũ ũ Ū ū Ŭ ŭ Ų ų Ā
3x
4x
5x
6x
7x

Notes

  a. Lotus 1-2-3 Release 3.0 for DOS and newer versions are written in C.
  b. (U+2010), (U+2011), (U+2012), (U+2013)
  c. According to the documentation, this code point is not supported by Lotus 1-2-3 Release 3.1+ for DOS and OS/2 and earlier.
  d. For compatibility with Lotus 1-2-3 Release 5.0.
  e. Unicode does not define a glyph for the crown currency symbol (Krone aka "Kr"), therefore this points to F8FBhex in the Unicode Private Use Area (PUA).
  f. According to the documentation, code points 1 to 127 in this group are not supported by Lotus 1-2-3 Release 3.1+ for DOS and OS/2 and earlier. These versions only supported LMBCS code points 0 to 511, covering groups 0 and 1 only.

References

  1. Balaban, Bob (2001). "Multi-Language Character Sets – What They Are, How To Use Them" (PDF). Looseleaf Software, Inc. Archived (PDF) from the original on 2016-11-25. Retrieved 2016-11-25.
  2. "Appendix A. Encoding Schemes". IBM Character Data Representation Architecture. IBM (CDRA). Lotus Multi-byte Character Set (LMBCS). Archived from the original on 2016-11-26. Retrieved 2016-11-26. For optimization purposes, the group byte is omitted in Notes for single-byte values between X'20' and X'FF'. For example, LMBCS is always optimized to group 0x01, which means that any character where the first byte is greater than 0x1F, has an implicit group byte of 0x01.
  3. Scherer, Markus; Murray, Brendan (2000-06-02). "Re: MS Excel, Lotus 123 & Unicode". Archived from the original on 2016-12-06. Retrieved 2016-12-06.
  4. "Kapitel 4. Kompatibilität mit anderen 1-2-3 Versionen – Zeichensätze" [Chapter 4. Compatibility with other 1-2-3 Versions – Character Sets]. Lotus 1-2-3 Version 3.1 Upgrader's Handbuch [Upgrader's handbook] (in German) (1 ed.). Cambridge, MA, USA: Lotus Development Corporation. 1989. pp. 4-10–4-11. 302173.
  5. Kamenz, Alfred; Vonhoegen, Helmut (1992). Das große Buch zu Lotus 1-2-3 für DOS (in German) (1 ed.). Data Becker. pp. 131–132, 357–358. ISBN 3-89011-375-3.
  6. Lotus – Inside Notes – The Architecture of Notes and the Domino Server (PDF). Lotus Development Corporation. 2000. Archived (PDF) from the original on 2016-12-12. Retrieved 2016-12-12. […] Notes uses a single character set, the Lotus Multibyte Character Set (LMBCS), to encode all text data used internally by its programs. Whenever Notes first inputs text encoded in a character set other than LMBCS, it translates the text into a LMBCS string, and whenever it must output text in a character set other than LMBCS, it translates the internal LMBCS string into the appropriate character set. Because all text is internally formatted by LMBCS, all text-processing operations […] are done in only one way. LMBCS uses up to three bytes in memory to represent a single text character […]
  7. Murray, Brendan; Snyder-Grant, Jim, eds. (2016) [2000-02-09]. "ucnv_lmb.c". International Components for Unicode. International Business Machines (IBM).
  8. Batutis, Edward J. (2001-11-03). "Re: converter types". International Components for Unicode (ICU). Archived from the original on 2016-12-06. Retrieved 2016-12-06.
  9. "LMBCS" (in Japanese). 2009-02-03. Archived from the original on 2016-11-26. Retrieved 2016-11-26.
  10. "Anhang 2. Der Lotus Multibyte Zeichensatz (LMBCS)" [Appendix 2. The Lotus Multibyte Character Set (LMBCS)]. Lotus 1-2-3 Version 3.1 Referenzhandbuch [Lotus 1-2-3 Version 3.1 Reference Manual] (in German) (1 ed.). Cambridge, MA, USA: Lotus Development Corporation. 1989. pp. A2-1–A2-13. 302168.
  11. "lmb-excp.ucm". GitHub. 2000-02-10.
