Lotus Multi-Byte Character Set

The Lotus Multi-Byte Character Set (LMBCS) is a proprietary multi-byte character encoding originally conceived in 1988 at Lotus Development Corporation with input from Bob Balaban and others. [1] Created around the same time as Unicode and addressing some of the same problems, LMBCS can be viewed as a parallel development and a possible alternative to it. [1] For maximum compatibility, later versions of LMBCS incorporate UTF-16 as a subset. [2] [3]

Commercially, LMBCS was first introduced as the default character set of Lotus 1-2-3 Release 3 for DOS in March 1989 [1] [4] and of Lotus 1-2-3/G Release 1 for OS/2 [1] in 1990, replacing the 8-bit Lotus International Character Set (LICS) and ASCII used in earlier DOS-only versions of Lotus 1-2-3 and Symphony. [5] LMBCS is also used in IBM/Lotus SmartSuite, Notes and Domino, [1] as well as in a number of third-party products.

LMBCS encodes the characters required for languages using the Latin, [6] Arabic, Hebrew, Greek and Cyrillic [6] scripts, the Thai, Chinese, Japanese [6] and Korean writing systems, and technical symbols.

Encodings

Technically, LMBCS is a lead-byte encoding where code point 00hex as well as code points 20hex (32) to 7Fhex (127) are identical to ASCII [1] (as well as to LICS). [5]

Code point 00hex is always treated as the NUL character to ensure maximum compatibility with existing software libraries that deal with null-terminated strings [1] in programming languages such as C. [lower-alpha 1] This applies even to the UTF-16BE codes, where code words of the form xx00hex are mapped to private-use codes of the form F6xxhex during encoding in order to avoid NUL bytes, [7] and to escaped control characters, where 20hex is added to the C0 (but not C1) control characters following the 0Fhex lead byte. [7]
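The two NUL-avoidance rules described above can be illustrated with a minimal Python sketch. It is not taken from any Lotus or IBM implementation: the function names are hypothetical, the group byte 14hex for UTF-16 comes from the table of group bytes, and the sketch assumes that the two payload bytes follow the group byte in big-endian order. Code units below 0100hex are rejected here, because their high byte would itself be a NUL; how real implementations handle them is not covered.

```python
UNICODE_GROUP = 0x14    # group byte for UTF-16, per the group byte table
CONTROL_GROUP = 0x0F    # lead byte for remapped control codes
NUL_ESCAPE_HIGH = 0xF6  # high byte of the private-use codes F6xx


def encode_utf16_unit(unit: int) -> bytes:
    """Encode one 16-bit code unit as group byte + two payload bytes (assumed layout)."""
    if not 0x0100 <= unit <= 0xFFFF:
        raise ValueError("sketch only handles code units 0100hex to FFFFhex")
    high, low = unit >> 8, unit & 0xFF
    if low == 0x00:
        # xx00 -> F6xx, so that no payload byte is NUL
        high, low = NUL_ESCAPE_HIGH, high
    return bytes([UNICODE_GROUP, high, low])


def decode_utf16_unit(data: bytes) -> int:
    """Reverse of encode_utf16_unit for a single three-byte character."""
    group, high, low = data  # raises ValueError unless exactly three bytes
    if group != UNICODE_GROUP:
        raise ValueError("not a UTF-16 group character")
    if high == NUL_ESCAPE_HIGH:
        return low << 8      # F6xx -> xx00
    return (high << 8) | low


def escape_c0_control(code: int) -> bytes:
    """Escape a C0 control code (01hex to 1Fhex) by adding 20hex after the 0Fhex lead byte."""
    if not 0x01 <= code <= 0x1F:
        raise ValueError("sketch only handles C0 controls 01hex to 1Fhex")
    return bytes([CONTROL_GROUP, code + 0x20])
```

Under these assumptions, the code unit 4E00hex would be stored as the bytes 14 F6 4E, whereas 4E2Dhex would be stored unchanged as 14 4E 2D.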

Code points 01hex to 1Fhex, which serve as control codes in ASCII, are used as lead bytes to switch the definition of code points above 7Fhex between several code groups (similar to code pages) and at the same time determine either a single- or multi-byte nature for the corresponding code group. [1]

For example, code group 1 (with group byte 01hex) [1] is almost identical to the SBCS code page 850, whereas code group 16 (with group byte 10hex) [1] is similar to the Japanese MBCS code page 932. Multi-byte characters can thus occupy two or three bytes. [7] [6]

In canonical LMBCS, each character starts with its group byte. [1] To reduce the length, in optimized or compressed LMBCS a default code group or optimization group code can be defined on a per-application or per-process basis (ideally chosen according to the highest likelihood of occurrence) [1] and must be communicated to the interpreting code in some way (for example, by specifying the corresponding "LMBCS-n" name). [8] The group byte can then be omitted for characters of that group. [1] Lotus 1-2-3 retrieves the optimization group code from the file header of the corresponding source file, [7] whereas for Lotus Notes the optimization group code is always fixed to 01hex. [2] [7]
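The effect of an optimization group can be sketched in Python as follows. This is an illustration only, with a hypothetical function name; it takes characters that are already split into their canonical byte sequences. As a loudly stated assumption, the group byte is dropped only when the byte that follows it is 80hex or higher, so that the shortened character cannot be confused with the ASCII range 20hex to 7Fhex.

```python
from typing import Iterable


def optimize(chars: Iterable[bytes], default_group: int) -> bytes:
    """Concatenate canonical characters, omitting the group byte of the default group.

    Assumption of this sketch: the group byte is only omitted when the
    following byte is >= 0x80, so plain ASCII bytes stay unambiguous.
    """
    out = bytearray()
    for ch in chars:
        if len(ch) >= 2 and ch[0] == default_group and ch[1] >= 0x80:
            out += ch[1:]   # group byte is implied by the optimization group
        else:
            out += ch       # ASCII, control codes and other groups stay as they are
    return bytes(out)


# "A", one group 1 character and one group 2 character, optimized for group 01hex
canonical = [b"A", bytes([0x01, 0x82]), bytes([0x02, 0x98])]
optimized = optimize(canonical, default_group=0x01)
```

With group 01hex as the default, only the group 1 character loses its lead byte; the group 2 character still needs its group byte 02hex.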

Default     Group  Bytes  Description
N/A         00hex  1 [7]  NUL
LMBCS-1     01hex  2 [7]  Code page 850 (DOS Latin-1) [2] [7]
LMBCS-2     02hex  2 [7]  Code page 851 (DOS Greek) [2] [7]
LMBCS-3     03hex  2 [7]  Code page 1255 (Windows Hebrew) [2] [7]
LMBCS-4     04hex  2 [7]  Code page 1256 (Windows Arabic) [2] [7]
LMBCS-5     05hex  2 [7]  Code page 1251 (Windows Cyrillic) [2] [7]
LMBCS-6     06hex  2 [7]  Code page 852 (DOS Latin-2) [2] [7]
N/A         07hex  1 [7]  BEL [2]
LMBCS-8     08hex  2 [7]  Code page 1254 (Windows Turkish) [2] [9] [7]
N/A         09hex  1 [7]  TAB [2] [9] [7]
N/A         0Ahex  1 [7]  LF [2] [9] [7]
LMBCS-11    0Bhex  2 [7]  Code page 874 (Thai) [9] [7]
(LMBCS-12)  0Chex  2 [7]  Reserved [2]
N/A         0Dhex  1 [7]  CR [2] [9] [7]
(LMBCS-14)  0Ehex  2 [7]  Reserved [2]
(LMBCS-15)  0Fhex  2 [7]  Remapped C0/C1 control codes [7]
LMBCS-16    10hex  3 [7]  Code page 932/ [2] 943 [7] (Japanese / Shift-JIS) [2] [9]
LMBCS-17    11hex  3 [7]  Code page 949/ [2] 1261 [7] (Korean) [2] [9]
LMBCS-18    12hex  3 [7]  Code page 950 [2] [7] (Traditional Chinese / Taiwan / Big5) [2] [9]
LMBCS-19    13hex  3 [7]  Code page 936/ [2] 1386 [7] (Simplified Chinese) [2] [9]
(LMBCS-20)  14hex  3 [7]  UTF-16 (Unicode) [2] [3] [7]
N/A         15hex  3      Reserved [2]
N/A         16hex  3      Reserved [2]
N/A         17hex  3      Reserved [2]
N/A         18hex  3      Reserved [2]
N/A         19hex  1 [7]  Lotus 1-2-3 system range [9] [7]
N/A         1Ahex  3      Reserved [2]
N/A         1Bhex  3      Reserved [2]
N/A         1Chex  3      Reserved [2]
N/A         1Dhex  3      Reserved [2]
N/A         1Ehex  3      Reserved [2]
N/A         1Fhex  3      Reserved [2]
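Because the first byte of a canonical character determines its total length, a byte string can be split into characters with a simple lookup. The following Python sketch copies the byte counts from the table above; the names `CHAR_LENGTH` and `split_canonical` are hypothetical. It treats bytes 20hex to 7Fhex as single-byte ASCII characters and, as a simplification, rejects bytes from 80hex upwards that appear without a group byte, since interpreting them would require knowing the optimization group.

```python
from typing import List

# Total character length (including the group byte) per lead byte, per the table
CHAR_LENGTH = {g: 1 for g in (0x00, 0x07, 0x09, 0x0A, 0x0D, 0x19)}
CHAR_LENGTH.update({g: 2 for g in (0x01, 0x02, 0x03, 0x04, 0x05, 0x06,
                                   0x08, 0x0B, 0x0C, 0x0E, 0x0F)})
CHAR_LENGTH.update({g: 3 for g in range(0x10, 0x20) if g != 0x19})


def split_canonical(data: bytes) -> List[bytes]:
    """Split a canonical LMBCS byte string into its individual characters."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x20:
            length = CHAR_LENGTH[first]   # group byte or plain control code
        elif first < 0x80:
            length = 1                    # ASCII range, no group byte
        else:
            raise ValueError("byte >= 80hex without a group byte")
        if i + length > len(data):
            raise ValueError("truncated character at offset %d" % i)
        chars.append(data[i:i + length])
        i += length
    return chars
```

For example, the byte string 41 01 82 10 88 EA 0A would be split into a one-byte ASCII character, a two-byte group 1 character, a three-byte group 16 character and a one-byte LF.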

Character set

Without a prefix byte, the code points 32 (20hex) to 127 (7Fhex) are interpreted as follows (corresponding to LMBCS codes 32 to 127):

Single byte codes (ASCII/ISO-646-US [10] )
   0 1 2 3 4 5 6 7 8 9 A B C D E F
2x  SP   ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~ DEL/

Group 1

LMBCS group 1 code points 128 (80hex) to 255 (FFhex) are identical to the corresponding code points in code page 850 (DOS Latin-1), whereas code points 1 (01hex) to 127 (7Fhex) are defined according to the following exception list (corresponding to LMBCS codes 256 to 383):

LMBCS Group 1, lower half [11] [10]
   0 1 2 3 4 5 6 7 8 9 A B C D E F
0x NUL
1x §
2x ¨ ~ ˚ ^ ` ´ ' - [lower-alpha 2] [lower-alpha 3] [lower-alpha 3]
3x ¨ [lower-alpha 4] ~ [lower-alpha 4] ˚ [lower-alpha 4] ^ [lower-alpha 4] ` [lower-alpha 4] ´ [lower-alpha 4] nbsp [lower-alpha 3] [lower-alpha 3]
4x Œ œ Ÿ ˙ [lower-alpha 3] ˚ [lower-alpha 3] [lower-alpha 4] [lower-alpha 3] [lower-alpha 3] [lower-alpha 3] [lower-alpha 3] [lower-alpha 3]
5x
6x ij IJ ʼn ŀ Ŀ ¯ [lower-alpha 3] ˘ [lower-alpha 3] ˝ [lower-alpha 3] ˛ [lower-alpha 3] ˇ [lower-alpha 3] ~ [lower-alpha 3] [lower-alpha 4] ^ [lower-alpha 3] [lower-alpha 4]
7x Ħ [lower-alpha 3] ħ [lower-alpha 3] Ŧ [lower-alpha 3] ŧ [lower-alpha 3] Ŋ [lower-alpha 3] ŋ [lower-alpha 3] ĸ [lower-alpha 3] Kr [lower-alpha 5]
  Mapped to a Unicode private use character
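The numbering used in this section suggests that an LMBCS code is formed from the group number and the byte value: codes 32 to 127 for characters without a prefix byte, and codes 256 to 383 for the lower half of group 1. The Python sketch below spells out that arithmetic; the formula is inferred from these ranges rather than quoted from the documentation, and the function names are hypothetical. Since the upper half of group 1 is described as identical to code page 850, the sketch decodes it with Python's built-in `cp850` codec.

```python
def lmbcs_code(group: int, byte: int) -> int:
    """LMBCS code as 256 * group + byte (formula inferred from the ranges in the text)."""
    if not (0 <= group <= 0x1F and 0 <= byte <= 0xFF):
        raise ValueError("group must be 0..31 and byte 0..255")
    return group * 256 + byte


def decode_group1_upper(byte: int) -> str:
    """Decode a group 1 code point in the range 80hex to FFhex via code page 850."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only the upper half of group 1 maps to code page 850")
    return bytes([byte]).decode("cp850")


lower_half = (lmbcs_code(1, 0x00), lmbcs_code(1, 0x7F))   # (256, 383)
upper_half = (lmbcs_code(1, 0x80), lmbcs_code(1, 0xFF))   # (384, 511)
```

The lower half of group 1 cannot be decoded this way, because it follows the exception list above rather than code page 850.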

Group 2

LMBCS group 2 code points 128 (80hex) to 255 (FFhex) are identical to the corresponding code points in code page 851 (DOS Greek), whereas code points 1 (01hex) to 127 (7Fhex) are defined according to the following exception list: [lower-alpha 6]

LMBCS Group 2, lower half [11]
   0 1 2 3 4 5 6 7 8 9 A B C D E F
0x NUL ͺ ΅ Ϊ Ϋ ΄ ʼ ʽ
1x
2x
3x
4x
5x
6x φ
7x
  Mapped to a Unicode private use character

Group 6

LMBCS group 6 code points 128 (80hex) to 255 (FFhex) are identical to the corresponding code points in code page 852 (DOS Latin-2), whereas code points 1 (01hex) to 127 (7Fhex) are defined according to the following exception list: [lower-alpha 6]

LMBCS Group 6, lower half [11]
   0 1 2 3 4 5 6 7 8 9 A B C D E F
0x NUL ā Ĉ ĉ Ċ ċ Ē ē Ė ė Ĝ ĝ Ġ ġ Ģ ģ
1x Ĥ ĥ Ĩ ĩ Ī ī Į į Ĵ ĵ Ķ ķ Ļ ļ Ņ ņ
2x Ō ō Ŗ ŗ Ŝ ŝ Ũ ũ Ū ū Ŭ ŭ Ų ų Ā
3x
4x
5x
6x
7x

Notes

  1. Lotus 1-2-3 Release 3.0 for DOS and newer versions are written in C.
  2. (U+2010), (U+2011), (U+2012), (U+2013)
  3. According to the documentation this code point is not supported by Lotus 1-2-3 Release 3.1+ for DOS and OS/2 and earlier.
  4. For compatibility with Lotus 1-2-3 Release 5.0.
  5. Unicode does not define a character for the crown currency symbol (Krone, abbreviated "Kr"); this code point therefore maps to F8FBhex in the Unicode Private Use Area (PUA).
  6. According to the documentation code points 1 to 127 in this group are not supported by Lotus 1-2-3 Release 3.1+ for DOS and OS/2 and earlier. These versions only supported LMBCS code points 0 to 511, covering group 0 and 1 only.

References

  1. Balaban, Bob (2001). "Multi-Language Character Sets – What They Are, How To Use Them" (PDF). Looseleaf Software, Inc. Archived (PDF) from the original on 2016-11-25. Retrieved 2016-11-25.
  2. "Appendix A. Encoding Schemes". IBM Character Data Representation Architecture. IBM (CDRA). Lotus Multi-byte Character Set (LMBCS). Archived from the original on 2016-11-26. Retrieved 2016-11-26. For optimization purposes, the group byte is omitted in Notes for single-byte values between X'20' and X'FF'. For example, LMBCS is always optimized to group 0x01, which means that any character where the first byte is greater than 0x1F, has an implicit group byte of 0x01.
  3. Scherer, Markus; Murray, Brendan (2000-06-02). "Re: MS Excel, Lotus 123 & Unicode". Archived from the original on 2016-12-06. Retrieved 2016-12-06.
  4. "Kapitel 4. Kompatibilität mit anderen 1-2-3 Versionen – Zeichensätze" [Chapter 4. Compatibility with other 1-2-3 Versions – Character Sets]. Lotus 1-2-3 Version 3.1 Upgrader's Handbuch [Upgrader's handbook] (in German) (1 ed.). Cambridge, MA, USA: Lotus Development Corporation. 1989. pp. 4-10–4-11. 302173.
  5. Kamenz, Alfred; Vonhoegen, Helmut (1992). Das große Buch zu Lotus 1-2-3 für DOS (in German) (1 ed.). Data Becker. pp. 131–132, 357–358. ISBN 3-89011-375-3.
  6. Lotus – Inside Notes – The Architecture of Notes and the Domino Server (PDF). Lotus Development Corporation. 2000. Archived (PDF) from the original on 2016-12-12. Retrieved 2016-12-12. […] Notes uses a single character set, the Lotus Multibyte Character Set (LMBCS), to encode all text data used internally by its programs. Whenever Notes first inputs text encoded in a character set other than LMBCS, it translates the text into a LMBCS string, and whenever it must output text in a character set other than LMBCS, it translates the internal LMBCS string into the appropriate character set. Because all text is internally formatted by LMBCS, all text-processing operations […] are done in only one way. LMBCS uses up to three bytes in memory to represent a single text character […]
  7. Murray, Brendan; Snyder-Grant, Jim, eds. (2016) [2000-02-09]. "ucnv_lmb.c". International Components for Unicode. International Business Machines (IBM).
  8. Batutis, Edward J. (2001-11-03). "Re: converter types". International Components for Unicode (ICU). Archived from the original on 2016-12-06. Retrieved 2016-12-06.
  9. "LMBCS" (in Japanese). 2009-02-03. Archived from the original on 2016-11-26. Retrieved 2016-11-26.
  10. "Anhang 2. Der Lotus Multibyte Zeichensatz (LMBCS)" [Appendix 2. The Lotus Multibyte Character Set (LMBCS)]. Lotus 1-2-3 Version 3.1 Referenzhandbuch [Lotus 1-2-3 Version 3.1 Reference Manual] (in German) (1 ed.). Cambridge, MA, USA: Lotus Development Corporation. 1989. pp. A2-1–A2-13. 302168.
  11. "lmb-excp.ucm". GitHub. 2000-02-10.

Further reading