KOI8-F

KOI8 Unified
Alias(es) KOI8-F
Language(s) Belarusian, Ukrainian, Russian, Bulgarian, Serbian Cyrillic, Macedonian
Created by Peter Cassetta (Fingertip Software)
Classification 8-bit KOI, extended ASCII
Extends KOI8-B
Based on KOI8-RU, KOI8-E
Other related encoding(s) KOI8-R, KOI8-U

KOI8-F or KOI8 Unified is an 8-bit character set. [1] It was designed by Peter Cassetta [2] of the now-defunct Fingertip Software as an attempt to support all the encoded letters from both KOI8-E (ISO-IR-111) and KOI8-RU (and hence also KOI8-U and KOI8-R), along with some of the pseudographics from KOI8-R, [2] [3] with additional punctuation in the remaining space, sourced partly from Windows-1251. [2] The encoding was used only in that company's software.

Character set

The following table shows the KOI8-F encoding. Rows 0x to 7x are identical to ASCII. From row 8x onward, each character is shown with its equivalent Unicode code point (hexadecimal) on the line beneath it. Differences from ISO-IR-111 are listed under the table; other relevant encodings which are matched, if any, are noted in footnotes.

KOI8-F [4]
    0 1 2 3 4 5 6 7 8 9 A B C D E F
0x  (control characters)
1x  (control characters)
2x  SP ! " # $ % & ' ( ) * + , - . /
3x  0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x  @ A B C D E F G H I J K L M N O
5x  P Q R S T U V W X Y Z [ \ ] ^ _
6x  ` a b c d e f g h i j k l m n o
7x  p q r s t u v w x y z { | } ~ DEL
8x  ─[a] │[a] ┌[a] ┐[a] └[a] ┘[a] ├[a] ┤[a] ┬[a] ┴[a] ┼[a] ▀[a] ▄[a] █[a] ▌[a] ▐[a]
    2500 2502 250C 2510 2514 2518 251C 2524 252C 2534 253C 2580 2584 2588 258C 2590
9x  ░[a] ‘[b] ’[b] “[b] ”[b] ∙/•[c] –[b] —[b] © ™[b] NBSP[d] » ® « ·[a] ¤
    2591 2018 2019 201C 201D 2219/2022 2013 2014 00A9 2122 00A0 00BB 00AE 00AB 00B7 00A4
Ax  NBSP[d] ђ ѓ ё є ѕ і ї ј љ њ ћ ќ ґ[e] ў џ
    00A0 0452 0453 0451 0454 0455 0456 0457 0458 0459 045A 045B 045C 0491 045E 045F
Bx  № Ђ Ѓ Ё Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ґ[e] Ў Џ
    2116 0402 0403 0401 0404 0405 0406 0407 0408 0409 040A 040B 040C 0490 040E 040F
Cx  ю а б ц д е ф г х и й к л м н о
    044E 0430 0431 0446 0434 0435 0444 0433 0445 0438 0439 043A 043B 043C 043D 043E
Dx  п я р с т у ж в ь ы з ш э щ ч ъ
    043F 044F 0440 0441 0442 0443 0436 0432 044C 044B 0437 0448 044D 0449 0447 044A
Ex  Ю А Б Ц Д Е Ф Г Х И Й К Л М Н О
    042E 0410 0411 0426 0414 0415 0424 0413 0425 0418 0419 041A 041B 041C 041D 041E
Fx  П Я Р С Т У Ж В Ь Ы З Ш Э Щ Ч Ъ
    041F 042F 0420 0421 0422 0423 0416 0412 042C 042B 0417 0428 042D 0429 0427 042A

  Differences from ISO-IR-111: every position in rows 8x and 9x, plus 0xAD and 0xBD.

  a. Matching KOI8-R, KOI8-U, KOI8-RU.
  b. Matching Windows-1251 and Windows-1252.
  c. May be U+2219, which matches RFC 1489 (KOI8-R), [4] or U+2022, which matches Windows-1251 and Windows-1252.
  d. The non-breaking space is encoded twice: first at 0x9A, matching KOI8-R, and then at 0xA0, matching KOI8-E (the latter of which also happens to be its location in Windows-1251 and Windows-1252).
  e. Matching KOI8-U and KOI8-RU.
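
As an illustration of how the table above can be used, here is a minimal Python sketch of a KOI8-F decoder and encoder built directly from the listed code points. The names `UPPER_HALF`, `DECODE_TABLE`, `ENCODE_TABLE`, `decode_koi8f` and `encode_koi8f` are hypothetical and are not part of any library. Two choices are assumptions of this sketch, not something the source prescribes: byte 0x95 is decoded as U+2219 (the KOI8-R reading), and the non-breaking space is always encoded as 0xA0 even though 0x9A decodes to it as well.

```python
# Unicode code points for bytes 0x80-0xFF, copied row by row from the table.
# Assumption: 0x95 is read as U+2219 (the KOI8-R reading); U+2022 is the
# other reading that the table notes allow.
UPPER_HALF = """
2500 2502 250C 2510 2514 2518 251C 2524 252C 2534 253C 2580 2584 2588 258C 2590
2591 2018 2019 201C 201D 2219 2013 2014 00A9 2122 00A0 00BB 00AE 00AB 00B7 00A4
00A0 0452 0453 0451 0454 0455 0456 0457 0458 0459 045A 045B 045C 0491 045E 045F
2116 0402 0403 0401 0404 0405 0406 0407 0408 0409 040A 040B 040C 0490 040E 040F
044E 0430 0431 0446 0434 0435 0444 0433 0445 0438 0439 043A 043B 043C 043D 043E
043F 044F 0440 0441 0442 0443 0436 0432 044C 044B 0437 0448 044D 0449 0447 044A
042E 0410 0411 0426 0414 0415 0424 0413 0425 0418 0419 041A 041B 041C 041D 041E
041F 042F 0420 0421 0422 0423 0416 0412 042C 042B 0417 0428 042D 0429 0427 042A
"""

# Bytes 0x00-0x7F map to the same ASCII code points.
DECODE_TABLE = [chr(b) for b in range(0x80)]
DECODE_TABLE += [chr(int(cp, 16)) for cp in UPPER_HALF.split()]
assert len(DECODE_TABLE) == 256

# Reverse table. Bytes are visited in ascending order, so the later 0xA0
# overwrites 0x9A and NBSP encodes to 0xA0 (an arbitrary choice made here).
ENCODE_TABLE = {ch: byte for byte, ch in enumerate(DECODE_TABLE)}
ENCODE_TABLE["\u2022"] = 0x95  # also accept the other bullet reading


def decode_koi8f(data: bytes) -> str:
    """Decode KOI8-F bytes to a Unicode string (every byte is mapped)."""
    return "".join(DECODE_TABLE[b] for b in data)


def encode_koi8f(text: str) -> bytes:
    """Encode text as KOI8-F; raises KeyError for characters not in the table."""
    return bytes(ENCODE_TABLE[ch] for ch in text)


if __name__ == "__main__":
    sample = "Київ, Љубљана"
    raw = encode_koi8f(sample)
    print(raw.hex())
    print(decode_koi8f(raw))
```

Because 0x9A and 0xA0 both decode to the non-breaking space, decoding followed by encoding does not always reproduce the original bytes, which is why the sketch has to pick one of the two positions when encoding.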

References

  1. Nechayev, Valentin (2013) [2001]. "Review of 8-bit Cyrillic encodings universe". Archived from the original on 2016-12-05. Retrieved 2016-12-05.
  2. Czyborra, Roman (1998-11-30) [1998-05-25]. "The Cyrillic Charset Soup". Archived from the original on 2016-12-03. Retrieved 2016-12-03.
  3. "KOI8 Unified". Fingertip Software. Archived from the original on 1998-01-09. Retrieved 2020-02-11.
  4. Leisher, Mark (2008) [1998-03-05]. "KOI8 Unified Cyrillic to Unicode 2.1 mapping table". Department of Mathematical Sciences, New Mexico State University. Retrieved 2020-05-02.