KOI8-F

KOI8 Unified
Alias(es): KOI8-F
Language(s): Belarusian, Ukrainian, Russian, Bulgarian, Serbian Cyrillic, Macedonian
Created by: Peter Cassetta (Fingertip Software)
Classification: 8-bit KOI, extended ASCII
Extends: KOI8-B
Based on: KOI8-RU, KOI8-E
Other related encoding(s): KOI8-R, KOI8-U

KOI8-F, or KOI8 Unified, is an 8-bit character set. [1] It was designed by Peter Cassetta [2] of Fingertip Software (now defunct) to support all the letters encoded in both KOI8-E (ISO-IR-111) and KOI8-RU (and hence also those of KOI8-U and KOI8-R), along with some of the pseudographics from KOI8-R. [3] [2] The remaining space holds additional punctuation, sourced partly from Windows-1251. [2] The encoding was used only in that company's software. FreeDOS calls it code page 60270.


Character set

The following table shows the KOI8-F encoding. Each character in the upper half is shown with its equivalent Unicode code point on the line below it. Differences from ISO-IR-111 are marked with a dagger (†); other relevant encodings which are matched, if any, are noted in footnotes.

KOI8-F [4]
    0       1       2       3       4       5       6       7       8       9       A       B       C       D       E       F
0x
1x
2x  SP      !       "       #       $       %       &       '       (       )       *       +       ,       -       .       /
3x  0       1       2       3       4       5       6       7       8       9       :       ;       <       =       >       ?
4x  @       A       B       C       D       E       F       G       H       I       J       K       L       M       N       O
5x  P       Q       R       S       T       U       V       W       X       Y       Z       [       \       ]       ^       _
6x  `       a       b       c       d       e       f       g       h       i       j       k       l       m       n       o
7x  p       q       r       s       t       u       v       w       x       y       z       {       |       }       ~
8x  ─†      │†      ┌†      ┐†      └†      ┘†      ├†      ┤†      ┬†      ┴†      ┼†      ▀†      ▄†      █†      ▌†      ▐†
    2500[a] 2502[a] 250C[a] 2510[a] 2514[a] 2518[a] 251C[a] 2524[a] 252C[a] 2534[a] 253C[a] 2580[a] 2584[a] 2588[a] 258C[a] 2590[a]
9x  ░†      ‘†      ’†      “†      ”†      ∙/•†    –†      —†      ©†      ™†      NBSP†   »†      ®†      «†      ·†      ¤†
    2591[a] 2018[b] 2019[b] 201C[b] 201D[b] [c]     2013[b] 2014[b] 00A9    2122[b] 00A0[d] 00BB    00AE    00AB    00B7[a] 00A4
Ax  NBSP    ђ       ѓ       ё       є       ѕ       і       ї       ј       љ       њ       ћ       ќ       ґ†      ў       џ
    00A0[d] 0452    0453    0451    0454    0455    0456    0457    0458    0459    045A    045B    045C    0491[e] 045E    045F
Bx  №       Ђ       Ѓ       Ё       Є       Ѕ       І       Ї       Ј       Љ       Њ       Ћ       Ќ       Ґ†      Ў       Џ
    2116    0402    0403    0401    0404    0405    0406    0407    0408    0409    040A    040B    040C    0490[e] 040E    040F
Cx  ю       а       б       ц       д       е       ф       г       х       и       й       к       л       м       н       о
    044E    0430    0431    0446    0434    0435    0444    0433    0445    0438    0439    043A    043B    043C    043D    043E
Dx  п       я       р       с       т       у       ж       в       ь       ы       з       ш       э       щ       ч       ъ
    043F    044F    0440    0441    0442    0443    0436    0432    044C    044B    0437    0448    044D    0449    0447    044A
Ex  Ю       А       Б       Ц       Д       Е       Ф       Г       Х       И       Й       К       Л       М       Н       О
    042E    0410    0411    0426    0414    0415    0424    0413    0425    0418    0419    041A    041B    041C    041D    041E
Fx  П       Я       Р       С       Т       У       Ж       В       Ь       Ы       З       Ш       Э       Щ       Ч       Ъ
    041F    042F    0420    0421    0422    0423    0416    0412    042C    042B    0417    0428    042D    0429    0427    042A
  † Differs from ISO-IR-111
  a. Matching KOI8-R, KOI8-U, KOI8-RU.
  b. Matching Windows-1251 and Windows-1252.
  c. May be U+2219, which matches RFC 1489 (KOI8-R), [4] or U+2022, which matches Windows-1251 and Windows-1252.
  d. The non-breaking space is encoded twice: first at 0x9A matching KOI8-R, and then at 0xA0 matching KOI8-E (the latter of which also happens to be its location in Windows-1251 and Windows-1252).
  e. Matching KOI8-U and KOI8-RU.
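
As a minimal sketch of how the table can be used, the following Python builds a lookup from the code points listed above and converts between KOI8-F bytes and Unicode. The names HIGH_HEX, decode_koi8f and encode_koi8f are hypothetical and not part of any library. Two assumptions are made: position 0x95 is taken as U+2219 (U+2022 is the other reading), and the encoder arbitrarily emits 0x9A for the non-breaking space, which the table encodes twice.

```python
# Upper half (0x80-0xFF) of KOI8-F as hex code points, one table row per line.
# Assumption: 0x95 is taken as U+2219; U+2022 is the alternative reading.
HIGH_HEX = """
2500 2502 250C 2510 2514 2518 251C 2524 252C 2534 253C 2580 2584 2588 258C 2590
2591 2018 2019 201C 201D 2219 2013 2014 00A9 2122 00A0 00BB 00AE 00AB 00B7 00A4
00A0 0452 0453 0451 0454 0455 0456 0457 0458 0459 045A 045B 045C 0491 045E 045F
2116 0402 0403 0401 0404 0405 0406 0407 0408 0409 040A 040B 040C 0490 040E 040F
044E 0430 0431 0446 0434 0435 0444 0433 0445 0438 0439 043A 043B 043C 043D 043E
043F 044F 0440 0441 0442 0443 0436 0432 044C 044B 0437 0448 044D 0449 0447 044A
042E 0410 0411 0426 0414 0415 0424 0413 0425 0418 0419 041A 041B 041C 041D 041E
041F 042F 0420 0421 0422 0423 0416 0412 042C 042B 0417 0428 042D 0429 0427 042A
"""

HIGH = "".join(chr(int(h, 16)) for h in HIGH_HEX.split())
assert len(HIGH) == 128

# Bytes 0x00-0x7F are plain ASCII; bytes 0x80-0xFF come from the table.
DECODE = [chr(i) for i in range(128)] + list(HIGH)

# Reverse map. The non-breaking space sits at both 0x9A and 0xA0;
# setdefault keeps the first position seen (0x9A), an arbitrary choice.
ENCODE = {}
for byte, ch in enumerate(DECODE):
    ENCODE.setdefault(ch, byte)


def decode_koi8f(data: bytes) -> str:
    """Convert KOI8-F bytes to a Unicode string."""
    return "".join(DECODE[b] for b in data)


def encode_koi8f(text: str) -> bytes:
    """Convert a string to KOI8-F bytes; raises KeyError if a character is absent."""
    return bytes(ENCODE[ch] for ch in text)


if __name__ == "__main__":
    sample = bytes([0xBD, 0xC1, 0xCE, 0xCF, 0xCB])
    print(decode_koi8f(sample))  # Ґанок
```

Because 0xA0 decodes to the same character as 0x9A, a decode followed by an encode is lossless for every byte except 0xA0, which comes back as 0x9A under the choice made here.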

KOI8-C/KOI8-CA

KOI8-C, also known as KOI8-CA, is a variant 8-bit character set. It modifies KOI8-F to support Caucasian languages while retaining support for the same languages as KOI8-F. FreeDOS calls it code page 61294. It has hardly ever been used. The name KOI8-C once referred to what is now known as KOI8-O.

KOI8-C/KOI8-CA (differing rows only) [5]
    0       1       2       3       4       5       6       7       8       9       A       B       C       D       E       F
8x  ғ†      җ†      қ†      ҝ†      ң†      ү†      ұ†      ҳ†      ҷ†      ҹ†      һ†      ▀       ә†      ӣ†      ө†      ӯ†
    0493    0497    049B    049D    04A3    04AF    04B1    04B3    04B7    04B9    04BB    2580    04D9    04E3    04E9    04EF
9x  Ғ†      Җ†      Қ†      Ҝ†      Ң†      Ү†      Ұ†      Ҳ†      Ҷ†      Ҹ†      Һ†      ⌡†      Ә†      Ӣ†      Ө†      Ӯ†
    0492    0496    049A    049C    04A2    04AE    04B0    04B2    04B6    04B8    04BA    2321    04D8    04E2    04E8    04EE
  † Differs from KOI8-F
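
The relationship between the two charts can be checked mechanically. The sketch below, in Python, compares rows 8x and 9x of both encodings using the code points from the tables; the names KOI8_F_ROWS, KOI8_C_ROWS and changed_positions are hypothetical, and 0x95 of KOI8-F is again assumed to be U+2219.

```python
# Rows 8x-9x (bytes 0x80-0x9F) of each encoding as hex code points.
# Assumption: KOI8-F 0x95 is taken as U+2219; U+2022 would not change the result.
KOI8_F_ROWS = """
2500 2502 250C 2510 2514 2518 251C 2524 252C 2534 253C 2580 2584 2588 258C 2590
2591 2018 2019 201C 201D 2219 2013 2014 00A9 2122 00A0 00BB 00AE 00AB 00B7 00A4
""".split()

KOI8_C_ROWS = """
0493 0497 049B 049D 04A3 04AF 04B1 04B3 04B7 04B9 04BB 2580 04D9 04E3 04E9 04EF
0492 0496 049A 049C 04A2 04AE 04B0 04B2 04B6 04B8 04BA 2321 04D8 04E2 04E8 04EE
""".split()


def changed_positions(base, variant, start=0x80):
    """Return the byte values at which the two rows assign different code points."""
    return [start + i for i, (b, v) in enumerate(zip(base, variant)) if b != v]


changed = changed_positions(KOI8_F_ROWS, KOI8_C_ROWS)
unchanged = sorted(set(range(0x80, 0xA0)) - set(changed))

if __name__ == "__main__":
    print(len(changed))                 # 31
    print([hex(p) for p in unchanged])  # ['0x8b']
```

Only 0x8B (U+2580) keeps its KOI8-F assignment; the other 31 positions in these two rows are reassigned in KOI8-C.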

References

  1. Nechayev, Valentin (2013) [2001]. "Review of 8-bit Cyrillic encodings universe". Archived from the original on 2016-12-05. Retrieved 2016-12-05.
  2. Czyborra, Roman (1998-11-30) [1998-05-25]. "The Cyrillic Charset Soup". Archived from the original on 2016-12-03. Retrieved 2016-12-03.
  3. "KOI8 Unified". Fingertip Software. Archived from the original on 1998-01-09. Retrieved 2020-02-11.
  4. Leisher, Mark (2008) [1998-03-05]. "KOI8 Unified Cyrillic to Unicode 2.1 mapping table". Department of Mathematical Sciences, New Mexico State University. Retrieved 2020-05-02.
  5. Discussion