Optical Character Recognition (Unicode block)

Optical Character Recognition
Range: U+2440..U+245F (32 code points)
Plane: BMP
Scripts: Common
Symbol sets: OCR controls
Assigned: 11 code points
Unused: 21 reserved code points
Source standards: ISO 2033
Unicode version history: 1.0.0 (1991): 11 (+11)
Unicode documentation: Code chart ∣ Web page
Note: [1] [2]

Optical Character Recognition is a Unicode block containing signal characters for OCR and MICR standards.

Block

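Optical Character Recognition [1] [2]
Official Unicode Consortium code chart (PDF)

         0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
U+244x   ⑀  ⑁  ⑂  ⑃  ⑄  ⑅  ⑆  ⑇  ⑈  ⑉  ⑊  -  -  -  -  -
U+245x   -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -

Notes
1. ^ As of Unicode version 15.1
2. ^ Hyphens (grey areas in the official chart) indicate non-assigned code points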
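The split between assigned and reserved code points can be checked programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `block_names` is hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Block range as stated in the infobox: U+2440..U+245F (32 code points).
BLOCK_START, BLOCK_END = 0x2440, 0x245F


def block_names():
    """Map each assigned code point in the block to its formal name.

    Hypothetical helper for illustration only.
    """
    names = {}
    for cp in range(BLOCK_START, BLOCK_END + 1):
        # unicodedata.name returns the default (None) for unassigned code points.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            names[cp] = name
    return names


assigned = block_names()
for cp, name in assigned.items():
    print(f"U+{cp:04X} {chr(cp)} {name}")

total = BLOCK_END - BLOCK_START + 1
print(len(assigned), "assigned,", total - len(assigned), "reserved")
```

With a character database matching Unicode 15.1, the summary line should agree with the 11 assigned and 21 reserved code points given above.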

Subheadings

The Optical Character Recognition block has three informal subheadings (groupings) within its character collection: OCR-A, MICR, and OCR. [3]

OCR-A

A partly redacted German cheque, showing use of ⑂, ⑀ and ⑁ in the machine-readable line

The OCR-A subheading contains six characters taken from the OCR-A font described in the ISO 1073-1:1976 standard: U+2440 ⑀ OCR HOOK, U+2441 ⑁ OCR CHAIR, U+2442 ⑂ OCR FORK, U+2443 ⑃ OCR INVERTED FORK, U+2444 ⑄ OCR BELT BUCKLE, and U+2445 ⑅ OCR BOW TIE. The OCR bow tie is given the informative alias "unique asterisk".

The hook, chair and fork, in addition to a long vertical bar, are included in the most basic "numeric" implementation level of OCR-A, which includes digits but excludes letters and conventional punctuation. [4] By contrast, the most basic implementation level of OCR-B instead includes the digits, plus sign, less-than sign, greater-than sign, long vertical bar and seven of the capital letters; [5] as such, there are no characters specific to OCR-B in the Optical Character Recognition block.

MICR

A cheque signed by Richard Nixon, showing use of ⑆, ⑇, ⑈ and ⑉ in the machine-readable line

The MICR subheading contains four punctuation characters for bank cheque identifiers, taken from the magnetic ink character recognition E-13B font (codified in the ISO 1004:1995 standard): U+2446 ⑆ OCR BRANCH BANK IDENTIFICATION, U+2447 ⑇ OCR AMOUNT OF CHECK, U+2448 ⑈ OCR DASH, and U+2449 ⑉ OCR CUSTOMER ACCOUNT NUMBER.

The latter two characters are misnamed: their names were inadvertently switched in the 1993 (first) edition of ISO/IEC 10646, [6] a mistake that was already present in Unicode 1.0.0. [7] Their formal names cannot be changed because of the Unicode stability policy, but both have corrected normative aliases: U+2448 ⑈ is MICR ON US SYMBOL, and U+2449 ⑉ is MICR DASH SYMBOL [8] (the standard notes that "the Unicode character names include several misnomers").
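The relationship between the unchanged formal names and the corrected aliases can be illustrated with a short sketch. It assumes Python's standard `unicodedata` module, whose `lookup` function has accepted name aliases since Python 3.3; the variable names are illustrative only.

```python
import unicodedata

# Formal names are frozen by the stability policy, so the swapped
# names are still what the character database reports.
formal_names = {ch: unicodedata.name(ch) for ch in "\u2448\u2449"}
for ch, name in formal_names.items():
    print(f"U+{ord(ch):04X} {ch} formal name: {name}")

# The corrected aliases resolve to the same code points as the
# misleading formal names do.
on_us = unicodedata.lookup("MICR ON US SYMBOL")
dash = unicodedata.lookup("MICR DASH SYMBOL")
print(f"MICR ON US SYMBOL -> U+{ord(on_us):04X}")
print(f"MICR DASH SYMBOL  -> U+{ord(dash):04X}")
```

Note that `unicodedata.name` only ever returns the formal name, so code that displays character names to users may want to consult the alias data separately.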

These symbols had previously been encoded in the ISO-IR-98 character set defined by ISO 2033:1983, in which they were simply named SYMBOL ONE through SYMBOL FOUR. [9] All four characters have informative aliases in the Unicode charts: "transit", "amount", "on us", and "dash" respectively.

OCR

The OCR subheading consists of a single character: U+244A ⑊ OCR DOUBLE BACKSLASH.

History

The following Unicode-related documents record the purpose and process of defining specific characters in the Optical Character Recognition block:

References

  1. "Unicode character database". The Unicode Standard. Retrieved 2023-07-26.
  2. "Enumerated Versions of The Unicode Standard". The Unicode Standard. Retrieved 2023-07-26.
  3. "Unicode Code Charts: Optical Character Recognition" (PDF). The Unicode Standard, Version 6.3. Retrieved 27 February 2014.
  4. European Computer Manufacturers Association (1977). "Nominal Character Dimensions of the Numeric OCR-A Font" (PDF) (2nd ed.). ECMA-8.
  5. ISO/IEC JTC1/SC2/WG3 (1998-09-28). "9.1: Subset 1: Minimal alphanumeric subset" (PDF). Proposal for Type 3 Technical Report, TR 15907, Information technology—Revision of OCR-B standard (ISO 1073-2:1976). p. 8. ISO/IEC JTC1/SC2/WG3 N470.
  6. ISO/IEC JTC 1/SC 2/WG 2 (2012-01-03). "T.3. Optical Character Recognition". Unconfirmed minutes of WG 2 meeting 58 (PDF). p. 29. SC2 N4188 / WG2 N4103. "These Magnetic Ink Character Recognition (MICR) symbols are used by banks on checks. The names of these characters were inadvertently mixed up in the 1993 edition of ISO/IEC 10646."
  7. "3.8: Block-by-Block Charts" (PDF). The Unicode Standard. version 1.0. Unicode Consortium.
  8. Freytag, Asmus; McGowan, Rick; Whistler, Ken (2017-04-10). Known Anomalies in Unicode Character Names (4 ed.). Unicode Consortium. Unicode Technical Note #27.
  9. ISO/TC97/SC2 (1985-08-01). ISO-IR-98: E13B Graphic Character Set (PDF). ITSCJ/IPSJ.