GNU Unifont

Category: Unicode, Bitmap, Sans-serif
Classification: Duospace
Designer(s): Roman Czyborra, Paul Hardy
Date created: 1998
Glyphs: 2,096,578
License:
  Source code: GPL-2.0-or-later
  Font: GPL-2.0-or-later with Font-exception-2.0, SIL OFL 1.1 (since 13.0.04)
  Manual: GFDL-1.3-or-later
Website: unifoundry.com/unifont/, savannah.gnu.org/projects/unifont/
Latest release version: 15.0.05 [1]
Latest release date: 3 June 2023

GNU Unifont is a free Unicode bitmap font created by Roman Czyborra. The main Unifont covers all of the Basic Multilingual Plane (BMP). The "upper" companion covers significant parts of the Supplementary Multilingual Plane (SMP). The "Unifont JP" companion contains Japanese kanji present in the JIS X 0213 character set.

Unifont is included in most free operating systems and windowing systems, such as Linux, XFree86 and the X.Org Server, in some embedded firmware such as Rockbox, and in Minecraft: Java Edition. [2] The source code is released under the GPL-2.0-or-later license. The font is released under GPL-2.0-or-later with Font-exception-2.0 (embedding the font in a document does not require the document to be placed under the same license) and, from version 13.0.04, is dual-licensed under the SIL Open Font License 1.1. The manual is released under the GFDL-1.3-or-later license.

It became a GNU package in October 2013. The current maintainer is Paul Hardy.

Status

The Unicode Basic Multilingual Plane covers 2^16 (65,536) code points. Of this number, 2,048 are reserved for special use as UTF-16 surrogate pairs and 6,400 are reserved for private use. This leaves 57,088 code points to which glyphs can be assigned. Some of these code points are special values that do not have an assigned glyph, but most do have assigned glyphs.
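The count above can be checked with a few lines of arithmetic. This is a minimal sketch; the code point ranges used for the surrogates (U+D800 to U+DFFF) and the BMP Private Use Area (U+E000 to U+F8FF) come from the Unicode Standard rather than from this article, and the variable names are illustrative.

```python
# Code points in the Basic Multilingual Plane: U+0000 through U+FFFF.
BMP_CODE_POINTS = 2 ** 16

# Ranges taken from the Unicode Standard (not stated in the text above).
SURROGATES = 0xDFFF - 0xD800 + 1   # UTF-16 surrogate code points
PRIVATE_USE = 0xF8FF - 0xE000 + 1  # BMP Private Use Area

assignable = BMP_CODE_POINTS - SURROGATES - PRIVATE_USE
print(SURROGATES, PRIVATE_USE, assignable)  # 2048 6400 57088
```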

As of May 2019, GNU Unifont has complete coverage of the Basic Multilingual Plane as defined in Unicode 12.1.0. Its companion fonts, Unifont Upper and Unifont CSUR, have significant coverage of the Supplementary Multilingual Plane and the ConScript Unicode Registry, respectively.

Unifont JP was introduced in version 12.1.02. It covers 10,000 Japanese kanji present in the JIS X 0213 character set, some of which are in the Supplementary Ideographic Plane, and it is derived from Jiskan16, a public domain font.

Any contributor can add glyphs to scripts that are still incomplete.

Most of the CJK ideographs in the font were copied, with permission, from WenQuanYi's Unibit font. [3] (section "Wen Quan Yi: Spring of Letters")

Unifont stores only one glyph per printable Unicode code point. As a result, it lacks the OpenType features needed to render scripts with complex layouts correctly. It does not correctly position combining diacritics on base letters unless the combination is encoded in Unicode in precomposed form, and it does not handle contextual forms (including joining types and subjoined clusters).

Covering these cases with dedicated glyphs would increase the number of glyphs in the basic font, and current OpenType limitations make it impossible to encode all the glyphs needed to represent every combination that can exist in a single Unicode plane. The same limitation applies to Chinese fonts, which cannot completely cover the ideographs currently encoded in two planes, with more in a third.

Unifont is therefore intended only as a "last resort" default font, suitable for simple alphabetic scripts or for rendering isolated characters; it can make running text in other scripts difficult or sometimes impossible to read correctly. To render Indic abugidas correctly (and Semitic abjads written with their optional combining diacritics), other fonts should be specified before this one. Additional fonts are needed to cover Han ideographs encoded in supplementary planes and to render most historic (or minority modern) scripts not encoded in the BMP.
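The distinction between precomposed and non-precomposed combinations can be illustrated with Python's standard unicodedata module. This is only a rough sketch: it treats a base letter plus one combining mark as having a precomposed form when NFC normalization collapses the pair into a single code point, which ignores Unicode's composition exclusions. The helper name has_precomposed_form is hypothetical and is not part of any Unifont tool.

```python
import unicodedata

def has_precomposed_form(base: str, mark: str) -> bool:
    """Return True if NFC composes base + mark into a single code point.

    Approximation: a few precomposed characters are excluded from NFC
    composition, so a False result is not absolute proof of absence.
    """
    composed = unicodedata.normalize("NFC", base + mark)
    return len(composed) == 1

# U+0301 is COMBINING ACUTE ACCENT.
print(has_precomposed_form("e", "\u0301"))  # True: composes to U+00E9
print(has_precomposed_form("q", "\u0301"))  # False: no precomposed q with acute
```

A font with one glyph per code point can draw the first combination from a single glyph, while the second depends on mark positioning that such a font does not provide.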

Distribution

As of version 15.0.06, the "standard build" of Unifont is available in TTF (and OTF), BDF, PCF, .hex, and PSF formats. Only the TrueType build is split into two fonts. [3]

A few "specialized versions" have been built by request and made available by Paul Hardy. These include a bitmap TTF (SBIT) with empty glyphs filled with code-point values for FontForge users to read, a PSF bitmap with glyphs for APL programmers, and single-file versions in Roman Czyborra's .hex format (see below). [3] The source itself is organized as smaller .hex files that are stitched together and converted to other formats during a build. [4]
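The idea of stitching smaller .hex pieces into one listing can be sketched as follows. This is not the project's actual build procedure: the function name merge_hex_sources, the rule that a later piece overrides an earlier one, and the all-blank glyph used for U+0020 are assumptions made for illustration. Only the line layout (code point, colon, bitmap) comes from the format described in this article.

```python
def merge_hex_sources(sources):
    """Merge the text of several .hex pieces into one listing sorted by code point.

    Assumption: if two pieces define the same code point, the later one wins.
    """
    glyphs = {}
    for text in sources:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines
            code, bitmap = line.split(":", 1)
            glyphs[int(code, 16)] = bitmap.upper()
    return "\n".join(f"{cp:04X}:{bm}" for cp, bm in sorted(glyphs.items()))

latin = "0041:0000000018242442427E424242420000\n"
spaces = "0020:" + "00" * 16 + "\n"  # placeholder all-blank glyph, for illustration only

print(merge_hex_sources([latin, spaces]))
```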

Vectorization

Luis Alejandro González Miranda wrote scripts to vectorize and convert the BDF font to TrueType format using FontForge. [5] Paul Hardy adjusted these scripts to handle combining characters (accents, etc.) for the latest TrueType versions. [3] (section "TrueType Font Generation")

.hex format

The GNU Unifont .hex format defines its glyphs as either 8 or 16 pixels in width by 16 pixels in height. Most Western script glyphs can be defined as 8 pixels wide, while other glyphs (notably the Chinese–Japanese–Korean, or CJK set) are typically defined as 16 pixels wide.

The unifont.hex file contains one line for each glyph. Each line consists of a four-digit Unicode hexadecimal code point, a colon, and the bitmap string. The bit string is 32 hexadecimal digits for an 8-pixel-wide glyph, or 64 hexadecimal digits for a 16-pixel-wide glyph. The format was designed as an intermediate format that makes it easy to add new glyphs.
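The line layout described above can be parsed in a few lines. This is a minimal sketch rather than code from the Unifont utilities; the function name parse_hex_line is hypothetical, and the glyph width is inferred purely from the length of the bit string as stated in the text.

```python
def parse_hex_line(line):
    """Split one .hex line into (code point, glyph width in pixels, bitmap bytes)."""
    code, bitmap = line.strip().split(":", 1)
    if len(bitmap) == 32:
        width = 8    # 16 rows x 2 hex digits per row
    elif len(bitmap) == 64:
        width = 16   # 16 rows x 4 hex digits per row
    else:
        raise ValueError(f"unexpected bitmap length: {len(bitmap)}")
    return int(code, 16), width, bytes.fromhex(bitmap)

code_point, width, data = parse_hex_line("0041:0000000018242442427E424242420000")
print(hex(code_point), width, len(data))  # 0x41 8 16
```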

The bit string is converted from hexadecimal to binary. A 1 bit in the binary bit string corresponds to an 'on' pixel. The pixel bits are stored row by row, from top to bottom, in big-endian order, so the most significant bit of each row is its leftmost pixel.
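The conversion from hexadecimal digits to rows of pixels can be sketched like this. The function name bitmap_rows is hypothetical; the sketch assumes the fixed height of 16 rows described earlier and derives the width from the string length.

```python
def bitmap_rows(bitmap_hex):
    """Turn a .hex bit string into 16 strings of '0'/'1', one per pixel row."""
    height = 16
    digits_per_row = len(bitmap_hex) // height  # 2 for 8-pixel glyphs, 4 for 16-pixel glyphs
    width = digits_per_row * 4                  # each hex digit encodes 4 pixels
    rows = []
    for r in range(height):
        chunk = bitmap_hex[r * digits_per_row:(r + 1) * digits_per_row]
        # Zero-padded binary keeps the most significant bit as the leftmost pixel.
        rows.append(format(int(chunk, 16), f"0{width}b"))
    return rows

rows = bitmap_rows("0000000018242442427E424242420000")
print(rows[4])  # 00011000
print(rows[9])  # 01111110
```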

Example

This is an example font containing one glyph, for ASCII capital 'A'.

0041:0000000018242442427E424242420000 

The first number is the hexadecimal Unicode code point, with range 0000 through FFFF. Hexadecimal 0041 is decimal 65, the code point for the letter 'A'. The colon separates the code point from the bitmap. In this example, the glyph is 8 pixels wide, so the bit string is 32 hexadecimal digits long.

The bit string begins with 8 zeros, so the top 4 rows are empty (2 hexadecimal digits per 8-bit byte, with 8 bits per row for an 8-pixel-wide glyph). The bit string also ends with 4 zeros, so the bottom 2 rows are empty. This implies that the default font descender is 2 rows below the baseline and the capital height is 10 rows above the baseline, which is the case for the Latin glyphs in GNU Unifont.
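The row counts in this example can be verified by rendering the glyph as text and counting the blank rows at the top and bottom. This is an illustrative sketch; the '#' and '.' characters are an arbitrary choice for on and off pixels.

```python
LINE = "0041:0000000018242442427E424242420000"
bitmap = LINE.split(":")[1]

# Two hexadecimal digits per row for an 8-pixel-wide glyph.
rows = [bitmap[i:i + 2] for i in range(0, len(bitmap), 2)]

# Leftmost pixel is the most significant bit of each byte.
art = ["".join("#" if (int(r, 16) >> (7 - x)) & 1 else "." for x in range(8))
       for r in rows]
print("\n".join(art))

top_blank = next(i for i, r in enumerate(rows) if r != "00")
bottom_blank = next(i for i, r in enumerate(reversed(rows)) if r != "00")
cap_height = len(rows) - top_blank - bottom_blank
print(top_blank, bottom_blank, cap_height)  # 4 2 10
```

The 10 inked rows sit directly above the 2 empty descender rows, matching the capital height given above.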

Over time, several ways of editing the format have been created. The earliest is the hexdraw Perl script, which converts the string into an ASCII art representation to be edited in a text editor. Another method involves generating a bitmap image grid for an entire range of code points and working with an image editor. In either case, the edited glyphs are later converted back into .hex files for storage. [4]
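The step of converting an edited text drawing back into a .hex line can be sketched as below. This does not reproduce hexdraw's actual file layout, which is not described here; the function name art_to_hex and the use of '#' for on pixels and '.' for off pixels are assumptions for illustration.

```python
def art_to_hex(code_point, art_rows):
    """Encode 16 rows of '#'/'.' text (8 or 16 columns wide) as one .hex line."""
    if len(art_rows) != 16:
        raise ValueError("a glyph must have exactly 16 rows")
    width = len(art_rows[0])
    if width not in (8, 16) or any(len(row) != width for row in art_rows):
        raise ValueError("rows must all be 8 or all be 16 characters wide")
    digits = width // 4  # hexadecimal digits per row
    bitmap = ""
    for row in art_rows:
        bits = "".join("1" if ch == "#" else "0" for ch in row)
        bitmap += format(int(bits, 2), f"0{digits}X")
    return f"{code_point:04X}:{bitmap}"

ART = (["........"] * 4
       + ["...##...", "..#..#..", "..#..#..", ".#....#.", ".#....#.", ".######."]
       + [".#....#."] * 4
       + ["........"] * 2)

print(art_to_hex(0x41, ART))  # 0041:0000000018242442427E424242420000
```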

History

Roman Czyborra created the Unifont format in 1998 [6] after earlier efforts dating to 1994.

In 2008, Luis Alejandro González Miranda wrote a program to convert Unifont into a TrueType font. Paul Hardy modified it later to support combining characters in the TrueType version.

Later, Richard Stallman published Unifont as a GNU package in October 2013, with Paul Hardy as its maintainer.

References

  1. Paul Hardy (3 June 2023). "Unifont 15.0.05 Released". Retrieved 18 June 2023.
  2. "Minecraft 1.20 Pre-Release 6". Minecraft Official Site. 25 May 2023. Retrieved 25 June 2023.
  3. "GNU Unifont Glyphs". Archived from the original on 12 November 2013. Retrieved 16 July 2008.
  4. "Unifoundry Unicode Utilities". unifoundry.com. Archived from the original on 4 April 2019. Retrieved 16 April 2019.
  5. "GNU Unifont in TrueType format". Archived from the original on 1 February 2016.
  6. "Roman Czyborra's GNU Unifont page". Archived from the original on 27 August 2011. Retrieved 3 June 2009.