Original author(s) | Michael R Sweet |
---|---|
Developer(s) | Michael R Sweet |
Initial release | 1997 |
Stable release | 1.9.17 / 17 September 2023 [1] |
Repository | github |
Written in | C, C++ |
Operating system | Windows 2000+, macOS 10.9+, Linux, UNIX |
Platform | POSIX |
License | GNU GPLv2 |
Website | www |
HTMLDOC is an open-source program, formerly developed commercially, that converts HTML and Markdown web pages and files to EPUB, indexed HTML, PostScript, and PDF files, complete with a table of contents. HTMLDOC can be used from the command line, through a simple GUI, or from a web server. Development originally took place at the author's now-defunct business, Easy Software Products, and now continues on the author's personal web site.
HTMLDOC 1.9 supports most of HTML 3.2 and some elements of HTML 4.01. It has limited support for Unicode and no support for CSS or PDF forms. [2]
HTMLDOC 1.9 supports the following character sets: Windows-874, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-14, ISO-8859-15, and KOI8-R. Characters from different code pages cannot be mixed. There is no support for CJK or Arabic characters, and ISO-8859-13 is not supported. Support for UTF-8 is limited mainly to Western, Latin-alphabet-based, left-to-right languages. HTMLDOC 1.9 uses several proprietary processing instructions for formatting the PDF output; these use the syntax of HTML comments. [3]
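Because characters from different code pages cannot be mixed, it can be useful to check beforehand whether a single charset covers a given text. The following Python sketch is not part of HTMLDOC. It assumes that Python's codec names correspond to the charsets listed above, and the helper name `covering_charsets` is hypothetical.

```python
# Python codec names assumed to correspond to the charsets listed for HTMLDOC 1.9.
SUPPORTED = (
    ["cp874"]
    + [f"cp{n}" for n in range(1250, 1259)]
    + [f"iso8859_{n}" for n in (1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 15)]
    + ["koi8_r"]
)


def covering_charsets(text):
    """Return the listed charsets that can encode every character of `text`."""
    result = []
    for name in SUPPORTED:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this code page
        result.append(name)
    return result
```

An empty result means the text mixes characters that no single listed code page contains, for example Cyrillic letters together with an accented Latin letter such as "é".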
There are no plans to introduce CSS support or broader Unicode support. [2]
HTMLDOC is licensed under the terms of the GNU General Public License version 2. It is legal to compile the sources and distribute the program, and various versions can be found on the Internet. For example, HTMLDOC is included in the Debian operating system. [4]
Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map".
While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.
ISO/IEC 8859-1:1998, Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987. ISO/IEC 8859-1 encodes what it refers to as "Latin alphabet no. 1", consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is the basis for some popular 8-bit character sets and the first two blocks of characters in Unicode.
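The correspondence with the first Unicode code points can be checked directly. This is a minimal sketch using Python's `latin-1` codec, which also maps the control-code byte ranges that the standard itself leaves to other specifications. The function name is hypothetical.

```python
def latin1_matches_unicode():
    """Check that every Latin-1 byte value decodes to the same-numbered code point."""
    data = bytes(range(256))
    text = data.decode("latin-1")
    return [ord(ch) for ch in text] == list(range(256))
```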
ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. The ISO working group maintaining this series of standards has been disbanded.
In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.
Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.
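One consequence of this split is that a numeric character reference names a code point in the document character set, whatever charset the bytes are stored in. The following Python sketch uses the standard `html` module; the sample string and function name are made up for illustration.

```python
import html

# The source uses only ASCII bytes, so it is stored identically in any
# ASCII-compatible charset.
SOURCE = "caf&#233; costs 3 &#x20AC;"


def resolve_references(markup):
    """Replace character references with the code points they name."""
    return html.unescape(markup)
```

Here `&#233;` resolves to U+00E9 and `&#x20AC;` to U+20AC, even though neither character appears in the stored bytes.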
UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.
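The variable length is easy to observe. The short Python sketch below counts the bytes UTF-8 uses for a single character; the helper name `utf8_length` is hypothetical.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 needs to encode the single character `ch`."""
    return len(ch.encode("utf-8"))
```

ASCII characters take one byte, while characters further along the code space take two, three, or four bytes.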
Mojibake is the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
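A typical case can be reproduced in a few lines. The Python sketch below encodes text as UTF-8 and then decodes it as Windows-1252. The function name and default arguments are chosen only for illustration.

```python
def mojibake(text, actual="utf-8", assumed="cp1252"):
    """Decode the bytes of `text` with the wrong charset on purpose.

    Raises UnicodeDecodeError if a byte is undefined in the assumed charset.
    """
    return text.encode(actual).decode(assumed)
```

For example, `mojibake("é")` yields the two characters "Ã©", because the UTF-8 bytes C3 A9 are each read as a separate Windows-1252 character. In this case the damage is reversible by encoding back to Windows-1252 and decoding as UTF-8.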
In computing, a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers. Typically each number represents the binary value in a single byte.
Windows-1252 or CP-1252 is a single-byte character encoding of the Latin alphabet that was used by default in Microsoft Windows for English and many Romance and Germanic languages including Spanish, Portuguese, French, and German. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. All modern operating systems, including Windows, now use Unicode code points and text encodings by default, which are portable across all of the world's major languages.
ISO/IEC 8859-9:1999, Information technology — 8-bit single-byte coded graphic character sets — Part 9: Latin alphabet No. 5, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1989. It is designated ECMA-128 by Ecma International and TS 5881 as a Turkish standard. It is informally referred to as Latin-5 or Turkish. It was designed to cover the Turkish language and was intended to be more useful for it than the ISO/IEC 8859-3 encoding. It is identical to ISO/IEC 8859-1 except for the replacement of six Icelandic characters with characters unique to the Turkish alphabet. In Turkish, the uppercase of i is İ and the lowercase of I is ı.
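The six replaced positions can be listed by decoding each byte with both encodings. This is a minimal Python sketch that relies on the standard `iso8859_1` and `iso8859_9` codecs; the function name is hypothetical.

```python
def differing_bytes():
    """Map each byte value that differs to its (ISO 8859-1, ISO 8859-9) characters."""
    diffs = {}
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        latin1 = raw.decode("iso8859_1")
        latin5 = raw.decode("iso8859_9")
        if latin1 != latin5:
            diffs[value] = (latin1, latin5)
    return diffs
```

The result has six entries, at bytes D0, DD, DE, F0, FD, and FE, where Ð, Ý, Þ, ð, ý, and þ give way to Ğ, İ, Ş, ğ, ı, and ş.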
Windows-1251 is an 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Ukrainian, Belarusian, Bulgarian, Serbian Cyrillic, Macedonian and other languages.
In computing and typesetting, a soft hyphen or syllable hyphen, abbreviated SHY, is a code point reserved in some coded character sets for the purpose of breaking words across lines: it is rendered as a visible hyphen if it falls at the end of a line but remains invisible within the line.
In Unix and Unix-like operating systems, iconv is a command-line program and a standardized application programming interface (API) used to convert between different character encodings. "It can convert from any of these encodings to any other, through Unicode conversion."
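The same "through Unicode" approach can be sketched in Python by decoding to a Unicode string and encoding again. This illustrates the idea only and is not how iconv itself is implemented; the function name `convert` is hypothetical.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two charsets by going through Unicode."""
    text = data.decode(source)   # source bytes -> Unicode code points
    return text.encode(target)   # Unicode code points -> target bytes
```

The conversion raises an error if the data contains bytes that are invalid in the source charset or characters that the target charset cannot represent.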
Code page 850 is a code page used under DOS and Psion's EPOC16 operating systems in Western Europe. Depending on the country setting and system configuration, code page 850 is the primary code page and default OEM code page in many countries, including various English-speaking locales, whilst other English-speaking locales default to use the hardware code page 437.
Windows code page 1253, commonly known by its IANA-registered name Windows-1253 or abbreviated as cp1253, is a Microsoft Windows code page used to write modern Greek. It is not capable of supporting the older polytonic Greek.
Windows-1254 is a code page used under Microsoft Windows to write Turkish, the language it was designed for. Characters with codepoints A0 through FF are compatible with ISO 8859-9, but the CR range, which is reserved for C1 control codes in ISO 8859, is instead used for additional characters. It is similar to ISO/IEC 8859-1 except for the replacement of six Icelandic characters with characters unique to the Turkish alphabet.
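The relationship between the two encodings can be checked byte by byte. The Python sketch below uses the standard `cp1254` and `iso8859_9` codecs. Some Windows-1254 bytes in the 80 to 9F range are undefined in Python's codec, so that side is decoded with `errors="replace"`. The function name is hypothetical.

```python
def compare(value):
    """Decode one byte value as (Windows-1254, ISO 8859-9)."""
    raw = bytes([value])
    return raw.decode("cp1254", errors="replace"), raw.decode("iso8859_9")


def upper_range_identical():
    """Check that bytes A0 through FF decode identically in both encodings."""
    return all(a == b for a, b in map(compare, range(0xA0, 0x100)))
```

Byte 80, for instance, is the euro sign in Windows-1254 but a control code in ISO 8859-9.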
Several 8-bit character sets (encodings) were designed for binary representation of common Western European languages, which use the Latin alphabet, a few additional letters and ones with precomposed diacritics, some punctuation, and various symbols. These character sets also happen to support many other languages such as Malay, Swahili, and Classical Latin.
Windows code pages are sets of characters or code pages used in Microsoft Windows from the 1980s and 1990s. Windows code pages were gradually superseded when Unicode was implemented in Windows, although they are still supported both within Windows and other platforms, and still apply when Alt code shortcuts are used.
Specials is a short Unicode block of characters allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 code points, five have been assigned since Unicode 3.0: