Filename extension | .dbk, .xml |
---|---|
Internet media type | application/docbook+xml |
Developed by | OASIS |
Type of format | markup language |
Extended from | SGML, XML |
Standard | 5.2 (February 2024), 4.5 (October 2006) |
Open format? | Yes |
DocBook is a semantic markup language for technical documentation. It was originally intended for writing technical documents related to computer hardware and software, but it can be used for any other sort of documentation. [1]
As a semantic language, DocBook enables its users to create document content in a presentation-neutral form that captures the logical structure of the content; that content can then be published in a variety of formats, including HTML, XHTML, EPUB, PDF, man pages, WebHelp [2] and HTML Help, without requiring users to make any changes to the source. In other words, when a document is written in DocBook format it becomes easily portable into other formats, rather than needing to be rewritten.
DocBook is an XML language. In its current version (5.x), DocBook's language is formally defined by a RELAX NG schema with integrated Schematron rules. (There are also W3C XML Schema+Schematron and Document Type Definition (DTD) versions of the schema available, but these are considered non-standard.)
As a semantic language, DocBook documents do not describe what their contents "look like", but rather the meaning of those contents. For example, rather than explaining how the abstract for an article might be visually formatted, DocBook simply says that a particular section is an abstract. It is up to an external processing tool or application to decide where on a page the abstract should go and what it should look like or whether or not it should be included in the final output at all.
DocBook provides a vast number of semantic element tags. They are divided into three broad categories, namely structural, block-level, and inline.
Structural tags specify broad characteristics of their contents. The book
element, for example, specifies that its child elements represent the parts of a book. This includes a title, chapters, glossaries, appendices, and so on. DocBook's structural tags include, but are not limited to:
set
: Titled collection of one or more book
s or article
s, can be nested with other setsbook
: Titled collection of chapter
s, article
s, and/or part
s, with optional glossaries, appendices, etc.part
: Titled collection of one or more chapter
s—can be nested with other parts, and may have special introductory textarticle
: Titled, unnumbered collection of block-level elementschapter
: Titled, numbered collection of block-level elements—chapters don't require explicit numbers, a chapter number is the number of previous chapter elements in the XML document plus 1appendix
: Contains text that represents an appendix dedication
: Text represents the dedication of the contained structural elementStructural elements can contain other structural elements. Structural elements are the only permitted top-level elements in a DocBook document.
Block-level tags are elements like paragraph, lists, etc. Not all these elements can directly contain text. Sequential block-level elements render one "after" another. After, in this case, can differ depending on the language. In most Western languages, "after" means below: text paragraphs are printed down the page. Other languages' writing systems can have different directionality; for example, in Japanese, paragraphs are often printed in downward columns, with the columns running from right to left, so "after" in that case would be to the left. DocBook semantics are entirely neutral to these kinds of language-based concepts.
Inline-level tags are elements like emphasis, hyperlinks, etc. They wrap text within a block-level element. These elements do not cause the text to break when rendered in a paragraph format, but typically they cause the document processor to apply some kind of distinct typographical treatment to the enclosed text, by changing the font, size, or similar attributes. (The DocBook specification does say that it expects different typographical treatment, but it does not offer specific requirements as to what this treatment may be.) That is, a DocBook processor doesn't have to transform an emphasis
tag into italics. A reader-based DocBook processor could increase the size of the words, or, a text-based processor could use bold instead of italics.
<?xml version="1.0" encoding="UTF-8"?><bookxml:id="simple_book"xmlns="http://docbook.org/ns/docbook"version="5.0"><title>Verysimplebook</title><chapterxml:id="chapter_1"><title>Chapter1</title><para>Helloworld!</para><para>Ihopethatyourdayisproceeding<emphasis>splendidly</emphasis>!</para></chapter><chapterxml:id="chapter_2"><title>Chapter2</title><para>Helloagain,world!</para></chapter></book>
Semantically, this document is a "book", with a "title", that contains two "chapters" each with their own "titles". Those "chapters" contain "paragraphs" that have text in them. The markup is fairly readable in English.
In more detail, the root element of the document is book
. All DocBook elements are in an XML Namespace, so the root element has an xmlns attribute to set the current namespace. Also, the root element of a DocBook document must have a version that specifies the version of the format that the document is built on.
(XML documents can include elements from multiple namespaces at once, like the id
attributes in the example.)
A book
element must contain a title
, or an info
element containing a title
. This must be before any child structural elements. Following the title are the structural children, in this case, two chapter
elements. Each of these must have a title
. They contain para
block elements, which can contain free text and other inline elements like the emphasis
in the second paragraph of the first chapter.
Rules are formally defined in the DocBook XML schema. Appropriate programming tools can validate an XML document (DocBook or otherwise), against its corresponding schema, to determine if (and where) the document fails to conform to that schema. XML editing tools can also use schema information to avoid creating non-conforming documents in the first place.
Because DocBook is XML, documents can be created and edited with any text editor. A dedicated XML editor is likewise a functional DocBook editor. DocBook provides schema files for popular XML schema languages, so any XML editor that can provide content completion based on a schema can do so for DocBook. Many graphical or WYSIWYG XML editors come with the ability to edit DocBook like a word processor. [3]
Tables, list items, and other stylized content can be copied and pasted into the DocBook editor and will be preserved in the DocBook XML output. [3] Because DocBook conforms to a well-defined XML schema, documents can be validated and processed using any tool or programming language that includes XML support.
DocBook began in 1991 in discussion groups on Usenet and eventually became a joint project of HAL Computer Systems and O'Reilly & Associates and eventually spawned its own maintenance organization (the Davenport Group) before moving in 1998 to the SGML Open consortium, which subsequently became OASIS. DocBook is currently maintained by the DocBook Technical Committee at OASIS. [4]
DocBook is available in both SGML and XML forms, as a DTD. RELAX NG and W3C XML Schema forms of the XML version are available. Starting with DocBook 5, the RELAX NG version is the "normative" form from which the other formats are generated.
DocBook originally started out as an SGML application, but an equivalent XML application was developed and has now replaced the SGML one for most uses. (Starting with version 4 of the SGML DTD, the XML DTD continued with this version numbering scheme.) Initially, a key group of software companies used DocBook since their representatives were involved in its initial design. Eventually, however, DocBook was adopted by the open source community where it has become a standard for creating documentation for many projects, including FreeBSD, KDE, GNOME desktop documentation, the GTK+ API references, the Linux kernel documentation (which, as of July 2016, is transitioning to Sphinx/reStructuredText [5] [6] ), and the work of the Linux Documentation Project.
Until DocBook 5, DocBook was defined normatively by a Document Type Definition (DTD). Because DocBook was built originally as an application of SGML, the DTD was the only available schema language. DocBook 4.x formats can be SGML or XML, but the XML version does not have its own namespace.
DocBook 4.x formats had to live within the restrictions of being defined by a DTD. The most significant restriction was that an element name uniquely defines its possible contents. That is, an element named info
must contain the same information no matter where it is in the DocBook file. As such, there are many kinds of info elements in DocBook 4.x: bookinfo
, chapterinfo
, etc. Each has a slightly different content model, but they do share some of their content model. Additionally, they repeat context information. The book's info
element is that, because it is a direct child of the book; it does not need to be named specially for a human reader. However, because the format was defined by a DTD, it did have to be named as such. The root element does not have or need a version, as the version is built into the DTD declaration at the top of a pre-DocBook 5 document.
DocBook 4.x documents are not compatible with DocBook 5, but can be converted into DocBook 5 documents via an XSLT stylesheet. One (db4-upgrade.xsl
) is provided as part of the distribution of the DocBook 5 schema and specification package. [7]
DocBook files are used to prepare output files in a wide variety of formats. Nearly always, this is accomplished using DocBook XSL stylesheets. These are XSLT stylesheets that transform DocBook documents into a number of formats (HTML, XSL-FO for later conversion into PDF, etc.). These stylesheets can be sophisticated enough to generate tables of contents, glossaries, and indexes. They can oversee the selection of particular designated portions of a master document to produce different versions of the same document (such as a "tutorial" or a "quick-reference guide", where each of these consist of a subset of the material). Users can write their own customized stylesheets or even a full-fledged program to process the DocBook into an appropriate output format as their needs dictate.
Norman Walsh and the DocBook Project development team maintain the key application for producing output from DocBook source documents: A set of XSLT stylesheets (as well as a legacy set of DSSSL stylesheets) that can generate high-quality HTML and print (FO/PDF) output, as well as output in other formats, including RTF, man pages and HTML Help.
Web help [2] is a chunked HTML output format in the DocBook XSL stylesheets that was introduced in version 1.76.1. The documentation for web help [8] also provides an example of web help and is part of the DocBook XSL distribution.
The major features are its fully CSS-based page layout, search of the help content, and a table of contents in collapsible-tree form. Search has stemming, match highlighting, explicit page-scoring, and the standard multilingual tokenizer. The search and TOC are in a pane that appears as a frameset, but is actually implemented with div tags and cookies (so that it is progressive).
DocBook offers a large number of features that may be overwhelming to a new user. For those who want the convenience of DocBook without a steep learning curve, Simplified DocBook was designed. It is a small subset of DocBook designed for single documents such as articles or white papers (i.e., "books" are not supported). The Simplified DocBook DTD is currently at version 1.1. [9]
Ingo Schwarze, the author of OpenBSD's mandoc, considers DocBook inferior to the semantic mdoc macro for man pages. In an attempt to write a DocBook-to-mdoc converter (previous converters like docbook-to-man do not cover semantic elements), he finds the semantic parts "bloated, redundant, and incomplete at the same time" compared to elements covered in mdoc. Moreover, Schwarze finds the DocBook specification not specific enough about the use of tags, the language non-portable across versions, rough in details and overall inconsistent. [10]
Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.
A markuplanguage is a text-encoding system which specifies the structure and formatting of a document and potentially the relationships among its parts. Markup can control the display of a document or enrich its content to facilitate automated processing.
The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.
In computing, the term Extensible Stylesheet Language (XSL) is used to refer to a family of languages used to transform and render XML documents.
XSLT is a language originally designed for transforming XML documents into other XML documents, or other formats such as HTML for web pages, plain text or XSL Formatting Objects, which may subsequently be converted to other formats, such as PDF, PostScript and PNG. Support for JSON and plain-text transformation was added in later updates to the XSLT 1.0 specification.
In computing, the Java API for XML Processing (JAXP), one of the Java XML application programming interfaces (APIs), provides the capability of validating and parsing XML documents. It has three basic parsing interfaces:
The Document Style Semantics and Specification Language (DSSSL) is an international standard developed to provide stylesheets for SGML documents.
Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It is a structural schema language expressed in XML using a small number of elements and XPath languages. In many implementations, the Schematron XML is processed into XSLT code for deployment anywhere that XSLT can be used.
XSL-FO is a markup language for XML document formatting that is most often used to generate PDF files. XSL-FO is part of XSL, a set of W3C technologies designed for the transformation and formatting of XML data. The other parts of XSL are XSLT and XPath. Version 1.1 of XSL-FO was published in 2006.
The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and conference series, and maintains the TEI technical standard, a journal, a wiki, a GitHub repository and a toolchain.
An XML editor is a markup language editor with added functionality to facilitate the editing of XML. This can be done using a plain text editor, with all the code visible, but XML editors have added facilities like tag completion and menus and buttons for tasks that are common in XML editing, based on data supplied with document type definition (DTD) or the XML tree.
An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. These constraints are generally expressed using some combination of grammatical rules governing the order of elements, Boolean predicates that the content must satisfy, data types governing the content of elements and attributes, and more specialized rules such as uniqueness and referential integrity constraints.
The identity transform is a data transformation that copies the source data into the destination data without change.
The Oxygen XML Editor is a multi-platform XML editor, XSLT/XQuery debugger and profiler with Unicode support. It is a Java application so it can run in Windows, Mac OS X, and Linux. It also has a version that can run as an Eclipse plugin.
The DocBook XSL stylesheets are a set of XSLT stylesheets for the XML-based DocBook language.
A Formal Public Identifier (FPI) is a short piece of text with a particular structure that may be used to uniquely identify a product, specification or document. FPIs were introduced as part of Standard Generalized Markup Language (SGML), and serve particular purposes in formats historically derived from SGML. Some of their most common uses are as part of document type declarations (DOCTYPEs) and document type definitions (DTDs) in SGML, XML and historically HTML, but they are also used in the vCard and iCalendar file formats to identify the software product which generated the file.
A structured document is an electronic document where some method of markup is used to identify the whole and parts of the document as having various meanings beyond their formatting. For example, a structured document might identify a certain portion as a "chapter title" rather than as "Helvetica bold 24" or "indented Courier". Such portions in general are commonly called "components" or "elements" of a document.
A processing instruction (PI) is an SGML and XML node type, which may occur anywhere in a document, intended to carry instructions to the application.
The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online. It is a technical standard developed by the National Information Standards Organization (NISO) and approved by the American National Standards Institute with the code Z39.96-2012.
Norman Walsh is the principal author of the book DocBook: The Definitive Guide, the official documentation of DocBook. This book is available online under the GFDL, and also as a print publication.