Structured document

Last updated November 08, 2025

A structured document is an electronic document where some method of markup is used to identify the whole and parts of the document as having various meanings beyond their formatting. For example, a structured document might identify a certain portion as a "chapter title" (or "code sample" or "quatrain") rather than as "Helvetica bold 24" or "indented Courier". Such portions in general are commonly called "components" or "elements" of a document.

Overview

Structured documents generally focus on labeling things in ways that serve a variety of processing purposes, not merely formatting. For example, explicit labeling of "chapter title" or "emphasis" is far more useful to systems for the visually impaired than merely "Helvetica bold 24" or "italic". In the same way, meaningful labeling of the many items on a technical information sheet enables far better integration with databases, search systems, online catalogs, and so on.

Structured documents generally support at least hierarchical structures, for example lists, not merely list items; sections, not merely section headings; and so on. This is in stark contrast to formatting-oriented systems. High-end systems also support multiple independent and/or overlapping sets of components.^[1]

Structured document systems allow component types and their permitted combinations to be defined through a "schema," similar to a database schema. Formal languages for specifying schemas include XSD, RELAX NG, and Schematron. A document that follows the rules of its schema is considered "valid."^[2] Some systems support flexible component types while still enforcing syntactic rules.
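The idea of validity can be illustrated with a minimal sketch in Python. The element names and the `SCHEMA` table below are hypothetical and exist only for this illustration; the check covers nothing more than which child element types may appear inside which parent, whereas real schema languages such as XSD or RELAX NG also constrain order, repetition, attributes, and text content.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema (an assumption for this sketch): each element
# type maps to the set of child element types it may contain.
SCHEMA = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": {"emphasis"},
    "emphasis": set(),
}


def validate(element, schema=SCHEMA, path=""):
    """Return a list of error strings; an empty list means 'valid'."""
    here = f"{path}/{element.tag}"
    if element.tag not in schema:
        return [f"{here}: unknown element type"]
    errors = []
    for child in element:
        if child.tag not in schema[element.tag]:
            errors.append(f"{here}: <{child.tag}> is not allowed here")
        else:
            # Only descend into children that are permitted at this point.
            errors.extend(validate(child, schema, here))
    return errors


good = ET.fromstring(
    "<book><title>T</title><chapter><title>One</title>"
    "<para>Some <emphasis>text</emphasis>.</para></chapter></book>"
)
bad = ET.fromstring("<book><para>Stray paragraph</para></book>")

print(validate(good))  # no errors: the document follows the toy schema
print(validate(bad))   # one error: <para> directly inside <book>
```

Keeping the rules in a table separate from the checking function mirrors the separation between a schema and the validator that applies it.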

Lie and Saarela noted that the "Standard Generalized Markup Language (SGML) has pioneered the concept of structured documents",^[3] although earlier systems such as Scribe, Augment, and FRESS provided many structured-document features and capabilities. SGML's offspring XML is now favored.

One very widely used representation for structured documents is HTML, a schema defined and described by the W3C. However, HTML has not only tags for meaning-oriented components such as paragraph, title, and code, but also format-oriented ones such as italic, bold, and most table markup. In practice, HTML is sometimes used as a structured document system, but often as a formatting language.

Many domains use structured documents via domain-specific schemas they have co-operatively developed, such as JATS for journal publishing, TEI for literary documents, UBL and EDI for business interchange, XTCE for spacecraft telemetry, REST for Web interfaces, and countless more. Most of these cases use specific schemas based on XML.

XML is the universal format for structured documents and data on the Web

— XHTML2 Working Group, W3C

Structural semantics

In writing structured documents the focus is on encoding the logical structure of a document, with less or even no explicit work devoted to its presentation to humans by printed pages or screens (in some cases, no such use is even expected). Structured documents can easily be processed by computer systems to extract and present derivative forms of the document. In most Wikipedia articles for example, a table of contents is automatically generated from the different heading tags in the body of the document. Because the SGML conversion of the Oxford English Dictionary explicitly distinguished the many different meanings which attach to the print version's use of italics, search tools can retrieve entries based on etymology, quotations, and many other features of interest. When HTML provides structural rather than merely formatting information, visually impaired users can be easily given a more useful reading interface. When travel companies provide itineraries as structured documents rather than just displays, user tools can easily extract the necessary facts and pass them on to calendar or other applications.
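As a rough illustration of deriving a table of contents from heading tags, the following Python sketch collects the headings of an HTML fragment using the standard library's `html.parser`. The class name `HeadingCollector`, the function `table_of_contents`, and the sample markup are assumptions made for this example; it is not how any particular wiki engine builds its contents list.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for <h1>..<h6> elements."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.toc = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        # Text is recorded only while a heading element is open.
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.toc.append((self._level, "".join(self._text).strip()))
            self._level = None


def table_of_contents(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.toc


SAMPLE = (
    "<body><h1>Structured document</h1><p>Intro.</p>"
    "<h2>Overview</h2><p>Text.</p>"
    "<h2>Structural <em>semantics</em></h2><p>More text.</p></body>"
)

# Indent each entry according to its heading level.
for level, text in table_of_contents(SAMPLE):
    print("  " * (level - 1) + text)
```

The same traversal would be impossible if the headings were marked only as "bold 24 point" text, since nothing would distinguish a heading from any other large bold run.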

In HTML, part of the logical structure of a document may be the document body, <body>, containing a first-level heading, <h1>, and a paragraph, <p>.

<body><h1>Structured document</h1><p>A <strong class="selflink">structured document</strong> is an <a href="/wiki/Electronic_document" title="Electronic document">electronic document</a> where some method of <a href="/wiki/Markup_language" title="Markup language">markup</a> is used to identify the whole and parts of the document as having various meanings beyond their formatting.</p></body>

One of the most attractive features of structured documents is that they can be reused in many contexts and presented in various ways on mobile phones, TV screens, speech synthesisers, and any other device which can be programmed to process them.

Other semantics

Other meaning can be ascribed to non-structural text in the same sense as to larger objects, and is still considered "document structure" because it expresses claims about the scope and nature or ontology of portions of a document, rather than instructions about its presentation. In the HTML fragment above, the <strong> element means that the enclosed text is emphatic. In visual terms this is commonly rendered via bold, just like <b>, but a speech interface would instead likely use voice inflection. The term semantic markup excludes markup like <b>, which directly expresses no meaning other than an instruction to a visual display, although an intelligent agent may discern an underlying structural meaning. The "strong" tag is "descriptive" or "structural" in that it is intended to label an abstract, quasi-linguistic property of its content, rather than describing the appropriate presentation in some particular medium.

Some other structural tags in HTML include <abbr>, <acronym>, <address>, <cite>, <del>, <dfn>, <ins>, <kbd>, and <q>. Other schemas such as DocBook and TEI have far larger selections.

The anchor <a> tag is used for another slightly different kind of structure: the interconnection or cross-reference structure, rather than the interval section division. This is a type of structure, and one can create alternate markup for documents that expresses the same particular structures in either way, e.g. representing section contents with transclusion rather than navigational hyperlink presentations.

HTML from early on has had tags which express presentational semantics, such as bold (<b>) or italic (<i>), as well as tags which alter font sizes or have other effects on the presentation.^[4] Modern versions of markup languages discourage such markup in favor of descriptive markup that is mapped to particular presentations via style sheets, a method pioneered by systems such as Scribe and FRESS. Different style sheets can be attached to any markup, semantic or presentational, to produce different presentations, although mapping a tag name "italic" to boldface presentation is less intuitive.
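The mapping from descriptive markup to presentation can be sketched in a few lines of Python. The two "style sheets" below (`VISUAL` and `SPEECH`) and the bracketed speech cues are hypothetical stand-ins invented for this example; real style sheet languages such as CSS are far richer, but the principle of one document and several interchangeable mappings is the same.

```python
import xml.etree.ElementTree as ET

# Two hypothetical "style sheets": each maps an element name to the
# (prefix, suffix) strings emitted around that element's content.
VISUAL = {"strong": ("**", "**"), "cite": ("_", "_")}
SPEECH = {"strong": ("[stress]", "[/stress]"), "cite": ("[title]", "[/title]")}


def render(element, sheet):
    """Flatten an element tree to text, decorating elements per the sheet."""
    prefix, suffix = sheet.get(element.tag, ("", ""))
    parts = [prefix, element.text or ""]
    for child in element:
        parts.append(render(child, sheet))
        parts.append(child.tail or "")  # text that follows the child element
    parts.append(suffix)
    return "".join(parts)


doc = ET.fromstring("<p>Read <cite>Hamlet</cite> <strong>tonight</strong>.</p>")

print(render(doc, VISUAL))  # Read _Hamlet_ **tonight**.
print(render(doc, SPEECH))  # Read [title]Hamlet[/title] [stress]tonight[/stress].
```

Because the document says only that "tonight" is strongly emphasized, each sheet is free to choose a rendering suited to its medium; a document marked up as "bold" would give the speech sheet nothing meaningful to map.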

Context and intent

In principle, just what constitutes "structure" vs. non-structure can vary. In a book specifically about typography, tagging something as "italic" or "bold" may well be the whole point. For example, a discussion of when to use particular styles will likely want to give examples and counter-examples, which would no longer make sense if the rendering is not in sync with the prose. Similarly, a particular edition of a document may be of interest not only for its content but for its typographic practice as well, in which case describing that practice is not only desirable but necessary. This problem is not unique to document structure, however; it also arises in grammar when discussing grammar, and in many other cases.

References

[1] DeRose, Steven (2004). Markup Overlap: A Review and a Horse. Extreme Markup Languages 2004. Montréal. CiteSeerX 10.1.1.108.9959. Retrieved 14 October 2014.
[2] "Document type definitions (DTDs) - overview". IBM. 5 March 2021. Archived from the original on 7 March 2025. Retrieved 7 March 2025.
[3] Håkon Wium Lie; Janne Saarela (1998). "Multi-purpose publishing using HTML, XML, and CSS". W3.org. Association for Computing Machinery.
[4] "A sample HTML instance". Retrieved 5 March 2014.

