Structured document

A structured document is an electronic document where some method of markup is used to identify the whole and parts of the document as having various meanings beyond their formatting. For example, a structured document might identify a certain portion as a "chapter title" (or "code sample" or "quatrain") rather than as "Helvetica bold 24" or "indented Courier". Such portions in general are commonly called "components" or "elements" of a document.

Overview

Structured documents generally focus on labeling things in ways that can serve a variety of processing purposes, not merely formatting. For example, explicit labeling of "chapter title" or "emphasis" is far more useful to systems for the visually impaired than merely "Helvetica bold 24" or "italic". In the same way, meaningful labeling of the many items on a technical information sheet enables far better integration with databases, search systems, online catalogs, and so on.

Structured documents generally support at least hierarchical structures, for example lists, not merely list items; sections, not merely section headings; and so on. This is in stark contrast to formatting-oriented systems. High-end systems also support multiple independent and/or overlapping sets of components. [1]

Structured document systems commonly permit creating explicit rules defining component types and how they may be combined. Such a set of rules is called a "schema" by analogy with database schemas. Several formal languages exist for specifying them, such as XSD, Relax NG, and Schematron. A structured document which obeys the rules of the schema is commonly called "valid according to that schema". Some systems also support documents with components of arbitrary types and combinations, but still with syntactic rules for how those components are identified.
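The idea of validity can be illustrated with a minimal sketch. The schema below is a hypothetical toy that records only which component types may nest inside which; it is not XSD, Relax NG, or Schematron, all of which are far more expressive. The names TOY_SCHEMA and validation_errors, and the sample element names, are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: each component type maps to the set of
# component types it may directly contain. This checks nesting only.
TOY_SCHEMA = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para", "list"},
    "list": {"item"},
    "title": set(),
    "para": {"emphasis"},
    "item": {"emphasis"},
    "emphasis": set(),
}


def validation_errors(element, schema=TOY_SCHEMA, path=""):
    """Return messages describing every nesting-rule violation found."""
    here = f"{path}/{element.tag}"
    if element.tag not in schema:
        # Undeclared component type: report it and do not descend further.
        return [f"{here}: unknown component type"]
    errors = []
    for child in element:
        if child.tag in schema and child.tag not in schema[element.tag]:
            errors.append(f"{here}: <{child.tag}> not allowed here")
        errors.extend(validation_errors(child, schema, here))
    return errors


valid_doc = ET.fromstring(
    "<book><title>T</title><chapter><title>One</title>"
    "<list><item>a</item><item>b</item></list></chapter></book>"
)
invalid_doc = ET.fromstring(
    "<book><para>stray</para><chapter><item>loose</item></chapter></book>"
)

print(validation_errors(valid_doc))    # empty list: valid under the toy schema
print(validation_errors(invalid_doc))  # two nesting violations
```

A document is "valid according to" this toy schema exactly when the returned list is empty. Real schema languages additionally constrain order, repetition, attributes, and data types.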

Lie and Saarela noted that the "Standard Generalized Markup Language (SGML) has pioneered the concept of structured documents", [2] although earlier systems such as Scribe, Augment, and FRESS provided many structured-document features and capabilities, and SGML's offspring XML is now favored.

One very widely used representation for structured documents is HTML, a schema defined and described by the W3C. However, HTML has not only tags for meaning-oriented components such as paragraph, title, and code, but also format-oriented ones such as italic, bold, and most table markup. In practice, HTML is sometimes used as a structured document system but is often used as a formatting language.

Many domains use structured documents via domain-specific schemas they have cooperatively developed, such as JATS for journal publishing, TEI for literary documents, UBL and EDI for business interchange, XTCE for spacecraft telemetry, REST for Web interfaces, and countless more. Most of these use specific schemas based on XML.

XML is the universal format for structured documents and data on the Web

Structural semantics

In writing structured documents, the focus is on encoding the logical structure of a document, with less or even no explicit work devoted to its presentation to humans by printed pages or screens (in some cases, no such use is even expected). Structured documents can easily be processed by computer systems to extract and present derivative forms of the document. In most Wikipedia articles, for example, a table of contents is automatically generated from the different heading tags in the body of the document. Because the SGML conversion of the Oxford English Dictionary explicitly distinguished the many different meanings which attach to the print version's use of italics, search tools can retrieve entries based on etymology, quotations, and many other features of interest. When HTML provides structural rather than merely formatting information, visually impaired users can easily be given a more useful reading interface. When travel companies provide itineraries as structured documents rather than just displays, user tools can easily extract the necessary facts and pass them on to calendar or other applications.
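The table-of-contents case can be sketched with Python's standard html.parser module. This is an illustration only, not how any particular wiki software works; the class HeadingCollector, the function table_of_contents, and the sample markup are assumptions introduced here.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements in document order."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        # Only text inside an open heading contributes to the outline.
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def table_of_contents(html):
    """Return an indented outline derived purely from heading elements."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(
        "  " * (level - 1) + text for level, text in collector.headings
    )


SAMPLE = (
    "<body><h1>Structured document</h1><p>Intro.</p>"
    "<h2>Overview</h2><p>Text.</p>"
    "<h2>Structural semantics</h2><p>More <em>text</em>.</p>"
    "<h3>Other semantics</h3></body>"
)

print(table_of_contents(SAMPLE))
```

The outline depends only on the heading labels, so it would be produced identically whatever fonts or sizes a style sheet assigns to those headings. Had the headings been marked up merely as large bold text, no such extraction would be reliable.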

In HTML, part of the logical structure of a document may be the document body (<body>), containing a first-level heading (<h1>) and a paragraph (<p>).

<body><h1>Structured document</h1><p>A <strong class="selflink">structured document</strong> is an <a href="/wiki/Electronic_document" title="Electronic document">electronic document</a> where some method of <a href="/wiki/Markup_language" title="Markup language">markup</a> is used to identify the whole and parts of the document as having various meanings beyond their formatting.</p></body>

One of the most attractive features of structured documents is that they can be reused in many contexts and presented in various ways on mobile phones, TV screens, speech synthesisers, and any other device which can be programmed to process them.

Other semantics

Other meaning can be ascribed to text which isn't "structural" in quite the same sense as larger objects, but is still considered "document structure" because it expresses claims about the scope and nature or ontology of portions of a document, rather than instructions about its presentation. In the HTML fragment above, the <strong> element means that the enclosed text is emphatic. In visual terms this is commonly rendered via bold, just like <b>; but a speech interface would instead likely use voice inflection. The term semantic markup excludes markup like <b> which directly expresses no meaning other than an instruction to a visual display (although an intelligent agent may be able to discern a structural meaning lurking behind the tag). The "strong" tag is "descriptive" or "structural" in that it is intended to label an abstract, quasi-linguistic property of its content, rather than to describe the appropriate presentation in some particular medium.

Some other structural tags in HTML include <abbr>, <acronym>, <address>, <cite>, <del>, <dfn>, <ins>, <kbd>, and <q>. Other schemas such as DocBook and TEI have far larger selections.

The anchor <a> tag is used for another, slightly different kind of structure, namely the interconnection or cross-reference structure, rather than the internal section division. This is most definitely structure, and in fact it is possible to create alternate markup for documents that expresses the same particular structures in either way (for example, using transclusion to represent section contents, rather than navigational hyperlink presentations).

HTML from early on has also had tags which express presentational semantics, such as bold (<b>) or italic (<i>), as well as tags that alter font sizes or otherwise affect the presentation. [3] Modern versions of markup languages discourage such markup in favor of descriptive markup which is mapped to particular presentations via style sheets, a method pioneered by systems such as Scribe and FRESS. Different style sheets can be attached to any markup, semantic or presentational, to produce different presentations, although mapping a tag name "italic" to boldface presentation is not entirely intuitive.
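The separation of descriptive markup from its presentation can be sketched as follows. The two "style sheets" here are plain Python dictionaries invented for this illustration; they are not CSS or any real style sheet language, and the element names, bracketed speech cues, and the function render are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Two hypothetical "style sheets": each maps a descriptive element name
# to the text emitted before and after that element's content.
VISUAL_STYLE = {
    "title": ("# ", "\n"),
    "para": ("", "\n"),
    "strong": ("**", "**"),
}
SPEECH_STYLE = {
    "title": ("[heading] ", "\n"),
    "para": ("", " [pause]\n"),
    "strong": ("[stressed] ", " [normal]"),
}


def render(element, style):
    """Render an element tree to text using the given style mapping."""
    before, after = style.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")  # text following the child element
    parts.append(after)
    return "".join(parts)


DOC = ET.fromstring(
    "<doc><title>Markup</title>"
    "<para>Labels are <strong>not</strong> formatting.</para></doc>"
)

print(render(DOC, VISUAL_STYLE))
print(render(DOC, SPEECH_STYLE))
```

The document itself is never changed; only the mapping differs. Because the element is named "strong" rather than "bold", the speech mapping can attach vocal stress to it without any mismatch between the tag name and its rendering.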

Context and intent

In principle, just what constitutes "structure" vs. non-structure can vary. In a book specifically about typography, tagging something as "italic" or "bold" may well be the whole point. For example, a discussion of when to use particular styles will likely want to give examples and counter-examples, which would no longer make sense if the rendering is not in sync with the prose. Similarly, a particular edition of a document may be of interest not only for its content but for its typographic practice as well, in which case describing that practice is not only desirable but necessary. This problem is not unique to document structure, however; it also arises in grammar when discussing grammar, and in many other cases.

References

  1. DeRose, Steven (2004). Markup Overlap: A Review and a Horse. Extreme Markup Languages 2004. Montréal. CiteSeerX 10.1.1.108.9959. Retrieved 2014-10-14.
  2. Håkon Wium Lie; Janne Saarela (1998). "Multi-purpose publishing using HTML, XML, and CSS". W3.org. Association for Computing Machinery.
  3. "A sample HTML instance". Retrieved 5 March 2014.