Well-formed document

A well-formed document in XML is a document that "adheres to the syntax rules specified by the XML 1.0 specification in that it must satisfy both physical and logical structures". [1]

Requirements

At its base level well-formed documents require that:

A well-formed document must also follow the specification's rules for declaring and handling entities. Tag names are case-sensitive, and attribute values must be delimited with quotation marks. Empty elements have their own syntax rules, and overlapping tags mean a document is not well-formed. Ideally, a well-formed document conforms to the design goals of XML. Other key syntax rules provided in the specification include:

A valid XML document is defined in the XML specification as a well-formed XML document that also conforms to the rules of a Document Type Definition (DTD). According to the JavaCommerce.com XML tutorial, "Well formed XML documents simply markup pages with descriptive tags. You don't need to describe or explain what these tags mean. In other words a well formed XML document does not need a DTD, but it must conform to the XML syntax rules. If all tags in a document are correctly formed and follow XML guidelines, then a document is considered as well formed." [2] [3]
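
The syntax rules above can be illustrated with a minimal Python sketch that uses the standard library's `xml.etree.ElementTree` parser as the well-formedness check. The helper name `is_well_formed` and the sample documents are hypothetical and chosen only for illustration. None of the samples declares a DTD, so the check concerns well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents; none of them declares a DTD.
samples = {
    "properly nested": "<note><to>Ann</to><body lang='en'>Hi</body></note>",
    "empty element": "<note><br/></note>",
    "case mismatch": "<Note>Hi</note>",          # tag names are case-sensitive
    "unquoted attribute": "<note lang=en>Hi</note>",
    "overlapping tags": "<b><i>Hi</b></i>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

The first two samples are accepted even though nothing describes what `note` or `br` mean. The remaining three are rejected purely on syntactic grounds.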

An XML processor that encounters a violation of the well-formedness rules is required to report such errors and to cease normal processing. This policy, occasionally referred to as draconian, [4] stands in notable contrast to the behavior of programs that process HTML, which are designed to produce a reasonable result even in the presence of severe markup errors [5] in the spirit of Postel's law ("Be conservative in what you send; be liberal in what you accept"). [6] [4]
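
The contrast can be sketched in Python by feeding the same broken markup to an XML parser and to the standard library's `html.parser.HTMLParser`. This is only an approximation: `HTMLParser` is a lenient tokenizer, not a browser's error-recovering tree builder, and the `TagCollector` class and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


# Overlapping <b>/<i> elements and an unclosed <p>.
broken = "<p><b>bold <i>both</b> italic</i>"

# The XML parser stops at the first well-formedness error.
try:
    ET.fromstring(broken)
    xml_outcome = "parsed"
except ET.ParseError as err:
    xml_outcome = f"rejected: {err}"

# The HTML tokenizer keeps going and reports what it sees.
collector = TagCollector()
collector.feed(broken)
collector.close()

print("XML :", xml_outcome)
print("HTML:", collector.start_tags)
```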

Importance

The concept of a well-formed document gives a clearer view of how XML is fundamentally constructed, beyond the everyday sense of the term. For example, XML markup is delimited with left and right angle brackets, but the underlying idea of well-formedness does not strictly depend on those particular characters, provided the delimiter is concise. In that sense the angle brackets are a convention, albeit a clear and distinctive one, rather than an absolute requirement of the concept.

The concept of a well-formed document also helps convey the abstract nature of XML. In reality, there is no such thing as XML. [citation needed] Rather, XML is a principle that represents a set of behaviors and practices. It is possible to discuss types of XML, as expressed within a Document Type Definition (DTD).

Well-formed documents also bring into focus the distinction between valid and correct XML. According to the World Wide Web Consortium (W3C), valid documents are those that validate against a DTD. Validity means that a document complies with the constraints stated in the DTD, so its tags and entities must conform to the rules and relationships the DTD establishes. Validity does not, however, guarantee that a tag or entity is used correctly. A first-level heading tag could be applied to a second-level heading and the document would still be valid, though incorrect.

The emphasis on well-formed documents developed within the publishing industry, where the use of angle-bracket-delimited information has become problematic. [citation needed] Emphasis on the well-formed document allows the definition, delimiting, and nesting of content to be managed within programs that are not XML, per se, but that exhibit the characteristics of, or potential for, being well formed.

Validation tools

There are several tools available to determine if a given XML document is well formed.
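
As a rough illustration of what such a tool does, the following Python sketch wraps the standard library's `xml.etree.ElementTree` parser and reports where the first well-formedness error occurs. The function name `check_well_formed` and the sample inputs are hypothetical. Actual tools typically report more detail.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text: str):
    """Return (True, None), or (False, message) locating the first error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position holds the (line, column) of the error.
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"
    return True, None


good = check_well_formed("<a><b/></a>")
bad = check_well_formed("<a>\n  <b>\n</a>")  # <b> is never closed

print(good)
print(bad)
```

Because the parser stops at the first violation, only one error is reported per run. A document with several problems has to be fixed and rechecked repeatedly.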

References

  1. "XML: Document". The UK Web Design Company. Retrieved 11 August 2013. [dead link]
  2. "Well formed XML documents". JCommerce Dev Network. Archived from the original on August 22, 2009.
  3. "There are no exceptions to Postel's Law". Dive into Mark. Internet Archive. Archived from the original on May 10, 2013. Retrieved 11 August 2013.
  4. "Dracon and Postel", 2003/08/19, Tim Bray
  5. "The history of draconian error handling in XML". Dive into Mark. Internet Archive. Archived from the original on August 18, 2013. Retrieved 11 August 2013.
  6. "Postel’s Law Has No Exceptions", August 18, 2003 Aaron Swartz