Document type definition

A document type definition (DTD) is a specification file that contains a set of markup declarations that define a document type for an SGML-family markup language (GML, SGML, XML, HTML). The DTD specification file can be used to validate documents.

A DTD defines the valid building blocks of an XML document. It defines the document structure with a list of validated elements and attributes. A DTD can be declared inline inside an XML document, or as an external reference. [1]

A namespace-aware version of DTDs is being developed as Part 9 of ISO DSDL. DTDs persist in applications that need special publishing characters, such as the XML and HTML Character Entity References, which derive from larger sets defined as part of the ISO SGML standard effort. XML uses a subset of SGML DTD.

As of 2009, newer XML namespace-aware schema languages (such as W3C XML Schema and ISO RELAX NG) have largely superseded DTDs as a better way to validate XML structure.

Associating DTDs with documents

A DTD is associated with an XML or SGML document by means of a document type declaration (DOCTYPE). The DOCTYPE appears in the syntactic fragment doctypedecl near the start of an XML document. [2] The declaration establishes that the document is an instance of the type defined by the referenced DTD.

DOCTYPEs make two sorts of declarations: an optional external subset and an optional internal subset.

The declarations in the internal subset form part of the DOCTYPE in the document itself. The declarations in the external subset are located in a separate text file. The external subset may be referenced via a public identifier and/or a system identifier. Programs for reading documents are not necessarily required to read the external subset.

Any valid SGML or XML document that references an external subset in its DTD, or whose body contains references to parsed external entities declared in its DTD (including those declared within its internal subset), may only be partially parsed but cannot be fully validated by validating SGML or XML parsers in their standalone mode (this means that these validating parsers do not attempt to retrieve these external entities, and their replacement text is not accessible).

However, such documents are still fully parsable in the non-standalone mode of validating parsers, which signal an error if they cannot locate these external entities from their specified public identifier (FPI) or system identifier (a URI), or if the entities are inaccessible. (Notations declared in the DTD also reference external entities, but these unparsed entities are not needed for the validation of documents in the standalone mode of these parsers: the validation of all external entities referenced by notations is left to the application using the SGML or XML parser.) Non-validating parsers may optionally attempt to locate these external entities in the non-standalone mode (by partially interpreting the DTD only to resolve their declared parsable entities), but do not validate the content model of these documents.

Examples

The following example of a DOCTYPE contains both public and system identifiers:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

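All HTML 4.01 documents conform to one of three SGML DTDs. The public identifiers of these DTDs are constant and are as follows:

  "-//W3C//DTD HTML 4.01//EN"
  "-//W3C//DTD HTML 4.01 Transitional//EN"
  "-//W3C//DTD HTML 4.01 Frameset//EN"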

The system identifiers of these DTDs, if present in the DOCTYPE, are URI references. A system identifier usually points to a specific set of declarations in a resolvable location. SGML allows mapping public identifiers to system identifiers in catalogs that are optionally available to the URI resolvers used by document parsing software.

This DOCTYPE can only appear after the optional XML declaration, and before the document body, if the document syntax conforms to XML. This includes XHTML documents:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- the XHTML document body starts here-->
<html xmlns="http://www.w3.org/1999/xhtml">
  ...
</html>

An additional internal subset can also be provided after the external subset:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
  <!-- an internal subset can be embedded here -->
]>
<!-- the XHTML document body starts here-->
<html xmlns="http://www.w3.org/1999/xhtml">
  ...
</html>

Alternatively, only the internal subset may be provided:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
  <!-- an internal subset can be embedded here -->
]>
<!-- the XHTML document body starts here-->
<html xmlns="http://www.w3.org/1999/xhtml">
  ...
</html>

Finally, the document type definition may include no subset at all; in that case, it just specifies that the document has a single top-level element (this is an implicit requirement for all valid XML and HTML documents, but not for document fragments or for all SGML documents, whose top-level elements may be different from the implied root element), and it indicates the type name of the root element:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<!-- the XHTML document body starts here-->
<html xmlns="http://www.w3.org/1999/xhtml">
  ...
</html>
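The parts of a DOCTYPE described above can be inspected programmatically. The following is a minimal sketch, assuming Python and its standard-library xml.parsers.expat module (neither is mentioned in this article); the function name read_doctype and the sample documents are hypothetical and chosen only for illustration. The parser used here does not validate and does not fetch the external subset; it only reports what the declaration says.

```python
from xml.parsers import expat


def read_doctype(xml_text):
    """Return (root name, public id, system id, has internal subset), or None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # expat passes the system identifier before the public identifier.
        found.append((name, public_id, system_id, bool(has_internal_subset)))

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found[0] if found else None


# Hypothetical sample documents mirroring the DOCTYPE forms shown above.
XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body/></html>'
)
INTERNAL_ONLY = '<!DOCTYPE html [ <!-- internal subset --> ]><html/>'
NO_SUBSET = '<!DOCTYPE html><html/>'

for sample in (XHTML, INTERNAL_ONLY, NO_SUBSET):
    print(read_doctype(sample))
```

In this sketch a DOCTYPE without any subset still yields the root element type name, with both identifiers reported as None.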

Markup declarations

DTDs describe the structure of a class of documents via element and attribute-list declarations. Element declarations name the allowable set of elements within the document, and specify whether and how declared elements and runs of character data may be contained within each element. Attribute-list declarations name the allowable set of attributes for each declared element, including the type of each attribute value, if not an explicit set of valid values.

DTD markup declarations declare which element types, attribute lists, entities, and notations are allowed in the structure of the corresponding class of XML documents. [3]

Element type declarations

An element type declaration defines an element and its possible content. A valid XML document contains only elements that are defined in the DTD.

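Various keywords and characters specify an element's content:

EMPTY
the element allows no content;
ANY
the element allows any content (any mix of declared elements and character data);
(#PCDATA)
the element contains parsed character data; in mixed content, #PCDATA comes first in a choice group that ends with "*";
, and |
a comma separates child elements that must appear in the given order, and a "|" pipe character separates alternatives;
?, * and +
a suffix marks the preceding element or group as optional (?), repeatable zero or more times (*), or repeatable one or more times (+).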

For example:

<!ELEMENT html (head, body)>
<!ELEMENT p (#PCDATA | p | ul | dl | table | h1 | h2 | h3)*>

Element type declarations are ignored by non-validating SGML and XML parsers (in which case any elements are accepted in any order, and in any number of occurrences, in the parsed document), but these declarations are still checked for form and validity.

Attribute list declarations

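An attribute list specifies, for a given element type, the list of all possible attributes associated with that type. For each possible attribute, it contains the declared name of the attribute, its data type (or an enumeration of its possible values), and its default value.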

For example:

<!ATTLIST img
  src    CDATA       #REQUIRED
  id     ID          #IMPLIED
  sort   CDATA       #FIXED "true"
  print  (yes | no)  "yes"
>

Here are some attribute types supported by both SGML and XML:

CDATA
this type means character data and indicates that the effective value of the attribute can be any textual value, unless the attribute is specified as fixed (the comments in the DTD may further document values that are effectively accepted, but the DTD syntax does not allow such precise specification);
ID
the effective value of the attribute must be a valid identifier, and it is used to define and anchor to the current element the target of references using this defined identifier (including as document fragment identifiers that may be specified at the end of a URI after a "#" sign); it is an error if distinct elements in the same document define the same identifier; the uniqueness constraint also implies that the identifier itself carries no other semantics and that identifiers must be treated as opaque in applications; XML also predefines the standard pseudo-attribute "xml:id" with this type, without needing any declaration in the DTD, so the uniqueness constraint also applies to these defined identifiers when they are specified anywhere in an XML document.
IDREF or IDREFS
the effective value of the attribute can only be a valid identifier (or a space-separated list of such identifiers) and must be referencing the unique element defined in the document with an attribute declared with the type ID in the DTD (or the unique element defined in an XML document with a pseudo-attribute "xml:id") and whose effective value is the same identifier;
NMTOKEN or NMTOKENS
the effective value of the attribute can only be a valid name token (or a space-separated list of such name tokens), but it is not restricted to a unique identifier within the document; this name may carry supplementary and application-dependent semantics and may require additional naming constraints, but this is out of scope of the DTD;
ENTITY or ENTITIES
the effective value of the attribute can only be the name of an unparsed external entity (or a space-separated list of such names), which must also be declared in the document type declaration; this type is not supported in HTML parsers, but is valid in SGML and XML 1.0 or 1.1 (including XHTML and SVG);
(value1|...)
the effective value of the attribute can only be one of the enumerated list (specified between parentheses and separated by a "|" pipe character) of textual values, where each value in the enumeration is possibly specified between 'single' or "double" quotation marks if it's not a simple name token;
NOTATION (notation1|...)
the effective value of the attribute can only be any one of the enumerated list (specified between parentheses and separated by a "|" pipe character) of notation names, where each notation name in the enumeration must also be declared in the document type declaration; this type is not supported in HTML parsers, but is valid in SGML and XML 1.0 or 1.1 (including XHTML and SVG).

A default value can define whether an attribute must occur (#REQUIRED) or not (#IMPLIED), or whether it has a fixed value (#FIXED), or which value should be used as a default value ("…") in case the given attribute is left out in an XML tag.
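The effect of declared defaults can be observed with a parser that reads the internal subset. The following is a minimal sketch, assuming Python's standard-library xml.parsers.expat module (not mentioned in this article); the function name effective_attributes and the file name photo.png are hypothetical. The document specifies only the src attribute, and the parser reports the #FIXED and default values as if they had been written in the tag.

```python
from xml.parsers import expat

# Internal subset reusing the img attribute list from the example above.
DOC = """<!DOCTYPE img [
  <!ELEMENT img EMPTY>
  <!ATTLIST img
    src    CDATA       #REQUIRED
    id     ID          #IMPLIED
    sort   CDATA       #FIXED "true"
    print  (yes | no)  "yes">
]>
<img src="photo.png"/>"""


def effective_attributes(xml_text):
    """Map each element name to the attributes the parser reports for it."""
    seen = {}

    def on_start(name, attrs):
        # By default expat includes values derived from attribute declarations.
        seen[name] = dict(attrs)

    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return seen


attrs = effective_attributes(DOC)["img"]
print(attrs)
```

The #IMPLIED attribute id has no default, so it is simply absent from the reported attributes when the document omits it.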

Attribute list declarations are ignored by non-validating SGML and XML parsers (in which case any attribute is accepted within all elements of the parsed document), but these declarations are still checked for well-formedness and validity.

Entity declarations

An entity is similar to a macro. The entity declaration assigns it a value that is retained throughout the document. A common use is to have a name more recognizable than a numeric character reference for an unfamiliar character. [5] Entities help to improve legibility of an XML text. In general, there are two types: internal and external.

An example of internal entity declarations (here in an internal DTD subset of an SGML document) is:

<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY % std       "standard SGML">
  <!ENTITY % signature " &#x2014; &author;.">
  <!ENTITY % question  "Why couldn&#x2019;t I publish my books directly in %std;?">
  <!ENTITY % author    "William Shakespeare">
]>
<sgml>&question;&signature;</sgml>

Internal entities may be defined in any order, as long as they are not referenced and parsed in the DTD or in the body of the document, in their order of parsing: it is valid to include a reference to a still undefined entity within the content of a parsed entity, but it is invalid to include anywhere else any named entity reference before this entity has been fully defined, including all other internal entities referenced in its defined content (this also prevents circular or recursive definitions of internal entities). This document is parsed as if it was:

<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY % std       "standard SGML">
  <!ENTITY % signature " — &author;.">
  <!ENTITY % question  "Why couldn’t I publish my books directly in standard SGML?">
  <!ENTITY % author    "William Shakespeare">
]>
<sgml>Why couldn’t I publish my books directly in standard SGML? — William Shakespeare.</sgml>

The reference to the "author" internal entity is not substituted in the replacement text of the "signature" internal entity. Instead, it is replaced only when the "signature" entity reference is parsed within the content of the "sgml" element. In XML, this substitution does not require validation: a non-validating parser that has read the declarations in the internal subset also replaces these references in element content and in attribute values, although it may skip entities declared in external entities that it does not read.

This is possible because the replacement text specified in the internal entity definitions permits a distinction between parameter entity references (that are introduced by the "%" character and whose replacement applies to the parsed DTD contents) and general entity references (that are introduced by the "&" character and whose replacement is delayed until they are effectively parsed and validated). The "%" character for introducing parameter entity references in the DTD loses its special role outside the DTD and it becomes a literal character.

However, the references to predefined character entities are substituted wherever they occur, without needing a validating parser (they are only introduced by the "&" character).
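The delayed substitution of general entity references can be tried out on a small XML document. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree module (not mentioned in this article), which does not validate but does read the internal subset; the root element name doc is hypothetical. The "signature" entity is declared before "author", yet the nested reference resolves because it is only expanded when "signature" is used in the document body.

```python
import xml.etree.ElementTree as ET

# General entities only: XML does not allow parameter-entity references
# inside declarations in the internal subset, so this sketch avoids them.
DOC = """<!DOCTYPE doc [
  <!ELEMENT doc ANY>
  <!ENTITY signature " &#x2014; &author;.">
  <!ENTITY author "William Shakespeare">
]>
<doc>&signature;&lt;end&gt;</doc>"""

root = ET.fromstring(DOC)

# The character reference, the nested entity and the predefined
# entities (&lt; and &gt;) have all been replaced in the element text.
print(repr(root.text))
```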

Notation declarations

Notations are used in SGML or XML. They provide a complete reference to unparsed external entities whose interpretation is left to the application (which interprets them directly or retrieves the external entity itself), by assigning them a simple name, which is usable in the body of the document. For example, notations may be used to reference non-XML data in an XML 1.1 document, or to annotate SVG images to associate them with a specific renderer:

<!NOTATION type-image-svg SYSTEM "image/svg">

This declares the MIME type of external images with this type, and associates it with a notation name "type-image-svg". However, notation names usually follow a naming convention that is specific to the application generating or using the notation: notations are interpreted as additional meta-data whose effective content is an external entity and either a PUBLIC FPI, registered in the catalogs used by XML or SGML parsers, or a SYSTEM URI, whose interpretation is application dependent (here a MIME type, interpreted as a relative URI, but it could be an absolute URI to a specific renderer, or a URN indicating an OS-specific object identifier such as a UUID).

The declared notation name must be unique within the whole document type declaration, i.e. in the external subset as well as the internal subset, at least for conformance with XML. [6] [7]

Notations can be associated to unparsed external entities included in the body of the SGML or XML document. The PUBLIC or SYSTEM parameter of these external entities specifies the FPI and/or the URI where the unparsed data of the external entity is located, and the additional NDATA parameter of these defined entities specifies the additional notation (i.e., effectively the MIME type here). For example:

<!DOCTYPE sgml [
  <!ELEMENT sgml (img)*>
  <!ELEMENT img EMPTY>
  <!ATTLIST img data ENTITY #IMPLIED>
  <!ENTITY example1SVG SYSTEM "example1.svg" NDATA example1SVG-rdf>
  <!NOTATION example1SVG-rdf SYSTEM "example1.svg.rdf">
]>
<sgml>
  <img data="example1SVG" />
</sgml>
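How a parser hands notation and unparsed-entity declarations to the application, without retrieving anything itself, can be illustrated as follows. This is a minimal sketch, assuming Python's standard-library xml.parsers.expat module (not mentioned in this article); the function name collect_declarations is hypothetical, and the notation and entity names are borrowed from this article's examples. The parser only reports the declared names and identifiers; interpreting "image/gif" or opening example1.gif is left to the caller.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE sgml [
  <!ELEMENT sgml (img)*>
  <!ELEMENT img EMPTY>
  <!ATTLIST img data ENTITY #IMPLIED>
  <!NOTATION type-image-gif PUBLIC "image/gif">
  <!ENTITY example1GIF SYSTEM "example1.gif" NDATA type-image-gif>
]>
<sgml><img data="example1GIF"/></sgml>"""


def collect_declarations(xml_text):
    """Return (notations, unparsed entities) declared in the document."""
    notations = {}
    unparsed = {}

    def on_notation(name, base, system_id, public_id):
        notations[name] = (public_id, system_id)

    def on_entity(name, is_parameter, value, base,
                  system_id, public_id, notation_name):
        # Only unparsed external entities carry a notation name (NDATA).
        if notation_name is not None:
            unparsed[name] = (system_id, notation_name)

    parser = expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.EntityDeclHandler = on_entity
    parser.Parse(xml_text, True)
    return notations, unparsed


notations, unparsed = collect_declarations(DOC)
print(notations)
print(unparsed)
```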

Within the body of the SGML document, these referenced external entities (whose name is specified between "&" and ";") are not replaced like usual named entities (defined with a CDATA value), but are left as distinct unparsed tokens that may be used either as the value of an element attribute (like above) or within the element contents, provided that either the DTD allows such external entities in the declared content type of elements or in the declared type of attributes (here the ENTITY type for the data attribute), or the SGML parser is not validating the content.

Notations may also be associated directly to elements as additional meta-data, without associating them to another external entity, by giving their names as possible values of some additional attributes (also declared in the DTD within the <!ATTLIST...> declaration of the element). For example:

<!DOCTYPE sgml [
  <!ELEMENT sgml (img)*>
  <!--
    the optional "type" attribute value can only be set to this notation.
  -->
  <!ATTLIST sgml
    type  NOTATION (type-vendor-specific) #IMPLIED>

  <!ELEMENT img ANY> <!-- optional content can be only parsable SGML or XML data -->
  <!--
    The optional "title" attribute value must be parsable as text.
    The optional "data" attribute value is set to an unparsed external entity.
    The optional "type" attribute value can only be one of the two notations.
  -->
  <!ATTLIST img
    title CDATA    #IMPLIED
    data  ENTITY   #IMPLIED
    type  NOTATION (type-image-svg | type-image-gif) #IMPLIED>

  <!--
    Notations are referencing external entities and may be set in the "type" attributes above,
    or must be referenced by any defined external entities that cannot be parsed.
  -->
  <!NOTATION type-image-svg       PUBLIC "-//W3C//DTD SVG 1.1//EN"
    "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
  <!NOTATION type-image-gif       PUBLIC "image/gif">
  <!NOTATION type-vendor-specific PUBLIC "application/VND.specific+sgml">

  <!ENTITY example1SVGTitle "Title of example1.svg"> <!-- parsed internal entity -->
  <!ENTITY example1SVG      SYSTEM "example1.svg"> <!-- parsed external entity -->
  <!ENTITY example1GIFTitle "Title of example1.gif"> <!-- parsed internal entity -->
  <!ENTITY example1GIF      SYSTEM "example1.gif" NDATA type-image-gif> <!-- unparsed external entity -->
]>
<sgml type="type-vendor-specific">
  <!-- an SVG image is parsable as valid SGML or XML text -->
  <img title="&example1SVGTitle;" type="type-image-svg">&example1SVG;</img>

  <!-- it can also be referenced as an unparsed external entity -->
  <img title="&example1SVGTitle;" data="example1SVG" />

  <!-- a GIF image is not parsable and can only be referenced as an external entity -->
  <img title="&example1GIFTitle;" data="example1GIF" />
</sgml>

The example above shows a notation named "type-image-svg" that references the standard public FPI and the system identifier (the standard URI) of an SVG 1.1 document, instead of specifying just a system identifier as in the first example (which was a relative URI interpreted locally as a MIME type). This notation is referenced directly within the unparsed "type" attribute of the "img" element, but its content is not retrieved. The example also declares another notation for a vendor-specific application, to annotate the "sgml" root element in the document. In both cases, the declared notation name is used directly in a declared "type" attribute, whose content is specified in the DTD with the "NOTATION" attribute type (this "type" attribute is declared for the "sgml" element, as well as for the "img" element).

However, the "title" attribute of the "img" element specifies the internal entity "example1SVGTitle", whose declaration does not define a notation, so it is parsed by validating parsers and the entity replacement text is "Title of example1.svg".

The content of the "img" element references another external entity "example1SVG" whose declaration also does not define a notation, so it is also parsed by validating parsers and the entity replacement text is located by its defined SYSTEM identifier "example1.svg" (also interpreted as a relative URI). The effective content for the "img" element is the content of this second external resource. The difference with the GIF image is that the SVG image is parsed within the SGML document, according to the declarations in the DTD, whereas the GIF image is just referenced as an opaque external object (which is not parsable with SGML) via its "data" attribute (whose value type is an opaque ENTITY).

Only one notation name may be specified in the value of ENTITY attributes (there is no support in SGML, XML 1.0 or XML 1.1 for multiple notation names in the same declared external ENTITY, so separate attributes are needed). However, multiple external entities may be referenced (in a space-separated list of names) in attributes declared with type ENTITIES, where each named external entity is also declared with its own notation.

Notations are also completely opaque for XML and SGML parsers, so they are not differentiated by the type of the external entity that they may reference (for these parsers they just have a unique name associated to a public identifier (an FPI) and/or a system identifier (a URI)).

Some applications (but not XML or SGML parsers themselves) also allow referencing notations indirectly by naming them in the "URN:name" value of a standard CDATA attribute, everywhere a URI can be specified. However, this behaviour is application-specific, and requires that the application maintains a catalog of known URNs to resolve them into the notations that have been parsed in a standard SGML or XML parser. This use allows notations to be defined only in a DTD stored as an external entity and referenced only as the external subset of documents, and allows these documents to remain compatible with validating XML or SGML parsers that have no direct support for notations.

Notations are not used in HTML, or in basic profiles for XHTML and SVG, because:

Even in validating SGML or XML 1.0 or XML 1.1 parsers, the external entities referenced by an FPI and/or URI in declared notations are not retrieved automatically by the parsers themselves. Instead, these parsers just provide the application with the parsed FPI and/or URI associated with the notations found in the parsed SGML or XML document, together with a dictionary containing all notation names declared in the DTD; these validating parsers also check the uniqueness of notation name declarations, and report a validation error if some notation names are used anywhere in the DTD or in the document body but not declared.

XML DTDs and schema validation

The XML DTD syntax is one of several XML schema languages. However, many of the schema languages do not fully replace the XML DTD. Notably, the XML DTD allows defining entities and notations that have no direct equivalents in DTD-less XML (because internal entities and parsable external entities are not part of XML schema languages, and because other unparsed external entities and notations have no simple equivalent mappings in most XML schema languages).

Most XML schema languages are only replacements for element declarations and attribute list declarations, in such a way that it becomes possible to parse XML documents with non-validating XML parsers (if the only purpose of the external DTD subset was to define the schema). In addition, documents for these XML schema languages must be parsed separately, so validating the schema of XML documents in pure standalone mode is not really possible with these languages: the document type declaration remains necessary for at least identifying (with an XML Catalog) the schema used in the parsed XML document and that is validated in another language.

A common misconception holds that a non-validating XML parser does not have to read document type declarations, when in fact, the document type declarations must still be scanned for correct syntax as well as validity of declarations, and the parser must still parse all entity declarations in the internal subset, and substitute the replacement texts of internal entities occurring anywhere in the document type declaration or in the document body.

A non-validating parser may, however, elect not to read parsable external entities (including the external subset), and does not have to honor the content model restrictions defined in element declarations and in attribute list declarations.
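The two behaviours described above (declarations are still checked for syntax, but content models are not enforced) can be observed with a non-validating parser. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree module (not mentioned in this article); the element name stranger and the helper name is_accepted are hypothetical.

```python
import xml.etree.ElementTree as ET

# The content model only allows person children, yet the body uses an
# undeclared element: a validating parser would reject this document.
VIOLATES_MODEL = """<!DOCTYPE people_list [
  <!ELEMENT people_list (person)*>
  <!ELEMENT person (#PCDATA)>
]>
<people_list><stranger/></people_list>"""

# The element declaration lacks a content specification, so the
# document type declaration itself is syntactically broken.
BROKEN_DECLARATION = "<!DOCTYPE a [ <!ELEMENT a> ]><a/>"


def is_accepted(xml_text):
    """Report whether the non-validating parser accepts the document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


child_tags = [child.tag for child in ET.fromstring(VIOLATES_MODEL)]
print(child_tags)
print(is_accepted(VIOLATES_MODEL), is_accepted(BROKEN_DECLARATION))
```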

If the XML document depends on parsable external entities (including the specified external subset, or parsable external entities declared in the internal subset), it should assert standalone="no" in its XML declaration. The validating DTD may be identified by using XML Catalogs to retrieve its specified external subset.

In the example below, the XML document is declared with standalone="no" because it has an external subset in its document type declaration:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list />
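An application can read the standalone assertion before deciding whether the external subset needs to be retrieved. The following is a minimal sketch, assuming Python's standard-library xml.parsers.expat module (not mentioned in this article); the function name standalone_flag is hypothetical. The parser reports 1 for standalone="yes", 0 for standalone="no", and -1 when the clause is omitted; the sketch returns None when there is no XML declaration at all.

```python
from xml.parsers import expat


def standalone_flag(xml_bytes):
    """Return 1, 0 or -1 as reported by expat, or None without an XML declaration."""
    result = []

    def on_xml_decl(version, encoding, standalone):
        result.append(standalone)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(xml_bytes, True)
    return result[0] if result else None


EXTERNAL = (b'<?xml version="1.0" encoding="UTF-8" standalone="no"?>'
            b'<!DOCTYPE people_list SYSTEM "example.dtd"><people_list/>')
STANDALONE = (b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
              b'<people_list/>')
OMITTED = b'<?xml version="1.0" encoding="UTF-8"?><people_list/>'

for sample in (EXTERNAL, STANDALONE, OMITTED):
    print(standalone_flag(sample))
```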

If the XML document type declaration includes any SYSTEM identifier for the external subset, it cannot be safely processed as standalone: the URI should be retrieved, otherwise there may be unknown named character entities whose definition may be needed to correctly parse the effective XML syntax in the internal subset or in the document body (the XML syntax parsing is normally performed after the substitution of all named entities, excluding the five entities that are predefined in XML and that are implicitly substituted after parsing the XML document into lexical tokens). If it includes only a PUBLIC identifier, it may be processed as standalone, provided the XML processor knows this PUBLIC identifier in its local catalog, from which it can retrieve an associated DTD entity.

XML DTD schema example

An example of a very simple external XML DTD to describe the schema of a list of persons might consist of:

<!ELEMENT people_list (person)*>
<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT birthdate (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT socialsecuritynumber (#PCDATA)>

Taking this line by line:

  1. people_list is a valid element name, and an instance of such an element contains any number of person elements. The * denotes there can be 0 or more person elements within the people_list element.
  2. person is a valid element name, and an instance of such an element contains one element named name, followed by one named birthdate (optional), then gender (also optional) and socialsecuritynumber (also optional). The ? indicates that an element is optional. The reference to the name element name has no ?, so a person element must contain a name element.
  3. name is a valid element name, and an instance of such an element contains "parsed character data" (#PCDATA).
  4. birthdate is a valid element name, and an instance of such an element contains parsed character data.
  5. gender is a valid element name, and an instance of such an element contains parsed character data.
  6. socialsecuritynumber is a valid element name, and an instance of such an element contains parsed character data.
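The line-by-line reading above can also be recovered mechanically from the declarations. The following is a minimal sketch, assuming Python's standard-library xml.parsers.expat module (not mentioned in this article), which reports each content model as a nested tuple of type, quantifier, name and children; the function name child_requirements is hypothetical, and the DTD is placed in an internal subset so that the sketch needs no external file.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE people_list [
  <!ELEMENT people_list (person)*>
  <!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT birthdate (#PCDATA)>
  <!ELEMENT gender (#PCDATA)>
  <!ELEMENT socialsecuritynumber (#PCDATA)>
]>
<people_list/>"""


def child_requirements(xml_text, element):
    """List (child name, is required) pairs from an element's declaration."""
    models = {}

    def on_element_decl(name, model):
        models[name] = model

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)

    _type, _quantifier, _name, children = models[element]
    optional = expat.model.XML_CQUANT_OPT  # the "?" suffix
    return [(child[2], child[1] != optional) for child in children]


requirements = child_requirements(DOC, "person")
for name, required in requirements:
    print(name, "required" if required else "optional")
```

Only the "?" quantifier is examined here; a fuller reading of a content model would also handle "*", "+", choices and nested groups.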

An example of an XML file that uses and conforms to this DTD follows. The DTD is referenced here as an external subset, via the SYSTEM specifier and a URI. It assumes that we can identify the DTD with the relative URI reference "example.dtd"; the "people_list" after "!DOCTYPE" tells us that the root element, which is also the first element defined in the DTD, is called "people_list":

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list>
  <person>
    <name>Fred Bloggs</name>
    <birthdate>2008-11-27</birthdate>
    <gender>Male</gender>
  </person>
</people_list>

One can render this in an XML-enabled browser (such as Internet Explorer or Mozilla Firefox) by pasting and saving the DTD component above to a text file named example.dtd and the XML file to a differently-named text file, and opening the XML file with the browser. The files should both be saved in the same directory. However, many browsers do not check that an XML document conforms to the rules in the DTD; they are only required to check that the DTD is syntactically correct. For security reasons, they may also choose not to read the external DTD.

The same DTD can also be embedded directly in the XML document itself as an internal subset, by encasing it within [square brackets] in the document type declaration, in which case the document no longer depends on external entities and can be processed in standalone mode:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE people_list [
  <!ELEMENT people_list (person*)>
  <!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT birthdate (#PCDATA)>
  <!ELEMENT gender (#PCDATA)>
  <!ELEMENT socialsecuritynumber (#PCDATA)>
]>
<people_list>
  <person>
    <name>Fred Bloggs</name>
    <birthdate>2008-11-27</birthdate>
    <gender>Male</gender>
  </person>
</people_list>

Alternatives

Alternatives to DTDs (for specifying schemas) are available, including the namespace-aware schema languages W3C XML Schema and ISO RELAX NG.

Security

An XML DTD can be used to create a denial of service (DoS) attack by defining nested entities that expand exponentially, or by sending the XML parser to an external resource that never returns. [10]

For this reason, .NET Framework provides a property that allows prohibiting or skipping DTD parsing, [10] and recent versions of Microsoft Office applications (Microsoft Office 2010 and higher) refuse to open XML files that contain DTD declarations.
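The growth behind such an attack, and one way to refuse DTDs altogether, can be sketched on a deliberately tiny document. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree and xml.parsers.expat modules (neither is mentioned in this article); the names DTDForbidden and text_without_dtd are hypothetical, and the approach is only similar in spirit to the .NET setting mentioned above, not the same API. Each entity level here triples the previous one, so three levels already yield nine copies; a real attack uses more levels and a larger fan-out.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

# A harmless miniature of nested entity expansion (3 x 3 = 9 copies).
SMALL_NESTING = """<!DOCTYPE lolz [
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;">
]>
<lolz>&c;</lolz>"""

expanded = ET.fromstring(SMALL_NESTING).text
print(len(expanded) // len("ha"))


class DTDForbidden(ValueError):
    """Raised by this sketch when a document carries a DOCTYPE."""


def text_without_dtd(xml_text):
    """Return the character data of a document, refusing any DOCTYPE."""
    chunks = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Raising inside the handler aborts parsing before any
        # entity declared in the DTD can be expanded.
        raise DTDForbidden("document type declarations are not accepted")

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return "".join(chunks)


print(text_without_dtd("<note>plain document</note>"))
try:
    text_without_dtd(SMALL_NESTING)
except DTDForbidden as error:
    print("rejected:", error)
```

Rejecting every DOCTYPE is a blunt policy: it also turns away documents that rely on declared entities or default attribute values, so it suits inputs that are not expected to carry a DTD.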

See also

Related Research Articles

<span class="mw-page-title-main">HTML</span> HyperText Markup Language

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

DocBook is a semantic markup language for technical documentation. It was originally intended for writing technical documents related to computer hardware and software, but it can be used for any other sort of documentation.

Mathematical Markup Language (MathML) is a mathematical markup language, an application of XML for describing mathematical notations and capturing both its structure and content, and is one of a number of mathematical markup languages. Its aim is to natively integrate mathematical formulae into World Wide Web pages and other documents. It is part of HTML5 and standardised by ISO/IEC since 2015.

XSD, a recommendation of the World Wide Web Consortium (W3C), specifies how to formally describe the elements in an Extensible Markup Language (XML) document. It can be used by programmers to verify each piece of item content in a document, to assure it adheres to the description of the element it is placed in.

An HTML element is a type of HTML document component, one of several types of HTML nodes. The first used version of HTML was written by Tim Berners-Lee in 1993 and there have since been many versions of HTML. The current de facto standard is governed by the industry group WHATWG and is known as the HTML Living Standard.

In web development, "tag soup" is a pejorative for syntactically or structurally incorrect HTML written for a web page. Because web browsers have historically treated structural or syntax errors in HTML leniently, there has been little pressure for web developers to follow published standards, and therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.

An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. These constraints are generally expressed using some combination of grammatical rules governing the order of elements, Boolean predicates that the content must satisfy, data types governing the content of elements and attributes, and more specialized rules such as uniqueness and referential integrity constraints.

In the Standard Generalized Markup Language (SGML), an entity is a primitive data type, which associates a string with either a unique alias or an SGML reserved word. Entities are foundational to the organizational structure and definition of SGML documents. The SGML specification defines numerous entity types, which are distinguished by keyword qualifiers and context. An entity string value may variously consist of plain text, SGML tags, and/or references to previously defined entities. Certain entity types may also invoke external documents. Entities are called by reference.

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.
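In XML, one of these uses is the CDATA section, which lets markup characters such as `<` and `&` appear literally. The following sketch, using Python's standard `xml.etree.ElementTree` and a made-up `code` element, shows that the parser returns the section's content as plain character data:

```python
import xml.etree.ElementTree as ET

# Hypothetical element whose content would be ill-formed without the CDATA section.
DOC = "<code><![CDATA[if (a < b && b < c) { return 1; }]]></code>"


def cdata_text(xml_text):
    """Return the root element's text; CDATA content arrives unescaped."""
    return ET.fromstring(xml_text).text


print(cdata_text(DOC))
```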

RDFa or Resource Description Framework in Attributes is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The Resource Description Framework (RDF) data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.


XHTML Mobile Profile is an obsolete hypertextual computer language designed specifically for mobile phones and other resource-constrained devices.

A Formal Public Identifier (FPI) is a short piece of text with a particular structure that may be used to uniquely identify a product, specification or document. FPIs were introduced as part of Standard Generalized Markup Language (SGML), and serve particular purposes in formats historically derived from SGML. Some of their most common uses are as part of document type declarations (DOCTYPEs) and document type definitions (DTDs) in SGML, XML and historically HTML, but they are also used in the vCard and iCalendar file formats to identify the software product which generated the file.
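The structure of a typical FPI can be sketched in a few lines of Python. The example assumes the common four-field form used in HTML and XHTML DOCTYPEs (registration indicator, owner, text class with description, language, separated by `//`); other forms, such as ISO-owned identifiers without a leading `+` or `-`, are deliberately rejected. The names `FPI` and `parse_fpi` are illustrative, not part of any standard library.

```python
from typing import NamedTuple


class FPI(NamedTuple):
    registered: bool   # True for "+", False for "-"
    owner: str
    text_class: str    # e.g. "DTD" or "ENTITIES"
    description: str
    language: str


def parse_fpi(fpi):
    """Split a four-field FPI into its parts; raise ValueError on other shapes."""
    parts = fpi.split("//")
    if len(parts) != 4 or parts[0] not in ("+", "-"):
        raise ValueError(f"unsupported FPI form: {fpi!r}")
    registration, owner, text, language = parts
    # The text class is the first word; the rest is the description.
    text_class, _, description = text.partition(" ")
    return FPI(registration == "+", owner, text_class, description, language)


print(parse_fpi("-//W3C//DTD XHTML 1.0 Strict//EN"))
```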

XML documents typically refer to external entities, for example the public and/or system ID for the Document Type Definition. These external relationships are expressed using URIs, typically as URLs.

Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages which mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated.

XHTML+RDFa is an extended version of the XHTML markup language that supports RDF through a collection of attributes and processing rules in the form of well-formed XML documents. XHTML+RDFa is one of the techniques used to develop Semantic Web content by embedding rich semantic markup. Version 1.1 of the language is a superset of XHTML 1.1, integrating the attributes according to RDFa Core 1.1. In other words, it provides RDFa support through XHTML Modularization.

A document type declaration, or DOCTYPE, is an instruction that associates a particular XML or SGML document with a document type definition (DTD). In the serialized form of the document, it manifests as a short string of markup that conforms to a particular syntax.
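To make the syntax concrete, the sketch below uses the `StartDoctypeDeclHandler` callback of Python's standard `xml.parsers.expat` module to read the pieces of a DOCTYPE: the document type name, the public identifier, and the system identifier. The helper name `read_doctype` is hypothetical, and the sample document uses the XHTML 1.0 Strict DOCTYPE. The parser is not asked to fetch the external DTD here.

```python
import xml.parsers.expat

DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head><body/></html>"
)


def read_doctype(xml_text):
    """Return the DOCTYPE fields of xml_text, or an empty dict if it has none."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(
            name=name,
            public_id=public_id,
            system_id=system_id,
            has_internal_subset=bool(has_internal_subset),
        )

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


print(read_doctype(DOC))
```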

XML External Entity attack, or simply XXE attack, is a type of attack against an application that parses XML input. This attack occurs when XML input containing a reference to an external entity is processed by a weakly configured XML parser. This attack may lead to the disclosure of confidential data, DoS attacks, server-side request forgery, port scanning from the perspective of the machine where the parser is located, and other system impacts.
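One common mitigation is to refuse any input that carries a DOCTYPE at all, so that no entity declaration is ever processed. The following is a minimal sketch of that idea using Python's standard `xml.parsers.expat`; the exception class `DTDForbidden`, the function `parse_without_dtd`, and the sample documents are assumptions made for this example, and a production system would more likely rely on a hardened parser configuration or a dedicated library.

```python
import xml.parsers.expat


class DTDForbidden(ValueError):
    """Raised when the input contains a document type declaration."""


def parse_without_dtd(xml_text):
    """Parse xml_text and return its element names; reject any DOCTYPE."""
    names = []

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        # Abort before any entity declared in the DTD can be seen or expanded.
        raise DTDForbidden(f"DOCTYPE {name!r} is not allowed")

    def on_start(name, attrs):
        names.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = reject_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return names


SAFE = "<order><item>book</item></order>"

# Hypothetical hostile input that declares an external entity.
HOSTILE = """<!DOCTYPE order [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<order><item>&xxe;</item></order>"""

print(parse_without_dtd(SAFE))
```

Calling `parse_without_dtd(HOSTILE)` raises `DTDForbidden` as soon as the parser reaches the DOCTYPE, before the `&xxe;` reference is encountered. The trade-off is that legitimate documents with a DTD are rejected as well.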

References

  1. "Introduction to DTD".
  2. "doctypedecl". Extensible Markup Language (XML) 1.1. W3C.
  3. Watt, Andrew H. (2002). Sams teach yourself XML in 10 minutes. Sams Publishing. ISBN 9780672324710.
  4. Attribute-list Declaration, Specifications of Extensible Markup Language (XML) 1.1, W3C.
  5. "DTD Entities". DTD Tutorial. W3Schools.
  6. Notation Declarations, Specifications of Extensible Markup Language (XML) 1.0, W3C.
  7. Notation Declarations, Specifications of Extensible Markup Language (XML) 1.1, W3C.
  8. "XML Schema Part 1: Structures (Second Edition)". W3C. 2004. Retrieved 2022-01-02.
  9. "ISO/IEC 19757-2:2008 - Information technology -- Document Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based validation -- RELAX NG". ISO. Retrieved 2011-05-17.
  10. Bryan Sullivan (November 2009). "XML Denial of Service Attacks and Defenses". MSDN Magazine. Retrieved 2013-10-21.