CDATA

Last updated

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

Contents

CDATA sections in XML

In an XML document or external entity, a CDATA section is a piece of element content that is marked up to be interpreted literally, as textual data, not as marked up content. [1] A CDATA section is merely an alternative syntax for expressing character data; there is no semantic difference between character data in a CDATA section and character data in standard syntax where, for example, "<" and "&" are represented by "&lt;" and "&amp;", respectively.

Syntax and interpretation

A CDATA section starts with the following sequence:

<![CDATA[ 

and ends with the next occurrence of the sequence:

]]> 

All characters enclosed between these two sequences are interpreted as characters, not markup or entity references. Every character is taken literally, the only exception being the ]]> sequence of characters. In:

<sender>JohnSmith</sender>

the start and end "sender" tags are interpreted as markup. However, the code:

<![CDATA[<sender>John Smith</sender>]]>

is equivalent to:

&lt;sender&gt;JohnSmith&lt;/sender&gt;

Thus, the "tags" will have exactly the same status as the "John Smith"; they will be treated as text.

Similarly, if the numeric character reference &#240; appears in element content, it will be interpreted as the single Unicode character 00F0 (small letter eth). But if the same appears in a CDATA section, it will be parsed as six characters: ampersand, hash mark, digit 2, digit 4, digit 0, semicolon.

Uses of CDATA sections

New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to "protect" data from being treated as ordinary character data during processing. Some APIs for working with XML documents do offer options for independent access to CDATA sections, but such options exist above and beyond the normal requirements of XML processing systems, and still do not change the implicit meaning of the data. Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup. CDATA sections are useful for writing XML code as text data within an XML document. For example, if one wishes to typeset a book with XSL explaining the use of an XML application, the XML markup to appear in the book itself will be written in the source file in a CDATA section.

Nesting

A CDATA section cannot contain the string "]]>" and therefore it is not possible for a CDATA section to contain nested CDATA sections. The preferred approach to using CDATA sections for encoding text that contains the triad "]]>" is to use multiple CDATA sections by splitting each occurrence of the triad just before the ">". For example, to encode "]]>" one would write:

<![CDATA[]]]]><![CDATA[>]]>

This means that to encode "]]>" in the middle of a CDATA section, replace all occurrences of "]]>" with the following:

]]]]><![CDATA[>

This effectively stops and restarts the CDATA section.

Issues with encoding

In text data, any Unicode character not available in the encoding declared in the <?xml ...?> header can be represented using a &#nnn; numerical character reference. But the text within a CDATA section is strictly limited to the characters available in the encoding.

Because of this, using a CDATA section programmatically to quote data that could potentially contain '&' or '<' characters can cause problems when the data happens to contain characters that cannot be represented in the encoding. Depending on the implementation of the encoder, these characters can get lost, can get converted to the characters of the &#nnn; character reference, or can cause the encoding to fail. But they will not be maintained.

Another issue is that an XML document can be transcoded from one encoding to another during transport. When the XML document is converted to a more limited character set, such as ASCII, characters that can no longer be represented are converted to &#nnn; character references for a lossless conversion. But within a CDATA section, these characters can not be represented at all, and have to be removed or converted to some equivalent, altering the content of the CDATA section.

Use of CDATA in program output

CDATA sections in XHTML documents are liable to be parsed differently by web browsers if they render the document as HTML, since HTML parsers do not recognise the CDATA start and end markers, nor do they recognise HTML entity references such as &lt; within <script> tags. This can cause rendering problems in web browsers and can lead to cross-site scripting vulnerabilities if used to display data from untrusted sources, since the two kinds of parser will disagree on where the CDATA section ends.

Since it is useful to be able to use less-than signs (<) and ampersands (&) in web page scripts, and to a lesser extent styles, without having to remember to escape them, it is common to use CDATA markers around the text of inline <script> and <style> elements in XHTML documents. But so that the document can also be parsed by HTML parsers, which do not recognise the CDATA markers, the CDATA markers are usually commented-out, as in this JavaScript example:

<scripttype="text/javascript">//<![CDATA[document.write("<");//]]></script>

or this CSS example:

<styletype="text/css">/*<![CDATA[*/body{background-image:url("marble.png?width=300&height=300")}/*]]>*/</style>

This technique is only necessary when using inline scripts and stylesheets, and is language-specific. CSS stylesheets, for example, only support the second style of commenting-out (/* … */), but CSS also has less need for the < and & characters than JavaScript and so less need for explicit CDATA markers.

CDATA in DTDs

CDATA-type attribute value

In Document Type Definition (DTD) files for SGML and XML, an attribute value may be designated as being of type CDATA: arbitrary character data. Within a CDATA-type attribute, character and entity reference markup is allowed and will be processed when the document is read.

For example, if an XML DTD contains

<!ATTLISTfooaCDATA#IMPLIED>

it means that elements named foo may optionally have an attribute named "a" which is of type CDATA. In an XML document that is valid according to this DTD, an element like this might appear:

<fooa="1 &amp; 2 are &lt; &#51; &#x0A;"/>

and an XML parser would interpret the "a" attribute's value as being the character data "1 & 2 are < 3".

CDATA-type entity

An SGML or XML DTD may also include entity declarations in which the token CDATA is used to indicate that entity consists of character data. The character data may appear within the declaration itself or may be available externally, referenced by a URI. In either case, character reference and parameter entity reference markup is allowed in the entity, and will be processed as such when it is read.

<DISPLAY_NAMEAttribute="Y"><![CDATA[PFTEST0__COUNTER_6__:4:199:, PFTEST0__COUNTER_7__:4:199:]]></DISPLAY_NAME><SVLOBJECT><LONGname=""val=""INTEGERname=""val=""LONGname=""val=""/></SVLOBJECT>

See also

Related Research Articles

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

A document type definition (DTD) is a set of markup declarations that define a document type for an SGML-family markup language.

<span class="mw-page-title-main">HTML</span> HyperText Markup Language

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

Mathematical Markup Language (MathML) is a mathematical markup language, an application of XML for describing mathematical notations and capturing both its structure and content, and is one of a number of mathematical markup languages. Its aim is to natively integrate mathematical formulae into World Wide Web pages and other documents. It is part of HTML5 and is a ISO/IEC standard ISO/IEC 40314 since 2015.

An HTML element is a type of HTML document component, one of several types of HTML nodes. The first used version of HTML was written by Tim Berners-Lee in 1993 and there have since been many versions of HTML. The most commonly used version is HTML 4.01, which became official standard in December 1999. An HTML document is composed of a tree of simple HTML nodes, such as text nodes, and HTML elements, which add semantics and formatting to parts of document. Each element can have HTML attributes specified. Elements can also have content, including other elements and text.

YAML is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax which intentionally differs from Standard Generalized Markup Language (SGML). It uses both Python-style indentation to indicate nesting, and a more compact format that uses [...] for lists and {...} for maps but forbids tab characters to use as indentation thus only some JSON files are valid YAML 1.2.

In web page design, and generally for all markup languages such as SGML, HTML, and XML, a well-formed element is one that is either a) opened and subsequently closed, or b) an empty element, which in that case must be terminated; and in either case which is properly nested so that it does not overlap with other elements.

In web development, "tag soup" is a pejorative for syntactically or structurally incorrect HTML written for a web page. Because web browsers have historically treated structural or syntax errors in HTML leniently, there has been little pressure for web developers to follow published standards, and therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.

A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSgml, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used in order to represent characters that are not directly encodable in a particular document. When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.

<span class="mw-page-title-main">Node (computer science)</span> Basic unit of a data structure

A node is a basic unit of a data structure, such as a linked list or tree data structure. Nodes contain data and also may link to other nodes. Links between nodes are often implemented by pointers.

In the Standard Generalized Markup Language (SGML), an entity is a primitive data type, which associates a string with either a unique alias or an SGML reserved word. Entities are foundational to the organizational structure and definition of SGML documents. The SGML specification defines numerous entity types, which are distinguished by keyword qualifiers and context. An entity string value may variously consist of plain text, SGML tags, and/or references to previously defined entities. Certain entity types may also invoke external documents. Entities are called by reference.

OmniMark is a fourth-generation programming language used mostly in the publishing industry. It is currently a proprietary software product of Stilo International. As of July 2022, the most recent release of OmniMark was 11.0.

Parsed Character Data (PCDATA) is a data definition that originated in Standard Generalized Markup Language (SGML), and is used also in Extensible Markup Language (XML) Document Type Definition (DTD) to designate mixed content XML elements.

A Formal Public Identifier (FPI) is a short piece of text with a particular structure that may be used to uniquely identify a product, specification or document. FPIs were introduced as part of Standard Generalized Markup Language (SGML), and serve particular purposes in formats historically derived from SGML. Some of their most common uses are as part of document type declarations (DOCTYPEs) and document type definitions (DTDs) in SGML, XML and historically HTML, but they are also used in the vCard and iCalendar file formats to identify the software product which generated the file.

Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated.

XHTML+RDFa is an extended version of the XHTML markup language for supporting RDF through a collection of attributes and processing rules in the form of well-formed XML documents. XHTML+RDFa is one of the techniques used to develop Semantic Web content by embedding rich semantic markup. Version 1.1 of the language is a superset of XHTML 1.1, integrating the attributes according to RDFa Core 1.1. In other words, it is an RDFa support through XHTML Modularization.

A document type declaration, or DOCTYPE, is an instruction that associates a particular XML or SGML document with a document type definition (DTD). In the serialized form of the document, it manifests as a short string of markup that conforms to a particular syntax.

References