Polyglot markup

Last updated

In computing, a polyglot markup is a document or script written in a valid form of multiple markup languages, which performs the same output, independent of the markup's parser, layout engine, or interpreter. In general, the polyglot markup is a common subset of two or more languages, that can be used as a robust or simplified profile.

Contents

Polyglot HTML is HTML that has been written to conform to both the HTML and XHTML specifications. [1] A polyglot document can therefore be parsed as either HTML (which is SGML-compatible) or XML, and will produce the same DOM structure either way. For example, in order for an HTML5 document to meet these criteria, the two requirements are that it must have an HTML5 doctype, and be written in well-formed XHTML. [2] The same document can then be served as either HTML or XHTML, depending on browser support and MIME type.

Polyglot HTML requirements

As expressed by the html-polyglot recommendation, [1] to write a polyglot HTML5 document, the following key points should be observed:

  1. Processing instructions and the XML declaration are both forbidden in polyglot markup
  2. Specifying a document's character encoding
  3. The DOCTYPE
  4. Namespaces
  5. Element syntax (i.e. End tags are not optional. Use self-closing tags for void elements.)
  6. Element content
  7. Text (i.e. pre and textarea should not start with newline character)
  8. Attributes (i.e. Values must be quoted)
  9. Named entity references (i.e. Only amp, lt, gt, apos, quot)
  10. Comments (i.e. Use <!-- syntax -->)
  11. Scripting and styling polyglot markup

The most basic possible polyglot markup document would therefore look like this: [1]

<!DOCTYPE html><htmlxmlns="http://www.w3.org/1999/xhtml"lang=""xml:lang=""><head><title>The title element must not be empty.</title></head><body></body></html>

In a polyglot markup document non-void elements (such as script, p, div) cannot be self-closing even if they are empty, as this is not valid HTML. [3] For example, to add an empty textarea to a page, one cannot use <textarea/>, but has to use <textarea></textarea> instead.

See also

Related Research Articles

A document type definition (DTD) is a set of markup declarations that define a document type for an SGML-family markup language.

<span class="mw-page-title-main">HTML</span> Hypertext Markup Language

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

<span class="mw-page-title-main">Markup language</span> Modern system for annotating a document

Markuplanguage is a text-encoding system consisting of a set of symbols inserted in a text document to control its structure, formatting, or the relationship between its parts. Markup is often used to control the display of the document or to enrich its content to facilitating automated processing.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

Mathematical Markup Language (MathML) is a mathematical markup language, an application of XML for describing mathematical notations and capturing both its structure and content, and is one of a number of mathematical markup languages. Its aim is to natively integrate mathematical formulae into World Wide Web pages and other documents. It is part of HTML5 and is a ISO/IEC standard ISO/IEC 40314 since 2015.

An HTML element is a type of HTML document component, one of several types of HTML nodes. The first used version of HTML was written by Tim Berners-Lee in 1993 and there have since been many versions of HTML. The most commonly used version is HTML 4.01, which became official standard in December 1999. An HTML document is composed of a tree of simple HTML nodes, such as text nodes, and HTML elements, which add semantics and formatting to parts of document. Each element can have HTML attributes specified. Elements can also have content, including other elements and text.

In computing, a polyglot is a computer program or script written in a valid form of multiple programming languages or file formats. The name was coined by analogy to multilingualism. A polyglot file is composed by combining syntax from two or more different formats. When the file formats are to be compiled or interpreted as source code, the file can be said to be a polyglot program, though file formats and source code syntax are both fundamentally streams of bytes, and exploiting this commonality is key to the development of polyglots. Polyglot files have practical applications in compatibility, but can also present a security risk when used to bypass validation or to exploit a vulnerability.

XHTML Basic is an XML-based structured markup language primarily used for simple user agents, typically mobile devices.

In web page design, and generally for all markup languages such as SGML, HTML, and XML, a well-formed element is one that is either a) opened and subsequently closed, or b) an empty element, which in that case must be terminated; and in either case which is properly nested so that it does not overlap with other elements.

In web development, "tag soup" is a pejorative for syntactically or structurally incorrect HTML written for a web page. Because web browsers have historically treated structural or syntax errors in HTML leniently, there has been little pressure for web developers to follow published standards, and therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

OmniMark is a fourth-generation programming language used mostly in the publishing industry. It is currently a proprietary software product of Stilo International. As of July 2022, the most recent release of OmniMark was 11.0.

RDFa or Resource Description Framework in Attributes is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The Resource Description Framework (RDF) data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.

<span class="mw-page-title-main">HTML5</span> Fifth and current version of hypertext markup language

HTML5 is a markup language used for structuring and presenting content on the World Wide Web. It is the fifth and final major HTML version that is a World Wide Web Consortium (W3C) recommendation. The current specification is known as the HTML Living Standard. It is maintained by the Web Hypertext Application Technology Working Group (WHATWG), a consortium of the major browser vendors.

SXML is an alternative syntax for writing XML data as S-expressions, to facilitate working with XML data in Lisp and Scheme. An associated suite of tools implements XPath, SAX and XSLT for SXML in Scheme and are available in the GNU Guile implementation of that language.

Haml is a templating system that is designed to avoid writing inline code in a web document and make the HTML cleaner. Haml gives you the flexibility to have some dynamic content in HTML. Similar to other template systems like eRuby, Haml also embeds some code that gets executed during runtime and generates HTML code in order to provide some dynamic content. In order to run Haml code, files need to have a .haml extension. These files are similar to .erb or .eRuby files, which also help embed Ruby code while developing a web application.

Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated.

XHTML+RDFa is an extended version of the XHTML markup language for supporting RDF through a collection of attributes and processing rules in the form of well-formed XML documents. XHTML+RDFa is one of the techniques used to develop Semantic Web content by embedding rich semantic markup. Version 1.1 of the language is a superset of XHTML 1.1, integrating the attributes according to RDFa Core 1.1. In other words, it is an RDFa support through XHTML Modularization.

A document type declaration, or DOCTYPE, is an instruction that associates a particular XML or SGML document with a document type definition (DTD). In the serialized form of the document, it manifests as a short string of markup that conforms to a particular syntax.

References