Tag soup

Last updated

In web development, "tag soup" is a pejorative for HTML written for a web page that is syntactically or structurally incorrect. Web browsers have historically treated structural or syntax errors in HTML leniently, so there has been little pressure for web developers to follow published standards. Therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.

Contents

An HTML parser (part of a web browser) that is capable of interpreting HTML-like markup even if it contains invalid syntax or structure may be called a tag soup parser. All major web browsers currently have a tag soup parser for interpreting malformed HTML, with most error-handling elements standardized.

"Tag soup" encompasses many common authoring mistakes, such as malformed HTML tags, improperly nested HTML elements, and unescaped character entities (especially ampersands (&) and less-than signs (<)).

I have used this term in my instruction for years to characterize the jumble of angle brackets acting like tags in HTML in pages that are accepted by browsers. Improper minimization, overlapping constructs ... stuff that looks like SGML markup but the creator didn't know or respect SGML rules for the HTML vocabulary. In effect a soupy collection of text and markup. [...] I've never seen the term defined anywhere.

G. Ken Holman, Re: [xml-dev] What is Tag Soup?, XML development mailing list, 11 Oct 2002.

The Markup Validation Service is a resource for web page authors to avoid creating tag soup.

Overview

"Tag soup" is a term used to denigrate various practices in web authoring. Some of these (roughly ordered from most severe to least severe) include:

  1. Malformed markup where tags are improperly nested or incorrectly closed. For example, the following:
    <p>This is a malformed fragment of <em>HTML.</p></em>
  2. Invalid structure where elements are improperly nested according to the DTD for the document. Examples of this include nesting a "ul" element directly inside another "ul" element for any of the HTML 4.01 or XHTML DTDs. Dan Connolly cites the use of title element outside the head section. [1]
  3. Use of proprietary or undefined elements and attributes instead of those defined in W3C recommendations. For example the use of the Blink element or the Marquee element which were non-standard elements originally only supported by Netscape and Internet Explorer browsers respectively.

Causes and implications

Malformed markup

Malformed markup is arguably the most severe problem in web authoring. However, thanks to better education and information and perhaps with some help from XHTML, the issue of malformed markup is becoming less common. Browsers, when faced with malformed markup, must guess the intended meaning of the author. They must infer closing tags where they expect them and then infer opening tags to match other closing-tags. The interpretation can vary markedly from one browser to the next. [2]

While many graphical web editors produce well-formed markup, an author writing code manually with a text-editor and then testing only in one browser can easily miss such errors. The presentation can therefore vary drastically from one browser to another as each tries to "correct" the authorʼs intent in different ways and then applies styling to those "corrections".

Invalid document structure

Invalid document structure here means only the use of attributes and elements where they do not belong. For example, placing a "cite" attribute on a "cite" element is invalid since the HTML and XHTML DTDs do not ascribe any meaning to that attribute on that element. Similarly, including a "p" element within the content of an "em" element is also invalid. With the move toward separating malformed markup from invalid markup, the problems with invalid markup have increasingly been seen as less severe. Some have begun to advocate looser content models that allow greater flexibility in authoring HTML documents (whether in HTML or XHTML). However, use of invalid markup can blur the author's intended meaning, though not as severely as malformed markup.

Many graphic web editors still produce invalid markup. Moreover, many professional web designers and authors pay little attention to issues of validity. It is common to see invalid markup in many of the sites throughout the World Wide Web.

Use of proprietary/discontinued elements

In the early age of the web (much of the 1990s), the design of the official HTML specification became increasingly strained, compared to the desire of designers for flexibility in creating visually vibrant designs. In response to this pressure, browser makers unilaterally added new proprietary features to HTML that fell outside the standards at the time. This meant there were proprietary elements in HTML that worked in some browsers, but not in others.

To some extent, this problem was slowed by the introduction of new standards by the W3C, such as CSS, introduced in 1998, which helped to provide greater flexibility in the presentation and layout of web pages without the need for large numbers of additional HTML elements and attributes.

Moreover, in HTML 4 and XHTML 1, many elements were either superseded by a single semantic construct (such as object elements replacing proprietary applet and embed elements) or deprecated due to being presentational (such as the "s", "strike" and "u" elements).

Nevertheless, browser developers continued to introduce new elements to HTML when they perceived a need. Some browsers included tabindex attributes on any element. Developers of Apple's WebKit introduced the canvas element, a version of which was subsequently adopted by Mozilla.

In 2004, Apple, Mozilla and Opera founded the WHATWG, with the intent of creating a new version of the HTML specification which all browser behavior would match. This included changing the specification if necessary to match an existing consensus between different browsers. [3]

The canvas [4] and embed [5] elements were subsequently standardised by the WHATWG. Certain elements (including b, i and small) which were previously considered presentational and deprecated were included, but defined in a media-independent rather than visual manner. [6]

Versions of the WHATWG specification were published by the W3C as HTML5. [3]

Evolving specifications to solve tag soup

While some of the issues of tag soup are due to shortcomings of browsers and sometimes due to a lack of information for web authors, some of the proliferation of tag soup was due to missing links in the web standards themselves. The W3C has spearheaded several efforts to address the shortcomings of web standards. As more browsers support newer revisions of standards, the pressure on web developers to use non-standard code to solve problems diminishes.

Cascading Style Sheets (CSS)

Cascading Style Sheets (CSS) provide a mechanism to specify the presentation of elements in a document without altering the markup structure of the document. Before CSS was commonplace, web developers may have resorted to some structurally invalid markup to achieve certain presentational goals – for example, including block level elements within inline elements to obtain a particular effect, or using sometimes large numbers of <font> and other display-specific HTML tags. CSS uses style rules to accomplish these tasks while leaving the markup cleaner and simpler.

XML and XHTML

XHTML is a reformulation of the HTML language based on XML. XHTML was developed to address many of the problems associated with tag soup.

XML allows parsers to separate the process of interpreting the document syntax and its structure. In HTML and SGML, a parser needed to know certain rules about elements during parsing, such as what elements could be contained within other elements and which elements implicitly close the previous element. This is because in HTML and SGML, closing tags and even opening tags were optional on some elements. By requiring all elements to have explicit opening and closing tags, XML parsers can parse the document and produce a document tree without any knowledge of the document type. This allows parsers to be universal and very light-weight, and to be separated from the process of validating or interpreting the document.

The XML specification clearly defines that a conforming user agent (such as a web browser) must not accept a document, and not continue parsing it, if any syntactical error is encountered. Thus, a browser interpreting a web page as XHTML will refuse to display the page if it encounters a formation error. This can help ensure that when authors test XHTML code against a conforming browser they will immediately be informed of malformation problems: perhaps the most severe problem facing web browsers. When code is malformed, the intent of the author is ambiguous. Without the directives of XML, HTML browsers must use complex algorithms to infer the author's intended meaning in a wide range of cases where invalid syntax is encountered.

XML and XHTML introduce the concept of namespaces. With namespaces, authors or communities of authors can define new elements and attributes with new semantics, and intermix those within their XHTML documents. Namespaces ensure that element names from the various namespaces will not be conflated. For example, a "table" element could be defined in a new namespace with new semantics different from the HTML "table" element and the browser will be able to differentiate between the two. In providing namespaces, XHTML combined with CSS allow authoring communities to easily extend the semantic vocabulary of documents. This accommodates the use of proprietary elements so long as those elements can be presented to the intended audience through complete style sheet definitions (including aural/speech and tactile styles).

XHTML documents may be served on the web using the internet media type application/xhtml+xml or text/html [7] Microsoft Internet Explorer versions before 9 do not display XHTML documents served as application/xhtml+xml. IE9 and later versions are compliant. See also the discussion of this issue in the XHTML article.

HTML5

HTML5 aims to be the most complete solution to the problem of tag soup thus far while remaining as backwards- and forwards-compatible as possible. By contrast to XHTML, which departs from backwards compatibility and takes the approach that parsers should become less tolerant of badly formed markup, HTML5 acknowledges that badly formed HTML code already exists in large quantities and will probably continue to be used, and takes the view that the specification should be expanded to ensure maximum compatibility with such code.

Thus, the HTML 5 specification has altered its definition of HTML syntax both to accommodate common syntax in use today, and to explicitly describe exactly how "badly formed code" should be treated by the parser. The handling of badly formed code now has a place in the specification itself, hopefully reducing the need for future HTML parsers to implement additional, out-of-specification measures for dealing with code that it does not recognize.

Tools

Many software tools exist which can parse and attempt to correct malformed markup, among other functions.

Valid deviations from XHTML

Unlike the strict XHTML, HTML and its predecessor SGML are designed to be written by humans, and already have a significant degree of flexibility in syntax to reduce boilerplate. These differences do not make the document invalid and are therefore not tag soup. The following apply to both HTML 4 and HTML5, [9] and examples date back to the first days of HTML. [10]

Despite their validity, these omissions still require a special parser with a knowledge of HTML (as opposed to the more rigid XML) to parse. In addition, it is common for tools to "fix" these structures too. For example, HTML Tidy allows omitting optional tags, but defaults to not doing so. [11]

See also

Notes

Related Research Articles

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

A document type definition (DTD) is a specification file that contains set of markup declarations that define a document type for an SGML-family markup language. The DTD specification file can be used to validate documents.

<span class="mw-page-title-main">Document Object Model</span> Convention for representing and interacting with objects in HTML, XHTML, and XML documents

The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an HTML or XML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document with a logical tree. Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree; with them one can change the structure, style or content of a document. Nodes can have event handlers attached to them. Once an event is triggered, the event handlers get executed.

<span class="mw-page-title-main">HTML</span> HyperText Markup Language

HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

<span class="mw-page-title-main">Markup language</span> Modern system for annotating a document

A markuplanguage is a text-encoding system which specifies the structure and formatting of a document and potentially the relationship between its parts. Markup can control the display of a document or enrich its content to facilitate automated processing.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

Mathematical Markup Language (MathML) is a mathematical markup language, an application of XML for describing mathematical notations and capturing both its structure and content, and is one of a number of mathematical markup languages. Its aim is to natively integrate mathematical formulae into World Wide Web pages and other documents. It is part of HTML5 and standardised by ISO/IEC since 2015.

An HTML element is a type of HTML document component, one of several types of HTML nodes. The first used version of HTML was written by Tim Berners-Lee in 1993 and there have since been many versions of HTML. The current de facto standard is governed by the industry group WHATWG and is known as the HTML Living Standard.

Web standards are the formal, non-proprietary standards and other technical specifications that define and describe aspects of the World Wide Web. In recent years, the term has been more frequently associated with the trend of endorsing a set of standardized best practices for building web sites, and a philosophy of web design and development that includes those methods.

XML namespaces are used for providing uniquely named elements and attributes in an XML document. They are defined in a W3C recommendation. An XML instance may contain element or attribute names from more than one XML vocabulary. If each vocabulary is given a namespace, the ambiguity between identically named elements or attributes can be resolved.

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

RDFa or Resource Description Framework in Attributes is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The Resource Description Framework (RDF) data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.

<span class="mw-page-title-main">HTML5</span> Fifth and previous version of hypertext markup language

HTML5 is a markup language used for structuring and presenting hypertext documents on the World Wide Web. It was the fifth and final major HTML version that is now a retired World Wide Web Consortium (W3C) recommendation. The current specification is known as the HTML Living Standard. It is maintained by the Web Hypertext Application Technology Working Group (WHATWG), a consortium of the major browser vendors.

The Web Hypertext Application Technology Working Group (WHATWG) is a community of people interested in evolving HTML and related technologies. The WHATWG was founded by individuals from Apple Inc., the Mozilla Foundation and Opera Software, leading Web browser vendors in 2004.

A Formal Public Identifier (FPI) is a short piece of text with a particular structure that may be used to uniquely identify a product, specification or document. FPIs were introduced as part of Standard Generalized Markup Language (SGML), and serve particular purposes in formats historically derived from SGML. Some of their most common uses are as part of document type declarations (DOCTYPEs) and document type definitions (DTDs) in SGML, XML and historically HTML, but they are also used in the vCard and iCalendar file formats to identify the software product which generated the file.

Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages which mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated.

XHTML+RDFa is an extended version of the XHTML markup language for supporting RDF through a collection of attributes and processing rules in the form of well-formed XML documents. XHTML+RDFa is one of the techniques used to develop Semantic Web content by embedding rich semantic markup. Version 1.1 of the language is a superset of XHTML 1.1, integrating the attributes according to RDFa Core 1.1. In other words, it is an RDFa support through XHTML Modularization.

A document type declaration, or DOCTYPE, is an instruction that associates a particular XML or SGML document with a document type definition (DTD). In the serialized form of the document, it manifests as a short string of markup that conforms to a particular syntax.

References

  1. Winer, Dave (12 October 2002). "What is Tag Soup?". Scripting News. Dave Winer. Archived from the original on 26 February 2004. Retrieved 23 November 2017. The example he cited is the <title> element. It really only makes sense in the <head> of a document, but apparently one or more browsers would let you set the title of a page in the body of the page! It's not like this makes the earth crumble or the sky fall, everything can proceed normally, but it's wrong to do it there and the world would be a (slightly) better place if browsers didn't allow it.
  2. Hickson, Ian (21 November 2002). "Tag Soup: How UAs handle <x><y></x></y>" . Retrieved 11 September 2020.
  3. 1 2 WHATWG. "1.6 History". HTML Standard.
  4. WHATWG. "4.12.5 The canvas element". HTML Standard.
  5. WHATWG. "4.8.6 The embed element". HTML Standard.
  6. WHATWG. "FAQ". WHATWG.org.
  7. "XHTML 1.0 The Extensible HyperText Markup Language (Second Edition) A Reformulation of HTML 4 in XML 1.0, Appendic C. HTML Compatibility Guidelines". W3C Recommendation. 1 August 2002 [26 January 2000]. Retrieved 2008-09-13. XHTML Documents which follow the guidelines set forth in Appendix C, "HTML Compatibility Guidelines" may be labeled with the Internet Media Type "text/html" [RFC2854], as they are compatible with most HTML browsers. Those documents, and any other document conforming to this specification, may also be labeled with the Internet Media Type "application/xhtml+xml" as defined in [RFC3236]. For further information on using media types with XHTML, see the informative note [XHTMLMIME].
  8. Tagliaferri, Lisa (20 July 2017). "How To Scrape Web Pages with Beautiful Soup and Python 3". Digital Ocean Tutorials. Digital Ocean. Archived from the original on 2 September 2017. Retrieved 23 November 2017. Currently available as Beautiful Soup 4 and compatible with both Python 2.7 and Python 3, Beautiful Soup creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags or tag soup and other malformed markup).
  9. "§3 On SGML and HTML". HTML 4.01 Specification. W3C. 24 December 1999. §3.2.1 Elements.; HTML 5.1 2nd Edition § 8.1.2.4. Optional tags
  10. See source of HTML 2 spec § Document Structure for omission of closing li, and The original HTML Tags document for omission of closing p and head.
  11. "HTML Tidy 5.7.0 Options Quick Reference". api.html-tidy.org.