Comparison of HTML parsers

Last updated November 02, 2023

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

HTML traversal: offer an interface for programmers to easily access and modify the "HTML string code". Canonical example: DOM parsers.
HTML clean: to fix invalid HTML and to improve the layout and indent style of the resulting markup. Canonical example: HTML Tidy.

Parser	License	Implementation language(s)	Latest date*	HTML parsing^[1]	HTML5-compliant parsing	Clean HTML**	Update HTML***
HTML Tidy	W3C license	ANSI C	2021-07-17^[2]	Yes^[3]	Yes	Yes^[3]	Yes
HtmlUnit	Apache License 2.0	Java	2023-10-31^[4]	Yes	?	No	No
Beautiful Soup	MIT License	Python	2023-04-07^[5]	Yes	Yes	?	No
jsoup	MIT License	Java	2023-10-20^[6]	Yes	Yes	Yes	Yes
Parser	License	Implementation language(s)	Latest date*	HTML Parsing	HTML5-compliant Parsing	Clean HTML**	Update HTML***

* Latest release (of significant changes) date.

** sanitize (generating standard-compatible web-page, reduce spam, etc.) and clean (strip out surplus presentational tags, remove XSS code, etc.) HTML code.

*** Updates HTML4.X to XHTML or to HTML5, converting deprecated tags (ex. CENTER) to valid ones (ex. DIV with style="text-align:center;").

Related Research Articles

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It defines the meaning and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

A markuplanguage is a text-encoding system consisting of a set of symbols inserted in a text document to control its structure, formatting, or the relationship between its parts. Markup is often used to control the display of the document or to enrich its content to facilitate automated processing.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

Syntax highlighting is a feature of text editors that is used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colours and fonts according to the category of terms. This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. This feature is also employed in many programming related contexts, either in the form of colorful books or online websites to make understanding code snippets easier for readers. Highlighting does not affect the meaning of the text itself; it is intended only for human readers.

An HTML element is a type of HTML document component, one of several types of HTML nodes. The first used version of HTML was written by Tim Berners-Lee in 1993 and there have since been many versions of HTML. The most commonly used version is HTML 4.01, which became official standard in December 1999. An HTML document is composed of a tree of simple HTML nodes, such as text nodes, and HTML elements, which add semantics and formatting to parts of document. Each element can have HTML attributes specified. Elements can also have content, including other elements and text.

YAML(see § History and name) is a human-readable data serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax which intentionally differs from Standard Generalized Markup Language (SGML). It uses both Python-style indentation to indicate nesting, and a more compact format that uses [...] for lists and {...} for maps but forbids tab characters to use as indentation thus only some JSON files are valid YAML 1.2.

Pretty-printing is the application of any of various stylistic formatting conventions to text files, such as source code, markup, and similar kinds of content. These formatting conventions may entail adhering to an indentation style, using different color and typeface to highlight syntactic elements of source code, or adjusting size, to make the content easier for people to read, and understand. Pretty-printers for source code are sometimes called code formatters or beautifiers.

In web development, "tag soup" is a pejorative for syntactically or structurally incorrect HTML written for a web page. Because web browsers have historically treated structural or syntax errors in HTML leniently, there has been little pressure for web developers to follow published standards, and therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.

A lightweight markup language (LML), also termed a simple or humane markup language, is a markup language with simple, unobtrusive syntax. It is designed to be easy to write using any generic text editor and easy to read in its raw form. Lightweight markup languages are used in applications where it may be necessary to read the raw document as well as the final rendered output.

The Markup Validation Service is a validator by the World Wide Web Consortium (W3C) that allows Internet users to check pre-HTML5 HTML and XHTML documents for well-formed markup against a document type definition. Markup validation is an important step towards ensuring the technical quality of web pages. However, it is not a complete measure of web standards conformance. Though W3C validation is important for browser compatibility and site usability, it has not been confirmed what effect it has on search engine optimization.

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE represents a blank space punctuation character in text, used as a word divider in Western scripts.

HTML Tidy is a console application for correcting invalid HyperText Markup Language (HTML), detecting potential web accessibility errors, and for improving the layout and indent style of the resulting markup. It is also a cross-platform library for computer applications that provides HTML Tidy's features.

SCXML stands for State Chart XML: State Machine Notation for Control Abstraction. It is an XML-based markup language that provides a generic state-machine-based execution environment based on Harel statecharts.

Strikethrough is a typographical presentation of words with a horizontal line through their center, resulting in text like this. Contrary to censored or sanitized (redacted) texts, the words remain readable. This presentation signifies one of two meanings. In ink-written, typewritten, or other non-erasable text, the words are a mistake and not meant for inclusion. When used on a computer screen, however, it indicates deleted information, as popularized by Microsoft Word's revision and track changes features. It can also be used deliberately to imply a change of thought.

Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated.

HTML attributes are special words used inside the opening tag to control the element's behaviour. HTML attributes are a modifier of a HTML element type. An attribute either modifies the default functionality of an element type or provides functionality to certain element types unable to function correctly without them. In HTML syntax, an attribute is added to a HTML start tag.

The Sweble Wikitext parser is an open-source tool to parse the Wikitext markup language used by MediaWiki, the software behind Wikipedia. The initial development was done by Hannes Dohrn as a Ph.D. thesis project at the Open Source Research Group of professor Dirk Riehle at the University of Erlangen-Nuremberg from 2009 until 2011. The results were presented to the public for the first time at the WikiSym conference in 2011. Before that, the dissertation was inspected and approved by an independent scientific peer-review and was published at ACM Press.

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

References

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] 12.2 Parsing HTML documents — HTML Standard Archived 2013-01-16 at the Wayback Machine

[2] HTML Tidy release 5.8.0

[what_is_tidy-3] 1 2 What is Tidy?

[HtmlUnit_Release_2.50.0-4] HtmlUnit 3.7.0

[5] Beautiful Soup release 4.10

[6] soup Java HTML Parser release 1.16.2

[1]

[2]

[3]

[4]

[5]

[6]