Comparison of HTML parsers


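HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes: HTML traversal, which gives programmers an interface to access and modify the HTML code (the canonical example is a DOM parser), and HTML cleaning, which fixes invalid HTML and improves the layout and indentation of the resulting markup (the canonical example is HTML Tidy).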
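
As a rough illustration of what automated HTML parsing gives a programmer, the sketch below collects the link targets from a fragment of markup. It uses Python's standard-library `html.parser` module, which is not one of the parsers compared in this article, and the names `LinkCollector` and `collect_links` are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Return the href values of all <a> tags in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser here is event-driven rather than tree-building, so it reports tags as it meets them instead of exposing a document tree. Libraries such as Beautiful Soup or jsoup provide a richer tree interface for the same kind of task.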

| Parser | License | Implementation language(s) | Latest date* | HTML parsing [1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|---|
| HTML Tidy | W3C license | ANSI C | 2021-07-17 [2] | Yes [3] | Yes | Yes [3] | Yes |
| HtmlUnit | Apache License 2.0 | Java | 2023-10-31 [4] | Yes | ? | No | No |
| Beautiful Soup | MIT License | Python | 2023-04-07 [5] | Yes | Yes | ? | No |
| jsoup | MIT License | Java | 2025-06-23 [6] | Yes | Yes | Yes | Yes |
* Date of the latest release that contained significant changes.
** Sanitizes HTML code (generates a standards-compliant web page, reduces spam, etc.) and cleans it (strips surplus presentational tags, removes XSS code, etc.).
*** Updates HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
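
The kind of update described in the last note, replacing a deprecated CENTER element with a styled DIV, can be sketched as follows. This is a minimal illustration built on Python's standard-library `html.parser`, not the method used by any parser in the table. The names `CenterUpdater` and `update_center` are hypothetical, and the sketch assumes that any attributes on the CENTER tag can be discarded.

```python
from html.parser import HTMLParser


class CenterUpdater(HTMLParser):
    """Re-emits HTML, replacing <center> with <div style="text-align:center;">."""

    def __init__(self):
        # Keep character references unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: attributes on <center> are dropped.
            self.out.append('<div style="text-align:center;">')
        else:
            # Re-emit every other start tag exactly as it was written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(html):
    """Return the HTML string with <center> elements rewritten as styled <div>s."""
    updater = CenterUpdater()
    updater.feed(html)
    updater.close()
    return "".join(updater.out)
```

A dedicated tool such as HTML Tidy handles many more deprecated constructs and also repairs malformed markup, which this sketch does not attempt.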

References

  1. "HTML Standard". html.spec.whatwg.org. Archived from the original on January 16, 2013.
  2. "Release 5.8.0 · htacg/tidy-html5". GitHub.
  3. "HTML Tidy". www.html-tidy.org.
  4. "Release HtmlUnit 3.7.0 · HtmlUnit/htmlunit". GitHub.
  5. "Index of /software/BeautifulSoup/bs4/download/4.12". www.crummy.com.
  6. "jsoup release 1.21.1 (2025-Jun-23)". jsoup.org.