Jsoup

Last updated
jsoup Java HTML Parser
Developer(s) Jonathan Hedley
Stable release
1.16.1 / April 29, 2023;0 days ago (2023-04-29) [1]
Repository
Written in Java
Operating system Cross-platform
Platform Java (JVM)
Type HTML parser
License MIT license
Website jsoup.org

jsoup is an open-source Java library designed to parse, extract, and manipulate data stored in HTML documents.

Contents

History

jsoup was created in 2009 by Jonathan Hedley. It is distributed it under the MIT License, a permissive free software license similar to the Creative Commons attribution license.

Hedley's avowed intention in writing jsoup was "to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup."

Projects powered by jsoup

jsoup is used in a number of current projects, [2] including Google's OpenRefine data-wrangling tool.

See also

Related Research Articles

<span class="mw-page-title-main">Document Object Model</span> Convention for representing and interacting with objects in HTML, XHTML, and XML documents

The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an HTML or XML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document with a logical tree. Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree; with them one can change the structure, style or content of a document. Nodes can have event handlers attached to them. Once an event is triggered, the event handlers get executed.

<span class="mw-page-title-main">WebObjects</span> Java web application server and framework originally developed by NeXT Software

WebObjects was a Java web application server and a server-based web application framework originally developed by NeXT Software, Inc.

XForms is an XML format used for collecting inputs from web forms. XForms was designed to be the next generation of HTML / XHTML forms, but is generic enough that it can also be used in a standalone manner or with presentation languages other than XHTML to describe a user interface and a set of common data manipulation tasks.

In computer-based language recognition, ANTLR, or ANother Tool for Language Recognition, is a parser generator that uses a LL(*) algorithm for parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), first developed in 1989, and is under active development. Its maintainer is Professor Terence Parr of the University of San Francisco.

<span class="mw-page-title-main">JSON</span> Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a common data format with diverse uses in electronic data interchange, including that of web applications with servers.

<span class="mw-page-title-main">NetSurf</span> Web browser

NetSurf is an open-source web browser which uses its own layout engine. Its design goal is to be lightweight and portable. NetSurf provides features including tabbed browsing, bookmarks and page thumbnailing.

libxml2 is a software library for parsing XML documents. It is also the basis for the libxslt library which processes XSLT-1.0 stylesheets.

Google Developers is Google's site for software development tools and platforms, application programming interfaces (APIs), and technical resources. The site contains documentation on using Google developer tools and APIs—including discussion groups and blogs for developers using Google's developer products.

Jakarta Mail is a Jakarta EE API used to send and receive email via SMTP, POP3 and IMAP. Jakarta Mail is built into the Java EE platform, but also provides an optional package for use in Java SE.

jQuery is a JavaScript framework designed to simplify HTML DOM tree traversal and manipulation, as well as event handling, CSS animation, and Ajax. It is free, open-source software using the permissive MIT License. As of Aug 2022, jQuery is used by 77% of the 10 million most popular websites. Web analysis indicates that it is the most widely deployed JavaScript library by a large margin, having at least 3 to 4 times more usage than any other JavaScript library.

JSDoc is a markup language used to annotate JavaScript source code files. Using comments containing JSDoc, programmers can add documentation describing the application programming interface of the code they're creating. This is then processed, by various tools, to produce documentation in accessible formats like HTML and Rich Text Format. The JSDoc specification is released under CC BY-SA 3.0, while its companion documentation generator and parser library is free software under the Apache License 2.0.

NEPOMUK is an open-source software specification that is concerned with the development of a social semantic desktop that enriches and interconnects data from different desktop applications using semantic metadata stored as RDF. Between 2006 and 2008 it was funded by a European Union research project of the same name that grouped together industrial and academic actors to develop various Semantic Desktop technologies.

JFugue is an open source programming library that allows one to program music in the Java programming language without the complexities of MIDI. It was first released in 2002 by David Koelle. Version 2 was released under a proprietary license. Versions 3 and 4 were released under the LGPL-2.1-or-later license. The current version, JFugue 5.0, was released in March 2015, under the Apache-2.0 license. Brian Eubanks has described JFugue as "useful for applications that need a quick and easy way to play music or to generate MIDI files."

<span class="mw-page-title-main">Opa (programming language)</span>

Opa is an open-source programming language for developing scalable web applications.

<span class="mw-page-title-main">OpenRefine</span> Application for data cleanup and data transformation

OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications, and can handle spreadsheet file formats such as CSV, but it behaves more like a database.

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

<span class="mw-page-title-main">Webix</span> JS/HTML5/CSS3 UI toolkit for developing complex and dynamic cross-platform web applications

Webix is a JavaScript/HTML5/CSS3 UI toolkit for developing complex and dynamic cross-platform web applications.

<span class="mw-page-title-main">React (software)</span> JavaScript library for building user interfaces

React is a free and open-source front-end JavaScript library for building user interfaces based on components. It is maintained by Meta and a community of individual developers and companies.

<span class="mw-page-title-main">Vue.js</span> Open-source JavaScript framework for building user interfaces

Vue.js is an open-source model–view–viewmodel front end JavaScript framework for building user interfaces and single-page applications. It was created by Evan You, and is maintained by him and the rest of the active core team members.

References

  1. "jsoup Java HTML Parser release 1.16.1" . Retrieved 29 Apr 2023.
  2. "Jsoup". MVNRepository / F. Rodriguez. 2015-03-08.