Jsoup

Last updated
jsoup Java HTML Parser
Developer(s) Jonathan Hedley
Stable release
1.18.3 / December 2, 2024;0 days ago (2024-12-02) [1]
Repository
Written in Java
Operating system Cross-platform
Platform Java (JVM)
Type HTML parser
License MIT license
Website jsoup.org

jsoup is an open-source Java library designed to parse, extract, and manipulate data stored in HTML documents.

Contents

History

jsoup was created in 2009 by Jonathan Hedley. It is distributed it under the MIT License, a permissive free software license similar to the Creative Commons attribution license.

Hedley's avowed intention in writing jsoup was "to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup."

Projects powered by jsoup

jsoup is used in a number of current projects, [2] including Google's OpenRefine data-wrangling tool.

See also

Related Research Articles

A document type definition (DTD) is a specification file that contains set of markup declarations that define a document type for an SGML-family markup language. The DTD specification file can be used to validate documents.

The MIT License is a permissive software license originating at the Massachusetts Institute of Technology (MIT) in the late 1980s. As a permissive license, it puts very few restrictions on reuse and therefore has high license compatibility.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

Mung or munge is computer jargon for a series of potentially destructive or irrevocable changes to a piece of data or a file. It is sometimes used for vague data transformation steps that are not yet clear to the speaker. Common munging operations include removing punctuation or HTML tags, data parsing, filtering, and transformation.

XInclude is a generic mechanism for merging XML documents, by writing inclusion tags in the "main" document to automatically include other documents or parts thereof. The resulting document becomes a single composite XML Information Set. The XInclude mechanism can be used to incorporate content from either XML files or non-XML text files.

<span class="mw-page-title-main">AWStats</span> Web analytics reporting tool

AWStats is an open source Web analytics reporting tool, suitable for analyzing data from Internet services such as web, streaming media, mail, and FTP servers. AWStats parses and analyzes server log files, producing HTML reports. Data is visually presented within reports by tables and bar graphs. Static reports can be created through a command line interface, and on-demand reporting is supported through a Web browser CGI program.

Microformats (μF) are a set of defined HTML classes created to serve as consistent and descriptive metadata about an element, designating it as representing a certain type of data. They allow software to process the information reliably by having set classes refer to a specific type of data rather than being arbitrary.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

HTML Tidy is a console application for correcting invalid HyperText Markup Language (HTML), detecting potential web accessibility errors, and for improving the layout and indent style of the resulting markup. It is also a cross-platform library for computer applications that provides HTML Tidy's features.

libxml2 is a software library for parsing XML documents. It is also the basis for the libxslt library which processes XSLT-1.0 stylesheets.

This comparison only covers software licenses which have a linked Wikipedia article for details and which are approved by at least one of the following expert groups: the Free Software Foundation, the Open Source Initiative, the Debian Project and the Fedora Project. For a list of licenses not specifically intended for software, see List of free-content licences.

Jakarta Mail is a Jakarta EE API used to send and receive email via SMTP, POP3 and IMAP. Jakarta Mail is built into the Jakarta EE platform, but also provides an optional package for use in Java SE.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

jQuery is a JavaScript library designed to simplify HTML DOM tree traversal and manipulation, as well as event handling, CSS animations, and Ajax. It is free, open-source software using the permissive MIT License. As of August 2022, jQuery is used by 77% of the 10 million most popular websites. Web analysis indicates that it is the most widely deployed JavaScript library by a large margin, having at least three to four times more usage than any other JavaScript library.

NEPOMUK is an open-source software specification that is concerned with the development of a social semantic desktop that enriches and interconnects data from different desktop applications using semantic metadata stored as RDF. Between 2006 and 2008 it was funded by a European Union research project of the same name that grouped together industrial and academic actors to develop various Semantic Desktop technologies.

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.

<span class="mw-page-title-main">MoinMoin</span> Free wiki software

MoinMoin is a wiki engine implemented in Python, initially based on the PikiPiki wiki engine. Its name is a play on the North German greeting Moin, repeated as in WikiWiki. The MoinMoin code is licensed under the GNU General Public License v2, or any later version.

<span class="mw-page-title-main">Opa (programming language)</span>

Opa is a programming language for developing scalable web applications. It is free and open-source software released under a GNU Affero General Public License (AGPLv3), and an MIT License.

<span class="mw-page-title-main">OpenRefine</span> Application for data cleanup and data transformation

OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications, and can handle spreadsheet file formats such as CSV, but it behaves more like a database.

Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.

References

  1. "jsoup 1.18.3 quick update" . Retrieved 2024-12-02.
  2. "Jsoup". MVNRepository / F. Rodriguez. 2015-03-08.