Beautiful Soup (HTML parser)

Beautiful Soup
Original author(s): Leonard Richardson
Initial release: 2004
Stable release: 4.12.3 / 17 January 2024
Written in: Python
Platform: Python
Type: HTML parser library, web scraping
Website: www.crummy.com/software/BeautifulSoup/

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,[3] which is useful for web scraping.[2][4]


History

Beautiful Soup was started in 2004 by Leonard Richardson.[citation needed] It takes its name from the poem "Beautiful Soup" in Alice's Adventures in Wonderland,[5] and is a reference to the term "tag soup", meaning poorly structured HTML code.[6] Richardson continues to contribute to the project,[7] which is additionally supported by paid open-source maintainers from the company Tidelift.[8]

Versions

Beautiful Soup 3 was the official release line from May 2006 to March 2012. The current release series is Beautiful Soup 4.x.

Support for Python 2.7 was retired in 2021; release 4.9.3 was the last version to support Python 2.7.[9]

Usage

Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops.[10]
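
A minimal sketch of such tree navigation, using illustrative markup and variable names rather than anything taken from the library's documentation, might look like this:

# Navigating a Beautiful Soup parse tree (illustrative example).
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello</p><p>World</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.p)                                     # first <p> element in the tree
print(soup.find('p', class_='intro').get_text())  # text of the element with class "intro"

# The tree can be iterated over with ordinary Python loops.
for paragraph in soup.find_all('p'):
    print(paragraph.name, paragraph.get_text())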

Code example

The example below uses the Python standard library's urllib[11] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.

#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')

for anchor in soup.find_all('a'):
    print(anchor.get('href', '/'))

Related Research Articles

Document Object Model: convention for representing and interacting with objects in HTML, XHTML, and XML documents

The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an HTML or XML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document with a logical tree. Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree; with them one can change the structure, style or content of a document. Nodes can have event handlers attached to them. Once an event is triggered, the event handlers get executed.
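
As a sketch, this DOM style of tree access is available in Python through the standard library's xml.dom.minidom module; the markup below is illustrative.

# DOM-style access with Python's standard-library xml.dom.minidom.
from xml.dom.minidom import parseString

doc = parseString('<ul><li>first</li><li>second</li></ul>')

# Every part of the document is a node object in a tree.
for item in doc.getElementsByTagName('li'):
    print(item.firstChild.data)               # text content of each <li> node

# DOM methods can also modify the tree's structure.
new_item = doc.createElement('li')
new_item.appendChild(doc.createTextNode('third'))
doc.documentElement.appendChild(new_item)
print(doc.documentElement.toxml())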

HTML: HyperText Markup Language

HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

YAML is a human-readable data serialization language. It is commonly used for configuration files and in applications where data are being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses Python-style indentation to indicate nesting and does not require quotes around most string values.
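
For illustration, the sketch below parses a small made-up YAML document from Python; it assumes the third-party PyYAML package is installed, which is not part of the standard library.

# Parsing YAML with the third-party PyYAML package (assumed installed).
import yaml

document = """
server:
  host: example.org
  ports:
    - 80
    - 443
  debug: false
"""

config = yaml.safe_load(document)   # nesting follows the indentation
print(config['server']['host'])     # example.org
print(config['server']['ports'])    # [80, 443]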

Web standards are the formal, non-proprietary standards and other technical specifications that define and describe aspects of the World Wide Web. In recent years, the term has been more frequently associated with the trend of endorsing a set of standardized best practices for building web sites, and a philosophy of web design and development that includes those methods.

In web development, "tag soup" is a pejorative for HTML written for a web page that is syntactically or structurally incorrect. Web browsers have historically treated structural or syntax errors in HTML leniently, so there has been little pressure for web developers to follow published standards. Therefore, all browser implementations need mechanisms to cope with the appearance of "tag soup", accepting and correcting invalid syntax and structure where possible.
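
Beautiful Soup itself is designed to cope with such markup. The sketch below, using deliberately malformed made-up markup, shows the parser supplying the missing end tags.

# Parsing "tag soup": Beautiful Soup repairs the unclosed tags below.
from bs4 import BeautifulSoup

broken = '<p>First paragraph<p>Second paragraph with <b>unclosed bold'
soup = BeautifulSoup(broken, 'html.parser')

# Prints well-formed markup with the missing end tags supplied;
# the exact nesting depends on the underlying parser.
print(soup.prettify())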

A query string is a part of a uniform resource locator (URL) that assigns values to specified parameters. A query string commonly includes fields added to a base URL by a Web browser or other client application, for example as part of an HTML document, choosing the appearance of a page, or jumping to positions in multimedia content.
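
As a sketch, Python's standard-library urllib.parse can both build and decompose query strings; the parameter names below are illustrative.

# Building and reading a query string with the standard library.
from urllib.parse import urlencode, urlparse, parse_qs

query = urlencode({'title': 'Main_Page', 'action': 'history'})
url = 'https://en.wikipedia.org/w/index.php?' + query
print(url)                                # .../index.php?title=Main_Page&action=history

params = parse_qs(urlparse(url).query)    # back to a dict of parameter values
print(params['title'])                    # ['Main_Page']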

A lightweight markup language (LML), also termed a simple or humane markup language, is a markup language with simple, unobtrusive syntax. It is designed to be easy to write using any generic text editor and easy to read in its raw form. Lightweight markup languages are used in applications where it may be necessary to read the raw document as well as the final rendered output.

PyQt: Python GUI library

PyQt is a Python binding of the cross-platform GUI toolkit Qt, implemented as a Python plug-in. PyQt is free software developed by the British firm Riverbank Computing. It is available under similar terms to Qt versions older than 4.5; this means a variety of licenses including the GNU General Public License (GPL) and a commercial license, but not the GNU Lesser General Public License (LGPL). PyQt supports Microsoft Windows as well as various kinds of UNIX, including Linux and macOS.

The anchor text, link label or link text is the visible, clickable text in an HTML hyperlink. The term "anchor" was used in older versions of the HTML specification for what is currently referred to as the a element, or <a>. The HTML specification does not have a specific term for anchor text, but refers to it as "text that the a element wraps around". In XML terms, the anchor text is the content of the element, provided that the content is text.
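
A short sketch of reading anchor text with Beautiful Soup, on made-up markup, follows; each <a> element's href attribute and visible text are printed.

# Extracting anchor text (the content of each <a> element).
from bs4 import BeautifulSoup

html = '<p>See the <a href="/wiki/Web_scraping">web scraping</a> article.</p>'
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a'):
    print(anchor.get('href'), anchor.get_text())   # /wiki/Web_scraping web scraping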

Dynamic web page: type of web page

A dynamic web page is a web page constructed at runtime, as opposed to a static web page, which is delivered as it is stored. A server-side dynamic web page is a web page whose construction is controlled by an application server processing server-side scripts. In server-side scripting, parameters determine how the assembly of every new web page proceeds, including the setting up of more client-side processing. A client-side dynamic web page processes the web page using JavaScript running in the browser as it loads. JavaScript can interact with the page via the Document Object Model (DOM) to query page state and modify it. Even though a web page can be dynamic on the client side, it can still be hosted on a static hosting service such as GitHub Pages or Amazon S3 as long as there is no server-side code included.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Web template system: system in web publishing

A web template system in web publishing allows web designers and developers to work with web templates to automatically generate custom web pages, such as the results from a search. This reuses static web page elements while defining dynamic elements based on web request parameters. Web templates support static content, providing basic structure and appearance. Developers can implement templates from content management systems, web application frameworks, and HTML editors.

HTML5: fifth major version of HyperText Markup Language

HTML5 is a markup language used for structuring and presenting hypertext documents on the World Wide Web. It was the fifth and final major HTML version that is now a retired World Wide Web Consortium (W3C) recommendation. The current specification is known as the HTML Living Standard. It is maintained by the Web Hypertext Application Technology Working Group (WHATWG), a consortium of the major browser vendors.

Geo is a microformat used for marking up geographical coordinates in HTML. Coordinates are expected in angular units of degrees and geodetic datum WGS84. Although termed a "draft" specification, the format is a de facto standard, stable and in widespread use; not least as a sub-set of the published hCalendar and hCard microformat specifications, neither of which is still a draft.
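
As a sketch, such coordinates can be read with Beautiful Soup; the markup below is an illustrative example of the geo class-name conventions rather than text taken from the specification.

# Reading geo-microformat coordinates from illustrative markup.
from bs4 import BeautifulSoup

html = ('<span class="geo">'
        '<span class="latitude">52.4862</span>; '
        '<span class="longitude">-1.8905</span>'
        '</span>')
soup = BeautifulSoup(html, 'html.parser')
geo = soup.find(class_='geo')
latitude = float(geo.find(class_='latitude').get_text())
longitude = float(geo.find(class_='longitude').get_text())
print(latitude, longitude)   # 52.4862 -1.8905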

Web2py is an open-source web application framework written in the Python programming language. Web2py allows web developers to program dynamic web content using Python. Web2py is designed to help reduce tedious web development tasks, such as developing web forms from scratch, although a web developer may build a form from scratch if required.

Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages which mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated.

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

jsoup is an open-source Java library designed to parse, extract, and manipulate data stored in HTML documents.

Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.

Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines. This is a specific form of screen scraping or web scraping dedicated to search engines only.

References

  2. "Beautiful Soup website". Retrieved 18 April 2012. "Beautiful Soup is licensed under the same terms as Python itself."
  3. Hajba, Gábor László (2018), Hajba, Gábor László (ed.), "Using Beautiful Soup", Website Scraping with Python: Using BeautifulSoup and Scrapy, Apress, pp. 41–96, doi:10.1007/978-1-4842-3925-4_3, ISBN   978-1-4842-3925-4
  4. Real Python. "Beautiful Soup: Build a Web Scraper With Python – Real Python". realpython.com. Retrieved 2023-06-01.
  5. makcorps (2022-12-13). "BeautifulSoup tutorial: Let's Scrape Web Pages with Python" . Retrieved 2024-01-24.
  6. "Python Web Scraping". Udacity. 2021-02-11. Retrieved 2024-01-24.
  7. "Code : Leonard Richardson". Launchpad. Retrieved 2020-09-19.
  8. Tidelift. "beautifulsoup4 | pypi via the Tidelift Subscription". tidelift.com. Retrieved 2020-09-19.
  9. Richardson, Leonard (7 Sep 2021). "Beautiful Soup 4.10.0". beautifulsoup. Google Groups. Retrieved 27 September 2022.
  10. "How To Scrape Web Pages with Beautiful Soup and Python 3 | DigitalOcean". www.digitalocean.com. Retrieved 2023-06-01.
  11. Real Python. "Python's urllib.request for HTTP Requests – Real Python". realpython.com. Retrieved 2023-06-01.