Beautiful Soup (HTML parser)

Original author(s) Leonard Richardson
Initial release 2004
Stable release 4.12.3 [1] / 17 January 2024
Written in Python
Platform Python
Type HTML parser library, Web scraping
Website www.crummy.com/software/BeautifulSoup/

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It builds a parse tree from a document, which can then be used to extract data from HTML, [3] making it useful for web scraping. [2] [4]
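As a minimal sketch of what tolerating malformed markup can look like, the snippet below parses a fragment whose tags are never closed. The input string and the variable names are illustrative assumptions, not taken from the project's documentation.

```python
from bs4 import BeautifulSoup

# Malformed markup: neither the <p> nor the <b> element is closed.
broken_html = "<p>Hello, <b>world"

# 'html.parser' selects the parser that ships with Python's standard library.
soup = BeautifulSoup(broken_html, "html.parser")

# The parse tree is still navigable even though the input was incomplete.
print(soup.b.get_text())  # text inside the unclosed <b> element
print(soup.p.get_text())  # text of the enclosing <p> element
```

Other parsers that Beautiful Soup can use may repair the same input differently, so the resulting tree depends on the parser chosen.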

History

Beautiful Soup was started in 2004 by Leonard Richardson. [citation needed] It takes its name from the poem "Beautiful Soup" in Alice's Adventures in Wonderland [5] and is a reference to the term "tag soup", meaning poorly structured HTML code. [6] Richardson continues to contribute to the project, [7] which is additionally supported by paid open-source maintainers from the company Tidelift. [8]

Versions

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x.

Support for Python 2.7 was retired in 2021; release 4.9.3 was the last to support it. [9]

Usage

Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops. [10]
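The following sketch illustrates searching and iterating over the tree with ordinary Python constructs. The HTML document and all variable names are made up for illustration; only `find_all`, `find`, `.children`, and `get_text` are library calls.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical document used only for this illustration.
html = """
<html><body>
  <h1>Fruits</h1>
  <ul>
    <li class="fruit">Apple</li>
    <li class="fruit">Banana</li>
    <li class="veg">Carrot</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: find_all returns every matching element in document order.
fruits = [li.get_text() for li in soup.find_all("li", class_="fruit")]

# Iterate: .children yields the direct child nodes of an element.
# Whitespace between tags appears as text nodes, so keep only Tag objects.
child_names = [child.name for child in soup.ul.children if isinstance(child, Tag)]

# Navigate: find returns the first match (or None if nothing matches).
heading = soup.find("h1").get_text()

print(fruits)
print(child_names)
print(heading)
```

Because results are plain lists and iterators, they combine naturally with comprehensions, `for` loops, and built-ins such as `len` or `sorted`.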

Code example

The example below uses the Python standard library's urllib [11] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.

    #!/usr/bin/env python3
    # Anchor extraction from HTML document
    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
        soup = BeautifulSoup(response, 'html.parser')
        for anchor in soup.find_all('a'):
            print(anchor.get('href', '/'))
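The example above prints '/' for any anchor that lacks an href attribute, and prints relative links as they appear in the page. One possible variation, sketched here under stated assumptions, skips anchors without an href and resolves relative links against a base URL. The HTML snippet and `base_url` are hypothetical and stand in for a live request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and page content, used instead of fetching a real page.
base_url = "https://example.org/wiki/Main_Page"
html = """
<a href="/wiki/Python">Python</a>
<a href="https://example.com/">External</a>
<a name="top">No href here</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> elements that actually carry an href attribute.
# urljoin turns relative links into absolute ones and leaves absolute links alone.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```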

Another example uses the Python requests library [12] to fetch a page and print the text of every div element it contains.

    import requests
    from bs4 import BeautifulSoup

    url = 'https://wikipedia.com'
    response = requests.get(url)

    # Parse the response body and collect every <div> element.
    soup = BeautifulSoup(response.text, 'html.parser')
    divs = soup.find_all('div')
    for div in divs:
        print(div.text.strip())

References

  1. "Changelog". Retrieved 18 January 2024.
  2. "Beautiful Soup website". Retrieved 18 April 2012. Beautiful Soup is licensed under the same terms as Python itself
  3. Hajba, Gábor László (2018). "Using Beautiful Soup". Website Scraping with Python: Using BeautifulSoup and Scrapy. Apress. pp. 41–96. doi:10.1007/978-1-4842-3925-4_3. ISBN 978-1-4842-3925-4.
  4. Real Python. "Beautiful Soup: Build a Web Scraper With Python – Real Python". realpython.com. Retrieved 2023-06-01.
  5. makcorps (2022-12-13). "BeautifulSoup tutorial: Let's Scrape Web Pages with Python". Retrieved 2024-01-24.
  6. "Python Web Scraping". Udacity. 2021-02-11. Retrieved 2024-01-24.
  7. "Code : Leonard Richardson". Launchpad. Retrieved 2020-09-19.
  8. Tidelift. "beautifulsoup4 | pypi via the Tidelift Subscription". tidelift.com. Retrieved 2020-09-19.
  9. Richardson, Leonard (7 Sep 2021). "Beautiful Soup 4.10.0". beautifulsoup. Google Groups. Retrieved 27 September 2022.
  10. "How To Scrape Web Pages with Beautiful Soup and Python 3 | DigitalOcean". www.digitalocean.com. Retrieved 2023-06-01.
  11. Real Python. "Python's urllib.request for HTTP Requests – Real Python". realpython.com. Retrieved 2023-06-01.
  12. SerpApi Blog (5 March 2024). "Beautiful Soup: Web Scraping with Python". serpapi.com. Retrieved 2024-06-27.