Beautiful Soup (HTML parser)

Beautiful Soup
Original author(s) Leonard Richardson
Initial release 2004
Stable release 4.12.3 [1] / 17 January 2024
Repository
Written in Python
Platform Python
Type HTML parser library, Web scraping
License
Website www.crummy.com/software/BeautifulSoup/

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML, [3] which is useful for web scraping. [2] [4]

History

Beautiful Soup was started in 2004 by Leonard Richardson. [citation needed] It takes its name from the poem "Beautiful Soup" in Alice's Adventures in Wonderland [5] and refers to the term "tag soup", meaning poorly structured HTML code. [6] Richardson continues to contribute to the project, [7] which is additionally supported by paid open-source maintainers from the company Tidelift. [8]

Versions

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x.

Support for Python 2.7 was retired in 2021; release 4.9.3 was the last to support it. [9]

Usage

Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops. [10]
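
As a minimal sketch of searching and iterating over the tree, the snippet below parses a small inline document rather than a live page. The sample HTML and the variable names are hypothetical and chosen only for illustration; find_all, children and get_text are part of the bs4 API.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample document, used only for illustration.
html = """
<html><body>
  <h1>Soups</h1>
  <ul>
    <li><a href="/miso">Miso</a></li>
    <li><a href="/pho">Pho</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: find_all returns every matching tag in document order.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

# Iterate: .children yields the direct child nodes of a tag. These
# include whitespace-only strings, so keep only the Tag objects.
items = []
for child in soup.ul.children:
    if isinstance(child, Tag):
        items.append(child.get_text(strip=True))

print(links)  # [('Miso', '/miso'), ('Pho', '/pho')]
print(items)  # ['Miso', 'Pho']
```

The isinstance check is one way to skip the text nodes between elements; calling find_all("li", recursive=False) on the list element would be an alternative.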

Code example

The example below uses the Python standard library's urllib [11] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.

    #!/usr/bin/env python3
    # Anchor extraction from HTML document
    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    with urlopen("https://en.wikipedia.org/wiki/Main_Page") as response:
        soup = BeautifulSoup(response, "html.parser")
        for anchor in soup.find_all("a"):
            print(anchor.get("href", "/"))

A second example uses the Python requests library [12] to fetch a page and print the text of every div element in it.

    import requests
    from bs4 import BeautifulSoup

    url = "https://wikipedia.com"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Find every <div> element and print its stripped text content
    divs = soup.find_all("div")
    for div in divs:
        print(div.text.strip())
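
Both examples above fetch live pages. As an offline illustration of the malformed-markup handling mentioned in the introduction, the following sketch parses a fragment whose tags are never closed. The fragment and variable names are hypothetical, and the exact repaired tree may differ between parsers; this sketch assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is ever closed.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")

# Despite the missing end tags, both elements appear in the tree
# and can be queried like any other tag.
bold_text = soup.b.get_text()
para_text = soup.p.get_text()

print(bold_text)  # world
print(para_text)  # Hello world

# prettify() shows the tree Beautiful Soup built from the fragment.
print(soup.prettify())
```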

References

  1. "Changelog". Retrieved 18 January 2024.
  2. "Beautiful Soup website". Retrieved 18 April 2012. Beautiful Soup is licensed under the same terms as Python itself
  3. Hajba, Gábor László (2018), Hajba, Gábor László (ed.), "Using Beautiful Soup", Website Scraping with Python: Using BeautifulSoup and Scrapy, Apress, pp. 41–96, doi:10.1007/978-1-4842-3925-4_3, ISBN 978-1-4842-3925-4
  4. Python, Real. "Beautiful Soup: Build a Web Scraper With Python – Real Python". realpython.com. Retrieved 1 June 2023.
  5. makcorps (13 December 2022). "BeautifulSoup tutorial: Let's Scrape Web Pages with Python". Retrieved 24 January 2024.
  6. "Python Web Scraping". Udacity. 11 February 2021. Retrieved 24 January 2024.
  7. "Code : Leonard Richardson". Launchpad. Retrieved 19 September 2020.
  8. Tidelift. "beautifulsoup4 | pypi via the Tidelift Subscription". tidelift.com. Retrieved 19 September 2020.
  9. Richardson, Leonard (7 September 2021). "Beautiful Soup 4.10.0". beautifulsoup. Google Groups. Retrieved 27 September 2022.
  10. "How To Scrape Web Pages with Beautiful Soup and Python 3 | DigitalOcean". www.digitalocean.com. Retrieved 1 June 2023.
  11. Python, Real. "Python's urllib.request for HTTP Requests – Real Python". realpython.com. Retrieved 1 June 2023.
  12. Blog, SerpApi (5 March 2024). "Beautiful Soup: Web Scraping with Python". serpapi.com. Retrieved 27 June 2024.