| Original author(s) | Leonard Richardson |
|---|---|
| Initial release | 2004 |
| Stable release | |
| Repository | |
| Written in | Python |
| Platform | Python |
| Type | HTML parser library, web scraping |
| License | |
| Website | www |
Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,[3] which is useful for web scraping.[2][4]
Beautiful Soup was started in 2004 by Leonard Richardson.[citation needed] It takes its name from the poem "Beautiful Soup" in Alice's Adventures in Wonderland[5] and is a reference to the term "tag soup", meaning poorly structured HTML code.[6] Richardson continues to contribute to the project,[7] which is additionally supported by paid open-source maintainers from the company Tidelift.[8]
Beautiful Soup 3 was the official release line from May 2006 to March 2012. The current release series is Beautiful Soup 4.x.
Support for Python 2.7 was retired in 2021; release 4.9.3 was the last version to support Python 2.7.[9]
Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops. [10]
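As a brief illustration of that tree model, the following sketch parses a small, invented HTML fragment and walks it with attribute access and an ordinary loop.

```python
from bs4 import BeautifulSoup

# A small, hypothetical HTML fragment used only for illustration
html = """
<html><body>
  <h1>Site</h1>
  <ul>
    <li><a href="/one">One</a></li>
    <li><a href="/two">Two</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first matching tag
print(soup.h1.string)  # "Site"

# The tree can be walked with ordinary Python loops
for li in soup.ul.find_all("li"):
    print(li.a["href"], li.a.get_text())
```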
The example below uses the Python standard library's urllib [11] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.
```python
#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

with urlopen("https://en.wikipedia.org/wiki/Main_Page") as response:
    soup = BeautifulSoup(response, "html.parser")

for anchor in soup.find_all("a"):
    print(anchor.get("href", "/"))
```
Another example uses the Python requests library[12] to fetch a page and print the text of every div element on it.
```python
import requests
from bs4 import BeautifulSoup

url = "https://wikipedia.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

headings = soup.find_all("div")
for heading in headings:
    print(heading.text.strip())
```
In computing, Common Gateway Interface (CGI) is an interface specification that enables web servers to execute an external program to process HTTP or HTTPS user requests.
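As a rough sketch of that interface, a CGI program reads request data from environment variables set by the web server and writes an HTTP response to standard output. The minimal script below is a hypothetical example, not tied to any particular server configuration.

```python
#!/usr/bin/env python3
# Minimal CGI-style script (hypothetical example): the web server sets
# environment variables such as QUERY_STRING and captures stdout as the response.
import html
import os

query = os.environ.get("QUERY_STRING", "")

print("Content-Type: text/html")
print()  # a blank line separates headers from the body
print(f"<html><body><p>Query was: {html.escape(query)}</p></body></html>")
```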
Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.
A bookmarklet is a bookmark stored in a web browser that contains JavaScript commands adding new features to the browser. Bookmarklets are stored as the URL of a bookmark or as a hyperlink on a web page, and are usually small snippets of JavaScript executed when the user clicks on them. When clicked, a bookmarklet can perform a wide variety of operations, such as running a search query from selected text or extracting data from a table.
An HTML element is a type of HTML document component, one of several types of HTML nodes. The first used version of HTML was written by Tim Berners-Lee in 1993 and there have since been many versions of HTML. The current de facto standard is governed by the industry group WHATWG and is known as the HTML Living Standard.
YAML is a human-readable data serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses Python-style indentation to indicate nesting and does not require quotes around most string values.
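As an illustration of that indentation-based syntax, the sketch below parses a small, invented YAML document from Python; it assumes the third-party PyYAML package is installed.

```python
import yaml  # third-party PyYAML package (assumed installed)

config = """
server:
  host: example.org
  port: 8080
  tags:
    - web
    - staging
"""

# Nesting is expressed purely through indentation; most strings need no quotes
data = yaml.safe_load(config)
print(data["server"]["host"])  # example.org
print(data["server"]["tags"])  # ['web', 'staging']
```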
In web development, "tag soup" is a pejorative for HTML written for a web page that is syntactically or structurally incorrect. Web browsers have historically treated structural or syntax errors in HTML leniently, so there has been little pressure for web developers to follow published standards. Therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.
A lightweight markup language (LML), also termed a simple or humane markup language, is a markup language with simple, unobtrusive syntax. It is designed to be easy to write using any generic text editor and easy to read in its raw form. Lightweight markup languages are used in applications where it may be necessary to read the raw document as well as the final rendered output.
The anchor text, link label, or link text is the visible, clickable text in an HTML hyperlink. The term "anchor" was used in older versions of the HTML specification for what is currently referred to as the "a element", or <a>. The HTML specification does not have a specific term for anchor text, but refers to it as "text that the a element wraps around". In XML terms, the anchor text is the content of the element, provided that the content is text.
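In Beautiful Soup, the anchor text of a link is simply the text content of the a tag; a brief sketch on an invented fragment:

```python
from bs4 import BeautifulSoup

html = '<p>Read the <a href="/docs">documentation</a> for details.</p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
print(link["href"])     # /docs          (the link target)
print(link.get_text())  # documentation  (the anchor text)
```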
A dynamic web page is a web page constructed at runtime, as opposed to a static web page, delivered as it is stored.
In computer hypertext, a URI fragment is a string of characters that refers to a resource that is subordinate to another, primary resource. The primary resource is identified by a Uniform Resource Identifier (URI), and the fragment identifier points to the subordinate resource.
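In Python, the standard library's urllib.parse can separate the fragment from the rest of a URI; a small sketch with an invented URL:

```python
from urllib.parse import urlparse, urldefrag

url = "https://example.org/page.html#section-2"

# urlparse exposes the fragment as a named attribute
parts = urlparse(url)
print(parts.fragment)  # section-2

# urldefrag splits a URL into the primary resource and its fragment
base, fragment = urldefrag(url)
print(base, fragment)  # https://example.org/page.html section-2
```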
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
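A typical automated pipeline of this kind fetches a page, extracts fields, and copies them into a local file for later analysis. The sketch below combines requests, Beautiful Soup, and the standard csv module; the target URL and page structure are invented for illustration.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical target page; the URL and its markup structure are assumptions
response = requests.get("https://example.org/articles")
soup = BeautifulSoup(response.text, "html.parser")

# Copy link text and targets into a local CSV file for later analysis
with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    for anchor in soup.find_all("a", href=True):
        writer.writerow([anchor.get_text(strip=True), anchor["href"]])
```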
hCard is a microformat for publishing the contact details of people, companies, organizations, and places, in HTML, Atom, RSS, or arbitrary XML. The hCard microformat does this using a 1:1 representation of vCard properties and values, identified using HTML classes and rel attributes.
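Because hCard data is expressed through ordinary HTML class attributes, it can be read with the same class-based searches used for any other markup. A sketch on an invented fragment, with class names following the hCard convention:

```python
from bs4 import BeautifulSoup

html = """
<div class="vcard">
  <span class="fn">Jane Doe</span>
  <span class="org">Example Corp</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

card = soup.find(class_="vcard")
print(card.find(class_="fn").get_text())   # Jane Doe
print(card.find(class_="org").get_text())  # Example Corp
```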
RDFa or Resource Description Framework in Attributes is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The Resource Description Framework (RDF) data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.
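Since RDFa is carried in ordinary element attributes, an HTML parser can collect it by matching on attribute presence. A sketch on an invented fragment (the vocabulary URL and values are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<div vocab="https://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
  <a property="url" href="https://example.org/jane">homepage</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find every element carrying an RDFa "property" attribute
for tag in soup.find_all(attrs={"property": True}):
    print(tag["property"], "->", tag.get("href") or tag.get_text(strip=True))
```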
Geo is a microformat used for marking up geographical coordinates in HTML. Coordinates are expected in angular units of degrees and geodetic datum WGS84. Although termed a "draft" specification, the format is a de facto standard, stable and in widespread use; not least as a sub-set of the published hCalendar and hCard microformat specifications, neither of which is still a draft.
Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages which mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated.
HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes: offering an interface for traversing and modifying parsed documents, and cleaning up invalid markup.
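Beautiful Soup delegates the actual parsing to an underlying HTML parser chosen by name. The sketch below tries several of them on the same malformed fragment; "html.parser" ships with Python, while "lxml" and "html5lib" are optional third-party parsers that must be installed separately.

```python
from bs4 import BeautifulSoup

broken = "<p>One<p>Two"

# Different underlying parsers may repair malformed markup differently
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
        print(parser, "->", str(soup))
    except Exception as exc:  # e.g. the parser is not installed
        print(parser, "skipped:", exc)
```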
Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.
Vue.js is an open-source model–view–viewmodel front end JavaScript framework for building user interfaces and single-page applications. It was created by Evan You and is maintained by him and the rest of the active core team members.
Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines. This is a specific form of screen scraping or web scraping dedicated to search engines only.
Beautiful Soup is licensed under the same terms as Python itself.