| Original author(s) | Leonard Richardson |
| --- | --- |
| Initial release | 2004 |
| Stable release | |
| Repository | |
| Written in | Python |
| Platform | Python |
| Type | HTML parser library, Web scraping |
| License | |
| Website | www |
Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,[3] which is useful for web scraping.[2][4]
Beautiful Soup was started in 2004 by Leonard Richardson.[citation needed] It takes its name from the poem "Beautiful Soup" in Alice's Adventures in Wonderland[5] and is a reference to the term "tag soup", meaning poorly structured HTML code.[6] Richardson continues to contribute to the project,[7] which is additionally supported by paid open-source maintainers from the company Tidelift.[8]
Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x.
Support for Python 2.7 was retired in 2021, with release 4.9.3 being the last to support it.[9]
Beautiful Soup represents parsed data as a tree, which can be searched and iterated over with ordinary Python loops.[10]
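For illustration, the following minimal sketch (using a short HTML fragment invented for the example) shows how a parsed tree can be navigated and looped over:

```python
from bs4 import BeautifulSoup

# A small, hypothetical HTML fragment used only for illustration.
html = """
<html><body>
  <p class="title">Example page</p>
  <ul>
    <li><a href="/one">One</a></li>
    <li><a href="/two">Two</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Individual elements are reachable as attributes of the tree.
print(soup.p.get_text())  # prints "Example page"

# The tree can be searched and the results iterated with an ordinary loop.
for item in soup.find_all('li'):
    link = item.find('a')
    print(link['href'], link.get_text())
```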
The example below uses the Python standard library's urllib[11] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.
```python
#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
```
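The 'html.parser' argument selects the HTML parser included in the Python standard library; if installed, third-party parsers such as lxml or html5lib can be passed instead.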
Another example uses the Python requests library[12] to fetch a URL and print the text of each div element on the page.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://wikipedia.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect every div element on the page and print its stripped text.
divs = soup.find_all('div')
for div in divs:
    print(div.text.strip())
```
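Beyond matching on a tag name, find_all() also accepts attribute filters, and the parse tree can be queried with CSS selectors through select(). The sketch below is illustrative only, reusing the Wikipedia main page from the earlier example:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Main_Page'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# find_all() with an attribute filter: only anchors that carry an href.
for anchor in soup.find_all('a', href=True):
    print(anchor['href'])

# The equivalent search expressed as a CSS selector.
for anchor in soup.select('a[href]'):
    print(anchor['href'])
```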
Beautiful Soup is licensed under the same terms as Python itself.