Data scraping

Last updated

Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.

Contents

Description

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not human-readable at all.

Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done either to interface to a legacy system, which has no other mechanism which is compatible with current hardware, or to interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.

Data scraping is generally considered an ad hoc , inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program will fail. Depending on the quality and the extent of error handling logic present in the computer, this failure can result in error messages, corrupted output or even program crashes.

However, setting up a data scraping pipeline nowadays is straightforward, requiring minimal programming effort to meet practical needs (especially in biomedical data integration). [1]

Technical variants

Screen scraping

A screen fragment and a screen-scraping interface (blue box with red arrow) to customize data capture process. Screen-Scraping-OCRget.jpg
A screen fragment and a screen-scraping interface (blue box with red arrow) to customize data capture process.

Although the use of physical "dumb terminal" IBM 3270s is slowly diminishing, as more and more mainframe applications acquire Web interfaces, some Web applications merely continue to use the technique of screen scraping to capture old screens and transfer the data to modern front-ends. [2]

Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple cases where the controlling program navigates through the user interface, or more complex scenarios where the controlling program is entering data into an interface meant to be used by a human.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s—the dawn of computerized data processing. Computer to user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use today, for various reasons). The desire to interface such a system to more modern systems is common. A robust solution will often require things no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise—e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management, etc.—could be said to be an example of robotic process automation software, called RPA or RPAAI for self-guided RPA 2.0 based on artificial intelligence.

In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer. [3]

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or for some specialised automated testing systems, matching the screen's bitmap data against expected results. [4] This can be combined in the case of GUI applications, with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database.

Another modern adaptation to these techniques is to use, instead of a sequence of screens as input, a set of images or PDF files, so there are some overlaps with generic "document scraping" and report mining techniques.

There are many tools that can be used for screen scraping. [5]

Web scraping

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API or tool to extract data from a website. [6] Companies like Amazon AWS and Google provide web scraping tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the webserver. A web scraper uses a website's URL to extract data, and stores this data for subsequent analysis. This method of web scraping enables the extraction of data in an efficient and accurate manner. [7]

Recently, companies have developed web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate the human processing that occurs when viewing a webpage to automatically extract useful information. [8] [9]

Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP or IP network may send. This has caused an ongoing battle between website developers and scraping developers. [10]

Report mining

Report mining is the extraction of data from human-readable computer reports. Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying. By using the source system's standard reporting options, and directing the output to a spool file instead of to a printer, static reports can be generated suitable for offline analysis via report mining. [11] This approach can avoid intensive CPU usage during business hours, can minimise end-user licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.

See also

Related Research Articles

Computer programming or coding is the composition of sequences of instructions, called programs, that computers can follow to perform tasks. It involves designing and implementing algorithms, step-by-step specifications of procedures, by writing code in one or more programming languages. Programmers typically use high-level programming languages that are more easily intelligible to humans than machine code, which is directly executed by the central processing unit. Proficient programming usually requires expertise in several different subjects, including knowledge of the application domain, details of programming languages and generic code libraries, specialized algorithms, and formal logic.

<span class="mw-page-title-main">User interface</span> Means by which a user interacts with and controls a machine

In the industrial design field of human–computer interaction, a user interface (UI) is the space where interactions between humans and machines occur. The goal of this interaction is to allow effective operation and control of the machine from the human end, while the machine simultaneously feeds back information that aids the operators' decision-making process. Examples of this broad concept of user interfaces include the interactive aspects of computer operating systems, hand tools, heavy machinery operator controls and process controls. The design considerations applicable when creating user interfaces are related to, or involve such disciplines as, ergonomics and psychology.

<span class="mw-page-title-main">Optical character recognition</span> Computer recognition of visual text

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.

<span class="mw-page-title-main">Extract, transform, load</span> Procedure in computing

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted from an input source, transformed, and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on reccurring schedules either as single jobs or aggregated into a batch of jobs.

<span class="mw-page-title-main">Computer terminal</span> Computer input/output device for users

A computer terminal is an electronic or electromechanical hardware device that can be used for entering data into, and transcribing data from, a computer or a computing system. The teletype was an example of an early-day hard-copy terminal and predated the use of a computer screen by decades. Starting in the mid-1970s with machines such as the Sphere 1, Sol-20, and Apple I, terminal circuitry began to be integrated into personal and workstation computer systems, with the computer handling character generation and outputting to a CRT display such as a computer monitor or, sometimes, a consumer TV.

<span class="mw-page-title-main">Screen reader</span> Assistive technology that converts text or images to speech or Braille

A screen reader is a form of assistive technology (AT) that renders text and image content as speech or braille output. Screen readers are essential to people who are blind, and are useful to people who are visually impaired, illiterate, or have a learning disability. Screen readers are software applications that attempt to convey what people with normal eyesight see on a display to their users via non-visual means, like text-to-speech, sound icons, or a braille device. They do this by applying a wide variety of techniques that include, for example, interacting with dedicated accessibility APIs, using various operating system features, and employing hooking techniques.

In software engineering, the terms frontend and backend refer to the separation of concerns between the presentation layer (frontend), and the data access layer (backend) of a piece of software, or the physical infrastructure or hardware. In the client–server model, the client is usually considered the frontend and the server is usually considered the backend, even when some presentation work is actually done on the server itself.

<span class="mw-page-title-main">Text-based user interface</span> Type of interface based on outputting to or controlling a text display

In computing, text-based user interfaces (TUI), is a retronym describing a type of user interface (UI) common as an early form of human–computer interaction, before the advent of bitmapped displays and modern conventional graphical user interfaces (GUIs). Like modern GUIs, they can use the entire screen area and may accept mouse and other inputs. They may also use color and often structure the display using box-drawing characters such as ┌ and ╣. The modern context of use is usually a terminal emulator.

In software testing, test automation is the use of software separate from the software being tested to control the execution of tests and the comparison of actual outcomes with predicted outcomes. Test automation can automate some repetitive but necessary tasks in a formalized testing process already in place, or perform additional testing that would be difficult to do manually. Test automation is critical for continuous delivery and continuous testing.

Scrape, scraper or scraping may refer to:

<span class="mw-page-title-main">Shell (computing)</span> Computer program that exposes an operating systems services to a human user or other programs

In computing, a shell is a computer program that exposes an operating system's services to a human user or other programs. In general, operating system shells use either a command-line interface (CLI) or graphical user interface (GUI), depending on a computer's role and particular operation. It is named a shell because it is the outermost layer around the operating system.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Content migration is the process of moving information stored on a given computer information system (IS) to a new system. The IS may be a Web content management system (CMS), a digital asset management (DAM), or a document management system (DMS). The IS may also be based on flat HTML content, including HTML files, Active Server Pages (ASP), JavaServer Pages (JSP), PHP, or content stored in some type of HTML/JavaScript based system and can be either static or dynamic content.

<span class="mw-page-title-main">API</span> Software interface between computer programs

An application programming interface (API) is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how to build or use such a connection or interface is called an API specification. A computer system that meets this standard is said to implement or expose an API. The term API may refer either to the specification or to the implementation. Whereas a system's user interface dictates how its end-users interact with the system in question, its API dictates how to write code that takes advantage of that system's capabilities.

<span class="mw-page-title-main">Command-line interface</span> Computer interface that uses text

A command-line interface (CLI) is a means of interacting with a computer program by inputting lines of text called command-lines. Command-line interfaces emerged in the mid-1960s, on computer terminals, as an interactive and more user-friendly alternative to the non-interactive interface available with punched cards.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

The following outline is provided as an overview of and topical guide to natural-language processing:

OutWit Hub is a Web data extraction software application designed to automatically extract information from online or local resources. It recognizes and grabs links, images, documents, contacts, recurring vocabulary and phrases, rss feeds and converts structured and unstructured data into formatted tables which can be exported to spreadsheets or databases. The first version was released in 2010. The current version (9.0) is available for Windows 10 & Windows 11, Linux and MacOS 10.

Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines. This is a specific form of screen scraping or web scraping dedicated to search engines only.

References

  1. Glez-Peña, Daniel (April 30, 2013). "Web scraping technologies in an API world". Briefings in Bioinformatics. 15 (5): 788–797. doi:10.1093/bib/bbt026. hdl: 1822/32460 . PMID   23632294.
  2. "Back in the 1990s.. 2002 ... 2016 ... still, according to Chase Bank, a major issue. Ron Lieber (May 7, 2016). "Jamie Dimon Wants to Protect You From Innovative Start-Ups". The New York Times .
  3. Contributors Fret About Reuters' Plan To Switch From Monitor Network To IDN, FX Week, 02 Nov 1990
  4. Yeh, Tom (2009). "Sikuli: Using GUI Screenshots for Search and Automation" (PDF). UIST. Archived from the original (PDF) on 2010-02-14. Retrieved 2015-02-16.
  5. "What is Screen Scraping". June 17, 2019.
  6. Thapelo, Tsaone Swaabow; Namoshe, Molaletsa; Matsebe, Oduetse; Motshegwa, Tshiamo; Bopape, Mary-Jane Morongwa (2021-07-28). "SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data". Data Science Journal. 20: 24. doi: 10.5334/dsj-2021-024 . ISSN   1683-1470. S2CID   237719804.
  7. Singrodia, Vidhi; Mitra, Anirban; Paul, Subrata (2019-01-23). "A Review on Web Scrapping and its Applications". 2019 International Conference on Computer Communication and Informatics (ICCCI). IEEE. pp. 1–6. doi:10.1109/ICCCI.2019.8821809. ISBN   978-1-5386-8260-9.
  8. Metz, Rachel (June 1, 2012). "A Startup Hopes to Help Computers Understand Web Pages". MIT Technology Review. Retrieved 1 December 2014.
  9. VanHemert, Kyle (Mar 4, 2014). "This Simple Data-Scraping Tool Could Change How Apps Are Made". WIRED. Archived from the original on 11 May 2015. Retrieved 8 May 2015.
  10. ""Unusual traffic from your computer network"". Google Search Help. Retrieved 2017-04-04.
  11. Scott Steinacher, "Data Pump transforms host data", InfoWorld , 30 August 1999, p55

Further reading