Search engine scraping

Last updated

Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines. This is a specific form of screen scraping or web scraping dedicated to search engines only.

Contents

Most commonly larger search engine optimization (SEO) providers depend on regularly scraping keywords from search engines to monitor the competitive position of their customers' websites for relevant keywords or their indexing status.

The process of entering a website and extracting data in an automated fashion is also often called "crawling". Search engines get almost all their data from automated crawling bots.

Difficulties

Google is by far the largest search engine with most users in numbers as well as most revenue in creative advertisements, which makes Google the most important search engine to scrape for SEO related companies. [1]

Although Google does not take legal action against scraping, it uses a range of defensive methods that makes scraping their results a challenging task, even when the scraping tool is realistically spoofing a normal web browser:

Detection

When search engine defense thinks an access might be automated, the search engine can react differently.

The first layer of defense is a captcha page [4] where the user is prompted to verify they are a real person and not a bot or tool. Solving the captcha will create a cookie that permits access to the search engine again for a while. After about one day, the captcha page is displayed again.

The second layer of defense is a similar error page but without captcha, in such a case the user is completely blocked from using the search engine until the temporary block is lifted, or the user changes their IP.

The third layer of defense is a long-term block of the entire network segment. Google has blocked large network blocks for months. This sort of block is likely triggered by an administrator and only happens if a scraping tool is sending a very high number of requests.

All these forms of detection may also happen to a normal user, especially users sharing the same IP address or network class (IPV4 ranges as well as IPv6 ranges).

Methods of scraping

To scrape a search engine successfully, the two major factors are time and amount.

The more keywords a user needs to scrape and the smaller the time for the job, the more difficult scraping will be and the more developed a scraping script or tool needs to be.

Scraping scripts need to overcome a few technical challenges:[ citation needed ]

Programming languages

When developing a scraper for a search engine, almost any programming language can be used. Although, depending on performance requirements, some languages will be favorable.

PHP is a commonly used language to write scraping scripts for websites or backend services, since it has powerful capabilities built-in (DOM parsers, libcURL); however, its memory usage is typically 10 times the factor of a similar C/C++ code. Ruby on Rails as well as Python are also frequently used to automated scraping jobs.

Additionally, bash scripting can be used together with cURL as a command line tool to scrape a search engine.

When scraping websites and services the legal part is often a big concern for companies, for web scraping it greatly depends on the country a scraping user/company is from as well as which data or website is being scraped. With many different court rulings all over the world. [5] [6]

However, when it comes to scraping search engines the situation is different, search engines usually do not list intellectual property as they just repeat or summarize information they scraped from other websites.

The largest public known incident of a search engine being scraped happened in 2011 when Microsoft was caught scraping unknown keywords from Google for their own, rather new Bing service, [7] but even this incident did not result in a court case.

See also

Related Research Articles

Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page's head section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head elements and attributes.

<span class="mw-page-title-main">World Wide Web</span> Linked hypertext system on the Internet

The World Wide Web is an information system that enables content sharing over the Internet through user-friendly ways meant to appeal to users beyond IT specialists and hobbyists. It allows documents and other web resources to be accessed over the Internet according to specific rules of the Hypertext Transfer Protocol (HTTP).

Spamdexing is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system.

<span class="mw-page-title-main">Proxy server</span> Computer server that makes and receives requests on behalf of a user

In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource. It improves privacy, security, and performance in the process.

A CAPTCHA is a type of challenge–response test used in computing to determine whether the user is human in order to deter bot attacks and spam.

Cross-site scripting (XSS) is a type of security vulnerability that can be found in some web applications. XSS attacks enable attackers to inject client-side scripts into web pages viewed by other users. A cross-site scripting vulnerability may be used by attackers to bypass access controls such as the same-origin policy. During the second half of 2007, XSSed documented 11,253 site-specific cross-site vulnerabilities, compared to 2,134 "traditional" vulnerabilities documented by Symantec. XSS effects vary in range from petty nuisance to significant security risk, depending on the sensitivity of the data handled by the vulnerable site and the nature of any security mitigation implemented by the site's owner network.

<span class="mw-page-title-main">Googlebot</span> Web crawler used by Google

Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. This name is actually used to refer to two different types of web crawlers: a desktop crawler and a mobile crawler.

Cloaking is a search engine optimization (SEO) technique in which the content presented to the search engine spider is different from that presented to the user's browser. This is done by delivering content based on the IP addresses or the User-Agent HTTP header of the user requesting the page. When a user is identified as a search engine spider, a server-side script delivers a different version of the web page, one that contains content not present on the visible page, or that is present but not searchable. The purpose of cloaking is sometimes to deceive search engines so they display the page when it would not otherwise be displayed. However, it can also be a functional technique for informing search engines of content they would not otherwise be able to locate because it is embedded in non-textual containers, such as video or certain Adobe Flash components. Since 2006, better methods of accessibility, including progressive enhancement, have been available, so cloaking is no longer necessary for regular SEO.

In software testing, test automation is the use of software separate from the software being tested to control the execution of tests and the comparison of actual outcomes with predicted outcomes. Test automation can automate some repetitive but necessary tasks in a formalized testing process already in place, or perform additional testing that would be difficult to do manually. Test automation is critical for continuous delivery and continuous testing.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

A proxy auto-config (PAC) file defines how web browsers and other user agents can automatically choose the appropriate proxy server for fetching a given URL.

A search engine results page (SERP) is a webpage that is displayed by a search engine in response to a query by a user. The main component of a SERP is the listing of results that are returned by the search engine in response to a keyword query.

<span class="mw-page-title-main">HTTP referer</span> HTTP header field

In HTTP, "Referer" is an optional HTTP header field that identifies the address of the web page, from which the resource has been requested. By checking the referrer, the server providing the new web page can see where the request originated.

<span class="mw-page-title-main">Geotargeting</span> Website content based on a visitors location

In geomarketing and internet marketing, geotargeting is the method of delivering different content to visitors based on their geolocation. This includes country, region/state, city, metro code/zip code, organization, IP address, ISP, or other criteria. A common usage of geotargeting is found in online advertising, as well as internet television with sites such as iPlayer and Hulu. In these circumstances, content is often restricted to users geolocated in specific countries; this approach serves as a means of implementing digital rights management. Use of proxy servers and virtual private networks may give a false location.

Forum spam consists of posts on Internet forums that contains related or unrelated advertisements, links to malicious websites, trolling and abusive or otherwise unwanted information. Forum spam is usually posted onto message boards by automated spambots or manually with unscrupulous intentions with intent to get the spam in front of readers who would not otherwise have anything to do with it intentionally.

Email spammers have developed a variety of ways to deliver email spam throughout the years, such as mass-creating accounts on services such as Hotmail or using another person's network to send email spam. Many techniques to block, filter, or otherwise remove email spam from inboxes have been developed by internet users, system administrators and internet service providers. Due to this, email spammers have developed their own techniques to send email spam, which are listed below.

XRumer is a piece of software made for spamming online forums and comment sections. It is marketed as a program for search engine optimization and was created by BotmasterLabs. It is able to register and post to forums with the aim of boosting search engine rankings. The program is able to bypass security techniques commonly used by many forums and blogs to deter automated spam, such as account registration, client detection, many forms of CAPTCHAs, and e-mail activation before posting. The program utilises SOCKS and HTTP proxies in an attempt to make it more difficult for administrators to block posts by source IP, and features a proxy checking tool to verify the integrity and anonymity of the proxies used.

Internet censorship circumvention, also referred to as going over the wall or scientific browsing in China, is the use of various methods and tools to bypass internet censorship.

Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.

<span class="mw-page-title-main">Searx</span> Metasearch engine

Searx is a free and open-source metasearch engine, available under the GNU Affero General Public License version 3, with the aim of protecting the privacy of its users. To this end, Searx does not share users' IP addresses or search history with the search engines from which it gathers results. Tracking cookies served by the search engines are blocked, preventing user-profiling-based results modification. By default, Searx queries are submitted via HTTP POST, to prevent users' query keywords from appearing in webserver logs. Searx was inspired by the Seeks project, though it does not implement Seeks' peer-to-peer user-sourced results ranking.

References

  1. "Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly". searchengineland.com. 11 February 2013.
  2. "Does Google know that I am using Tor Browser?". tor.stackexchange.com.
  3. "Google Groups". google.com.
  4. "My computer is sending automated queries – reCAPTCHA Help". support.google.com. Retrieved 2017-04-02.
  5. "Appeals court reverses hacker/troll "weev" conviction and sentence [Updated]". arstechnica.com. 11 April 2014.
  6. "Can Scraping Non-Infringing Content Become Copyright Infringement... Because Of How Scrapers Work?". www.techdirt.com. 10 June 2009.
  7. Singel, Ryan. "Google Catches Bing Copying; Microsoft Says 'So What?'". Wired.