Field | Value |
---|---|
Initial release | May 27, 2007 |
Type | Classic version: CAPTCHA; new version: behavioral analysis |
Website | google |
reCAPTCHA Inc. [1] is a CAPTCHA system owned by Google. It enables web hosts to distinguish between human and automated access to websites. The original version asked users to decipher hard-to-read text or match images. Version 2 also asked users to decipher text or match images if the analysis of cookies and canvas rendering suggested the page was being downloaded automatically. [2] Since version 3, reCAPTCHA will never interrupt users and is intended to run automatically when users load pages or click buttons. [3]
The original iteration of the service was a mass collaboration platform designed for the digitization of books, particularly those that were too illegible to be read by computers. The verification prompts used pairs of words from scanned pages: one known word served as a control for verification, and the second was used to crowdsource the reading of an uncertain word. [4] reCAPTCHA was originally developed by Luis von Ahn, David Abraham, Manuel Blum, Michael Crawford, Ben Maurer, Colin McMillen, and Edison Tan at Carnegie Mellon University's main Pittsburgh campus. [5] It was acquired by Google in September 2009. [6] The system helped to digitize the archives of The New York Times, and was subsequently used by Google Books for similar purposes. [7]
The system was reported as displaying over 100 million CAPTCHAs every day, [8] on sites such as Facebook, TicketMaster, Twitter, 4chan, CNN.com, StumbleUpon, [9] Craigslist (since June 2008), [10] and the U.S. National Telecommunications and Information Administration's digital TV converter box coupon program website (as part of the US DTV transition). [11]
In 2014, Google pivoted the service away from its original concept, with a focus on reducing the amount of user interaction needed to verify a user, and only presenting human recognition challenges (such as identifying images in a set that satisfy a specific prompt) if behavioral analysis suspects that the user may be a bot.
In October 2023, it was found that OpenAI's GPT-4 chatbot could solve CAPTCHAs. [12]
Distributed Proofreaders was the first project to volunteer its time to decipher scanned text that could not be read by optical character recognition (OCR) programs. It works with Project Gutenberg to digitize public domain material and uses methods quite different from reCAPTCHA.
The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn, [13] and was aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles". [14]
Scanned text is analyzed by two different OCR programs. Any word that the two programs decipher differently, or that is not in an English dictionary, is marked as "suspicious" and converted into a CAPTCHA. The suspicious word is displayed out of context, sometimes along with a control word that is already known. If the human types the control word correctly, the response to the questionable word is accepted as probably valid. If enough users type the control word correctly but mistype the second word, which OCR had failed to recognize, the digital version of the document could end up containing the incorrect word. The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered valid. Words that human judges consistently give a single identity are later recycled as control words. [15] If the first three guesses match each other but do not match either of the OCR readings, they are considered a correct answer, and the word becomes a control word. [16] When six users reject a word before any correct spelling is chosen, the word is discarded as unreadable. [16]
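The scoring rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code reCAPTCHA actually ran: the function name `resolve_word`, the constants, and the use of `None` to model a user rejecting the word are all hypothetical. Note that three matching human answers alone reach 3.0 points, so the "first three guesses" rule falls out of the 2.5-point threshold.

```python
from collections import defaultdict

# Weights and limits taken from the description above.
OCR_WEIGHT = 0.5        # each OCR program's reading
HUMAN_WEIGHT = 1.0      # each human interpretation
ACCEPT_THRESHOLD = 2.5  # points needed for a reading to be considered valid
MAX_REJECTIONS = 6      # rejections before the word is discarded


def resolve_word(ocr_readings, human_answers):
    """Return ("accepted", word), ("unreadable", None) or ("pending", None).

    ocr_readings: the readings produced by the two OCR programs.
    human_answers: answers in arrival order; None is a hypothetical way of
    modelling a user who rejects the word as unreadable.
    """
    scores = defaultdict(float)
    for reading in ocr_readings:
        scores[reading] += OCR_WEIGHT

    rejections = 0
    for answer in human_answers:
        if answer is None:
            rejections += 1
            if rejections >= MAX_REJECTIONS:
                return ("unreadable", None)
            continue
        scores[answer] += HUMAN_WEIGHT
        if scores[answer] >= ACCEPT_THRESHOLD:
            return ("accepted", answer)

    return ("pending", None)
```

For example, if one OCR program reads "morning" and two users agree with it, that reading reaches 0.5 + 1 + 1 = 2.5 points and is accepted; if neither OCR reading is right, three agreeing users are needed.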
The original reCAPTCHA method was designed to show the questionable words separately, out of context, rather than in use, for example within a phrase of five words from the original document. [17] The control word could also suggest a misleading context for the second word: a request of "/metal/ /fife/" might be entered as "metal file", because the association of filing with a metal tool is more common than the musical instrument "fife". [citation needed]
In 2012, reCAPTCHA began using photographs taken from the Google Street View project, in addition to scanned words. [18] It asks the user to identify images of crosswalks, street lights, and other objects. It has been hypothesized that the data is used by Waymo (a Google subsidiary) to train autonomous vehicles, though an unnamed representative has denied this, claiming that as of mid-2021 the data was only being used to improve Google Maps. [19]
Google charges for the use of reCAPTCHA on websites that make over a million reCAPTCHA queries a month. [20]
reCAPTCHA v1 was declared end-of-life and shut down on March 31, 2018. [21]
In 2013, reCAPTCHA began implementing behavioral analysis of the browser's interactions to predict whether the user was a human or a bot. The following year, Google began to deploy a new reCAPTCHA API, featuring the "no CAPTCHA reCAPTCHA"—where users deemed to be of low risk only need to click a single checkbox to verify their identity. A CAPTCHA may still be presented if the system is uncertain of the user's risk; Google also introduced a new type of CAPTCHA challenge designed to be more accessible to mobile users, where the user must select images matching a specific prompt from a grid. [2] [22]
In 2017, Google introduced a new "invisible" reCAPTCHA, where verification occurs in the background, and no challenges are displayed at all if the user is deemed to be of low risk. [23] [24] [25] According to former Google "click fraud czar" Shuman Ghosemajumder, this capability "creates a new sort of challenge that very advanced bots can still get around, but introduces a lot less friction to the legitimate human." [25]
The reCAPTCHA tests are displayed from the central site of the reCAPTCHA project, which supplies the words to be deciphered. This is done through a JavaScript API with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free-of-charge service provided to websites for assistance with the decipherment, [26] but the reCAPTCHA software is not open-source. [27]
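As an illustration of the server-side callback, the sketch below posts a user's response token to Google's `siteverify` endpoint using only the Python standard library. It assumes the verification interface documented for current reCAPTCHA versions (form fields `secret`, `response`, and optional `remoteip`, with a JSON reply containing `success` and `error-codes`), not the retired v1 flow described above; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Verification endpoint documented by Google for current reCAPTCHA versions.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret, token, remote_ip=None):
    """Build the form-encoded POST request sent from the site's server."""
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional: the end user's IP address
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(VERIFY_URL, data=data, method="POST")


def parse_verify_response(body):
    """Return (success, error_codes) from the JSON reply body."""
    result = json.loads(body)
    return bool(result.get("success")), result.get("error-codes", [])


def verify_token(secret, token, remote_ip=None, timeout=5.0):
    """Perform the callback over the network and report whether the token is valid."""
    request = build_verify_request(secret, token, remote_ip)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return parse_verify_response(reply.read().decode("utf-8"))
```

A form handler would call `verify_token(secret, token)` with the site's secret key and the token submitted by the browser, and reject the submission when the first element of the result is `False`. Building and parsing are kept separate from the network call so they can be checked without contacting Google.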
Also, reCAPTCHA offers plugins for several web-application platforms including ASP.NET, Ruby, and PHP, to ease the implementation of the service. [28]
The main purpose of a CAPTCHA system is to block spambots while allowing human users. On December 14, 2009, Jonathan Wilkins released a paper describing weaknesses in reCAPTCHA that allowed bots to achieve a solve rate of 18%. [30] [31] [32]
On August 1, 2010, Chad Houck gave a presentation at the DEF CON 18 hacking conference detailing a method to reverse the distortion added to images, which allowed a computer program to determine a valid response 10% of the time. [33] [34] The reCAPTCHA system was modified on July 21, 2010, before Houck was to speak on his method. Houck adapted his method to the modified system, which he described as an "easier" CAPTCHA, and it determined a valid response 31.8% of the time. Houck also mentioned security defenses in the system, including a high-security lockout if an invalid response is given 32 times in a row. [35]
On May 26, 2012, Adam, C-P, and Jeffball of DC949 gave a presentation at the LayerOne hacker conference detailing how they were able to achieve an automated solution with an accuracy rate of 99.1%. [36] Their tactic was to use techniques from machine learning, a subfield of artificial intelligence, to analyze the audio version of reCAPTCHA, which is available for the visually impaired. Google released a new version of reCAPTCHA just hours before their talk, making major changes to both the audio and visual versions of the service. In this release, the audio version was lengthened from 8 seconds to 30 seconds and made much more difficult to understand, for both humans and bots. In response to this update and the following one, the members of DC949 released two more versions of their tool, Stiltwalker, which beat reCAPTCHA with accuracies of 60.95% and 59.4% respectively. After each successive break, Google updated reCAPTCHA within a few days. According to DC949, Google often reverted to features that had previously been hacked.
On June 27, 2012, Claudia Cruz, Fernando Uceda, and Leobardo Reyes published a paper showing a system running on reCAPTCHA images with an accuracy of 82%. [37] The authors have not said if their system can solve recent reCAPTCHA images, although they claim their work to be intelligent OCR and robust to some, if not all changes in the image database.
In an August 2012 presentation given at BsidesLV 2012, DC949 called the latest version "unfathomably impossible for humans"—they were not able to solve them manually either. [36] The web accessibility organization WebAIM reported in May 2012, "Over 90% of respondents [screen reader users] find CAPTCHA to be very or somewhat difficult". [38]
The original iteration of reCAPTCHA was criticized as being a source of unpaid work to assist in transcribing efforts. [39]
Google has also been criticized for profiting from reCAPTCHA users as unpaid workers who help improve its AI research. [40]
The current iteration of the system has been criticized for its reliance on tracking cookies and promotion of vendor lock-in with Google services; administrators are encouraged to include reCAPTCHA tracking code on all pages of their website to analyze the behavior and "risk" of users, which determines the level of friction presented when a reCAPTCHA prompt is used. [41] Google stated in its privacy policy that user data collected in this manner is not used for personalized advertising. It was also discovered that the system favors those who have an active Google account login, and assigns a higher risk to those using anonymizing proxies and VPN services. [23]
Concerns were raised regarding privacy when Google announced reCAPTCHA v3.0, as it allows Google to track users on non-Google websites. [23]
In April 2020, Cloudflare switched from reCAPTCHA to hCaptcha, citing privacy concerns over Google's potential use of the data it collects through reCAPTCHA for targeted advertising [42] and a desire to cut operating costs, since a considerable portion of Cloudflare's customers are non-paying. In response, Google told PC Magazine that the data from reCAPTCHA is never used for personalized advertising purposes. [20]
Google's help center states that reCAPTCHA is not supported for the deafblind community, [43] effectively locking such users out of all pages that use the service. However, reCAPTCHA does currently have the longest list of accessibility considerations of any CAPTCHA service. [44]
In one variant of the CAPTCHA challenges, images are not incrementally highlighted; instead, each image fades out when clicked and is replaced by a new image fading in, resembling whack-a-mole.
Criticism has been aimed at the long duration taken for the images to fade out and in. [45]
reCAPTCHA also created the Mailhide project, which protects email addresses on web pages from being harvested by spammers. [46] By default, the email address was converted into a format that did not allow a crawler to see the full email address; for example, "mailme@example.com" would have been converted to "mai...@example.com". The visitor would then click on the "..." and solve the CAPTCHA to obtain the full email address. One could also edit the pop-up code so that none of the addresses were visible. Mailhide was discontinued in 2018 because it relied on reCAPTCHA v1. [47]
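The default masking can be illustrated with a short Python sketch. The rule used here (keep the first three characters of the local part, replace the rest with "...") is inferred from the single example above and is an assumption, as is the function name `mask_address`; Mailhide's actual implementation may have differed.

```python
def mask_address(address, visible=3):
    """Hide most of the local part of an email address.

    Assumption: the first `visible` characters of the local part are kept
    and the remainder is replaced by "...", as in the example above.
    """
    local, separator, domain = address.rpartition("@")
    if not separator or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return f"{local[:visible]}...@{domain}"


print(mask_address("mailme@example.com"))  # prints mai...@example.com
```

In the real service, the "..." was a link to a CAPTCHA page that revealed the full address once solved; the sketch covers only the masking step.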