Bot prevention refers to the methods used by web services to prevent access by automated processes.
Studies suggest that over half of the traffic on the internet is bot activity, of which over half is further classified as 'bad bots'. [1]
Bots are used for various purposes online. Some bots are used passively for web scraping purposes, for example, to gather information from airlines about flight prices and destinations. Other bots, such as sneaker bots, help the bot operator acquire high-demand luxury goods; sometimes these are resold on the secondary market at higher prices, in what is commonly known as 'scalping'. [2] [3] [4]
Various fingerprinting and behavioural techniques are used to identify whether the client is a human user or a bot. In turn, bots use a range of techniques to avoid detection and appear like a human to the server. [2]
Browser fingerprinting techniques are the most common component in anti-bot protection systems. Data is usually collected through client-side JavaScript which is then transmitted to the anti-bot service for analysis. The data collected includes results from JavaScript APIs (checking if a given API is implemented and returns the results expected from a normal browser), rendering complex WebGL scenes, and using the Canvas API. [1] [5] TLS fingerprinting techniques categorise the client by analysing the supported cipher suites during the SSL handshake. [6] These fingerprints can be used to create whitelists/blacklists containing fingerprints of known browser stacks. [7] In 2017, Salesforce open sourced its TLS fingerprinting library (JA3). [8] Between August and September 2018, Akamai noticed a large increase in TLS tampering across its network to evade detection. [9] [7]
Behaviour-based techniques are also utilised, although less commonly than fingerprinting techniques, and rely on the idea that bots behave differently to human visitors. A common behavioural approach is to analyse a client's mouse movements and determine if they are typical of a human. [1] [10]
More traditional techniques such as CAPTCHAs are also often employed, however they are generally considered ineffective while simultaneously obtrusive to human visitors. [11]
The use of JavaScript can prevent some bots that rely on basic requests (such as via cURL), as these will not load the detection script and hence will fail to progress. [1] A common method to bypass many techniques is to use a headless browser to simulate a real web browser and execute the client-side JavaScript detection scripts. [2] [1] There are a variety of headless browsers that are used; some are custom (such as PhantomJS) but it is also possible to operate typical browsers such as Google Chrome in headless mode using a driver. Selenium is a common web automation framework that makes it easier to control the headless browser. [5] [1] Anti-bot detection systems attempt to identify the implementation of methods specific to these headless browsers, or the lack of proper implementation of APIs that would be implemented in regular web browsers. [1]
The source code of these JavaScript files is typically obfuscated to make it harder to reverse engineer how the detection works. [5] Common techniques include: [12]
debugger
statements, to prevent use of debuggers like DevTools Anti-bot protection services are offered by various internet companies, such as Cloudflare [13] and Akamai. [14] [15]
In the United States, the Better Online Tickets Sales Act (commonly known as the BOTS Act) was passed in 2016 to prevent some uses of bots in commerce. [16] A year later, the United Kingdom passed similar regulations in the Digital Economy Act 2017. [17] [18] The effectiveness of these measures is disputed. [19]
Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP). It uses encryption for secure communication over a computer network, and is widely used on the Internet. In HTTPS, the communication protocol is encrypted using Transport Layer Security (TLS) or, formerly, Secure Sockets Layer (SSL). The protocol is therefore also referred to as HTTP over TLS, or HTTP over SSL.
In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource. It improves privacy, security, and performance in the process.
In cryptography and computer security, a man-in-the-middle (MITM) attack is a cyberattack where the attacker secretly relays and possibly alters the communications between two parties who believe that they are directly communicating with each other, as the attacker has inserted themselves between the two parties.
Transport Layer Security (TLS) is a cryptographic protocol designed to provide communications security over a computer network. The protocol is widely used in applications such as email, instant messaging, and voice over IP, but its use in securing HTTPS remains the most publicly visible.
A public key infrastructure (PKI) is a set of roles, policies, hardware, software and procedures needed to create, manage, distribute, use, store and revoke digital certificates and manage public-key encryption. The purpose of a PKI is to facilitate the secure electronic transfer of information for a range of network activities such as e-commerce, internet banking and confidential email. It is required for activities where simple passwords are an inadequate authentication method and more rigorous proof is required to confirm the identity of the parties involved in the communication and to validate the information being transferred.
Address munging is the practice of disguising an e-mail address to prevent it from being automatically collected by unsolicited bulk e-mail providers. Address munging is intended to disguise an e-mail address in a way that prevents computer software from seeing the real address, or even any address at all, but still allows a human reader to reconstruct the original and contact the author: an email address such as, "no-one@example.com", becomes "no-one at example dot com", for instance.
A content delivery network, or content distribution network (CDN), is a geographically distributed network of proxy servers and their data centers. The goal is to provide high availability and performance by distributing the service spatially relative to end users. CDNs came into existence in the late 1990s as a means for alleviating the performance bottlenecks of the Internet as the Internet was starting to become a mission-critical medium for people and enterprises. Since then, CDNs have grown to serve a large portion of the Internet content today, including web objects, downloadable objects, applications, live streaming media, on-demand streaming media, and social media sites.
In computer security, a drive-by download is the unintended download of software, typically malicious software. The term "drive-by download" usually refers to a download which was authorized by a user without understanding what is being downloaded, such as in the case of a Trojan virus. In other cases, the term may simply refer to a download which occurs without a user's knowledge. Common types of files distributed in drive-by download attacks include computer viruses, spyware, or crimeware.
Browser sniffing is a set of techniques used in websites and web applications in order to determine the web browser a visitor is using, and to serve browser-appropriate content to the visitor. It is also used to detect mobile browsers and send them mobile-optimized websites. This practice is sometimes used to circumvent incompatibilities between browsers due to misinterpretation of HTML, Cascading Style Sheets (CSS), or the Document Object Model (DOM). While the World Wide Web Consortium maintains up-to-date central versions of some of the most important Web standards in the form of recommendations, in practice no software developer has designed a browser which adheres exactly to these standards; implementation of other standards and protocols, such as SVG and XMLHttpRequest, varies as well. As a result, different browsers display the same page differently, and so browser sniffing was developed to detect the web browser in order to help ensure consistent display of content.
In computer science, session hijacking, sometimes also known as cookie hijacking, is the exploitation of a valid computer session—sometimes also called a session key—to gain unauthorized access to information or services in a computer system. In particular, it is used to refer to the theft of a magic cookie used to authenticate a user to a remote server. It has particular relevance to web developers, as the HTTP cookies used to maintain a session on many websites can be easily stolen by an attacker using an intermediary computer or with access to the saved cookies on the victim's computer. After successfully stealing appropriate session cookies an adversary might use the Pass the Cookie technique to perform session hijacking. Cookie hijacking is commonly used against client authentication on the internet. Modern web browsers use cookie protection mechanisms to protect the web from being attacked.
HTTP cookies are small blocks of data created by a web server while a user is browsing a website and placed on the user's computer or other device by the user's web browser. Cookies are placed on the device used to access a website, and more than one cookie may be placed on a user's device during a session.
Server Name Indication (SNI) is an extension to the Transport Layer Security (TLS) computer networking protocol by which a client indicates which hostname it is attempting to connect to at the start of the handshaking process. The extension allows a server to present one of multiple possible certificates on the same IP address and TCP port number and hence allows multiple secure (HTTPS) websites to be served by the same IP address without requiring all those sites to use the same certificate. It is the conceptual equivalent to HTTP/1.1 name-based virtual hosting, but for HTTPS. This also allows a proxy to forward client traffic to the right server during TLS/SSL handshake. The desired hostname is not encrypted in the original SNI extension, so an eavesdropper can see which site is being requested. The SNI extension was specified in 2003 in RFC 3546
A device fingerprint or machine fingerprint is information collected about the software and hardware of a remote computing device for the purpose of identification. The information is usually assimilated into a brief identifier using a fingerprinting algorithm. A browser fingerprint is information collected specifically by interaction with the web browser of the device.
Mibbit is a web-based client for web browsers that supports Internet Relay Chat (IRC), Yahoo! Messenger, and Twitter. It is developed by Jimmy Moore and is designed around the Ajax model with a user interface written in JavaScript. It is the IRC application setup by default on Firefox.
Network forensics is a sub-branch of digital forensics relating to the monitoring and analysis of computer network traffic for the purposes of information gathering, legal evidence, or intrusion detection. Unlike other areas of digital forensics, network investigations deal with volatile and dynamic information. Network traffic is transmitted and then lost, so network forensics is often a pro-active investigation.
An application programming interface (API) is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how to build or use such a connection or interface is called an API specification. A computer system that meets this standard is said to implement or expose an API. The term API may refer either to the specification or to the implementation. Whereas a system's user interface dictates how its end-users interact with the system in question, its API dictates how to write code that takes advantage of that system's capabilities.
CRIME is a security vulnerability in HTTPS and SPDY protocols that utilize compression, which can leak the content of secret web cookies. When used to recover the content of secret authentication cookies, it allows an attacker to perform session hijacking on an authenticated web session, allowing the launching of further attacks. CRIME was assigned CVE-2012-4929.
WebAssembly defines a portable binary-code format and a corresponding text format for executable programs as well as software interfaces for facilitating interactions between such programs and their host environment.
Differential testing, also known as differential fuzzing, is a popular software testing technique that attempts to detect bugs, by providing the same input to a series of similar applications, and observing differences in their execution. Differential testing complements traditional software testing, because it is well-suited to find semantic or logic bugs that do not exhibit explicit erroneous behaviors like crashes or assertion failures. Differential testing is sometimes called back-to-back testing.