Billion laughs attack

Last updated

In computer security, a billion laughs attack is a type of denial-of-service (DoS) attack which is aimed at parsers of XML documents. [1]

Contents

It is also referred to as an XML bomb or as an exponential entity expansion attack. [2]

Details

The example attack consists of defining 10 entities, each defined as consisting of 10 of the previous entity, with the document consisting of a single instance of the largest entity, which expands to one billion copies of the first entity.

In the most frequently cited example, the first entity is the string "lol", hence the name "billion laughs". At the time this vulnerability was first reported, the computer memory used by a billion instances of the string "lol" would likely exceed that available to the process parsing the XML.

While the original form of the attack was aimed specifically at XML parsers, the term may be applicable to similar subjects as well. [1]

The problem was first reported as early as 2002, [3] but began to be widely addressed in 2008. [4]

Defenses against this kind of attack include capping the memory allocated in an individual parser if loss of the document is acceptable, or treating entities symbolically and expanding them lazily only when (and to the extent) their content is to be used.

Code example

<?xmlversion="1.0"?><!DOCTYPElolz[<!ENTITYlol"lol"><!ELEMENTlolz(#PCDATA)><!ENTITYlol1"&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;"><!ENTITYlol2"&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;"><!ENTITYlol3"&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;"><!ENTITYlol4"&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;"><!ENTITYlol5"&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;"><!ENTITYlol6"&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;"><!ENTITYlol7"&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;"><!ENTITYlol8"&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;"><!ENTITYlol9"&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">]><lolz>&lol9;</lolz>

When an XML parser loads this document, it sees that it includes one root element, "lolz", that contains the text "&lol9;". However, "&lol9;" is a defined entity that expands to a string containing ten "&lol8;" strings. Each "&lol8;" string is a defined entity that expands to ten "&lol7;" strings, and so on. After all the entity expansions have been processed, this small (< 1 KB) block of XML will actually contain 109 = a billion "lol"s, taking up almost 3 gigabytes of memory. [5]

Variations

The billion laughs attack described above can take an exponential amount of space or time. The quadratic blowup variation causes quadratic growth in resource requirements by simply repeating a large entity over and over again, to avoid countermeasures that detect heavily nested entities. [6] (See computational complexity theory for comparisons of different growth classes.)

A "billion laughs" attack could exist for any file format that can contain macro expansions, for example this YAML bomb:

a:&a["lol","lol","lol","lol","lol","lol","lol","lol","lol"]b:&b[*a,*a,*a,*a,*a,*a,*a,*a,*a]c:&c[*b,*b,*b,*b,*b,*b,*b,*b,*b]d:&d[*c,*c,*c,*c,*c,*c,*c,*c,*c]e:&e[*d,*d,*d,*d,*d,*d,*d,*d,*d]f:&f[*e,*e,*e,*e,*e,*e,*e,*e,*e]g:&g[*f,*f,*f,*f,*f,*f,*f,*f,*f]h:&h[*g,*g,*g,*g,*g,*g,*g,*g,*g]i:&i[*h,*h,*h,*h,*h,*h,*h,*h,*h]

This crashed earlier versions of Go because the Go YAML processor (contrary to the YAML spec) expands references as if they were macros. The Go YAML processor was modified to fail parsing if the result object becomes too large.

Enterprise software like Kubernetes has been affected by this attack through its YAML parser. [7] [8] For this reason, either a parser with intentionally limited capabilities is preferred (like StrictYAML) or file formats that do not allow references are often preferred for data arriving from untrusted sources. [9] [ failed verification ]

See also

Related Research Articles

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

A document type definition (DTD) is a specification file that contains set of markup declarations that define a document type for an SGML-family markup language. The DTD specification file can be used to validate documents.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

Cross-site scripting (XSS) is a type of security vulnerability that can be found in some web applications. XSS attacks enable attackers to inject client-side scripts into web pages viewed by other users. A cross-site scripting vulnerability may be used by attackers to bypass access controls such as the same-origin policy. During the second half of 2007, 11,253 site-specific cross-site vulnerabilities were documented by XSSed, compared to 2,134 "traditional" vulnerabilities documented by Symantec. XSS effects vary in range from petty nuisance to significant security risk, depending on the sensitivity of the data handled by the vulnerable site and the nature of any security mitigation implemented by the site's owner network.

YAML(see § History and name) is a human-readable data serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses both Python-style indentation to indicate nesting, and a more compact format that uses [...] for lists and {...} for maps but forbids tab characters to use as indentation thus only some JSON files are valid YAML 1.2.

<span class="mw-page-title-main">Configuration file</span> Software file used to configure the initial settings for a computer program

In computing, configuration files are files used to configure the parameters and initial settings for some computer programs. They are used for user applications, server processes and operating system settings.

XML data binding refers to a means of representing information in an XML document as a business object in computer memory. This allows applications to access the data in the XML from the object, rather than using the DOM or SAX to retrieve the data from a direct representation of the XML itself.

In computer science, canonicalization is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

<span class="mw-page-title-main">JSON</span> Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a common data format with diverse uses in electronic data interchange, including that of web applications with servers.

In the Standard Generalized Markup Language (SGML), an entity is a primitive data type, which associates a string with either a unique alias or an SGML reserved word. Entities are foundational to the organizational structure and definition of SGML documents. The SGML specification defines numerous entity types, which are distinguished by keyword qualifiers and context. An entity string value may variously consist of plain text, SGML tags, and/or references to previously defined entities. Certain entity types may also invoke external documents. Entities are called by reference.

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

OmniMark is a fourth-generation programming language used mostly in the publishing industry. It is currently a proprietary software product of Stilo International. As of July 2022, the most recent release of OmniMark was 11.0.

A Canonical S-expression is a binary encoding form of a subset of general S-expression. It was designed for use in SPKI to retain the power of S-expressions and ensure canonical form for applications such as digital signatures while achieving the compactness of a binary form and maximizing the speed of parsing.

Data exchange is the process of taking data structured under a source schema and transforming it into a target schema, so that the target data is an accurate representation of the source data. Data exchange allows data to be shared between different computer programs.

JsonML, the JSON Markup Language is a lightweight markup language used to map between XML and JSON. It converts an XML document or fragment into a JSON data structure for ease of use within JavaScript environments such as a web browser, allowing manipulation of XML data without the overhead of an XML parser.

Virtual Token Descriptor for eXtensible Markup Language (VTD-XML) refers to a collection of cross-platform XML processing technologies centered on a non-extractive XML, "document-centric" parsing technique called Virtual Token Descriptor (VTD). Depending on the perspective, VTD-XML can be viewed as one of the following:

XML External Entity attack, or simply XXE attack, is a type of attack against an application that parses XML input. This attack occurs when XML input containing a reference to an external entity is processed by a weakly configured XML parser. This attack may lead to the disclosure of confidential data, DoS attacks, server-side request forgery, port scanning from the perspective of the machine where the parser is located, and other system impacts.

<span class="mw-page-title-main">Shellshock (software bug)</span> Security bug in the Unix Bash shell discovered in 2014

Shellshock, also known as Bashdoor, is a family of security bugs in the Unix Bash shell, the first of which was disclosed on 24 September 2014. Shellshock could enable an attacker to cause Bash to execute arbitrary commands and gain unauthorized access to many Internet-facing services, such as web servers, that use Bash to process requests.

XPath 3 is the latest version of the XML Path Language, a query language for selecting nodes in XML documents. It supersedes XPath 1.0 and XPath 2.0.

References

  1. 1 2 Harold, Elliotte Rusty (27 May 2005). "Tip: Configure SAX parsers for secure processing". IBM developerWorks . Archived from the original on 5 October 2010. Retrieved 4 March 2011.
  2. Sullivan, Bryan (November 2009). "XML Denial of Service Attacks and Defenses". MSDN Magazine. Microsoft Corporation . Retrieved 2011-05-31.
  3. "SecurityFocus". 2002-12-16. Archived from the original on 2021-04-16. Retrieved 2015-07-03.
  4. "CVE-2003-1564". Common Vulnerabilities and Exposures. The MITRE Corporation. 2003-02-02. Retrieved 2011-06-01.
  5. Bryan Sullivan. "XML Denial of Service Attacks and Defenses" . Retrieved 2011-12-21.
  6. "19.5. XML Processing Modules — Python 2.7.18 documentation".
  7. "CVE-2019-11253: Kubernetes API Server JSON/YAML parsing vulnerable to resource exhaustion attack · Issue #83253 · kubernetes/Kubernetes". GitHub .
  8. Wallen, Jack (9 October 2019). "Kubernetes 'Billion Laughs' Vulnerability Is No Laughing Matter". The New Stack.
  9. "XML is toast, long live JSON". 9 June 2016.