ReDoS

Last updated

A regular expression denial of service (ReDoS) [1] is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many [2] regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive. [3] [4]

Contents

Description

Regular expression ("regex") matching can be done by building a finite-state automaton. Regex can be easily converted to nondeterministic automata (NFAs), in which for each state and input symbol, there may be several possible next states. After building the automaton, several possibilities exist:

Of the above algorithms, the first two are problematic. The first is problematic because a deterministic automaton could have up to states where is the number of states in the nondeterministic automaton; thus, the conversion from NFA to DFA may take exponential time. The second is problematic because a nondeterministic automaton could have an exponential number of paths of length , so that walking through an input of length will also take exponential time. [7] The last two algorithms, however, do not exhibit pathological behavior.

Note that for non-pathological regular expressions, the problematic algorithms are usually fast, and in practice, one can expect them to "compile" a regex in O(m) time and match it in O(n) time; instead, simulation of an NFA and lazy computation of the DFA have O(m2n) worst-case complexity. [8] Regex denial of service occurs when these expectations are applied to a regex provided by the user, and malicious regular expressions provided by the user trigger the worst-case complexity of the regex matcher.

While regex algorithms can be written in an efficient way, most regex engines in existence extend the regex languages with additional constructs that cannot always be solved efficiently. Such extended patterns essentially force the implementation of regex in most programming languages to use backtracking.

Examples

Exponential backtracking

The most severe type of problem happens with backtracking regular expression matches, where some patterns have a runtime that is exponential in the length of the input string. [9] For strings of characters, the runtime is . This happens when a regular expression has three properties:

The second condition is best explained with two examples:

In both of these examples we used $ to match the end of the string, satisfying the third condition, but it is also possible to use another character for this. For example (a|aa)*c has the same problematic structure.

All three of the above regular expressions will exhibit exponential runtime when applied to strings of the form . For example, if you try to match them against aaaaaaaaaaaaaaaaaaaaaaaa! on a backtracking expression engine, it will take a significantly long time to complete, and the runtime will approximately double for each extra a before the !.

It is also possible to have backtracking which is polynomial time , instead of exponential. This can also cause problems for long enough inputs, though less attention has been paid to this problem as malicious input must be much longer to have a significant effect. An example of such a pattern is "a*b?a*x", when the input is an arbitrarily long sequence of "a"s.

Vulnerable regexes in online repositories

So-called "evil" or vulnerable regexes have been found in online regular expression repositories. Note that it is enough to find a vulnerable subexpression in order to attack the full regex:

  1. RegExLib, id=1757 (email validation) - see red part
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
  2. OWASP Validation Regex Repository, Java Classname - see red part
    ^(([a-z])+.)+[A-Z]([a-z])+$

These two examples are also susceptible to the input aaaaaaaaaaaaaaaaaaaaaaaa!.

Attacks

If the regex itself is affected by user input, such as a web service permitting clients to provide a search pattern, then an attacker can inject a malicious regex to consume the server's resources. Therefore, in most cases, regular expression denial of service can be avoided by removing the possibility for the user to execute arbitrary patterns on the server. In this case, web applications and databases are the main vulnerable applications. Alternatively, a malicious page could hang the user's web browser or cause it to use arbitrary amounts of memory.

However, if a vulnerable regex exists on the server-side already, then an attacker may instead be able to provide an input that triggers its worst-case behavior. In this case, e-mail scanners and intrusion detection systems could also be vulnerable. Fortunately, in most cases, the problematic regular expressions can be rewritten as "non-evil" patterns. For example, (.*a)+ can be rewritten to ([^a]*a)+.

In the case of a web application, the programmer may use the same regular expression to validate input on both the client and the server side of the system. An attacker could inspect the client code, looking for evil regular expressions, and send crafted input directly to the web server in order to hang it. [10]

See also

Related Research Articles

<span class="mw-page-title-main">Pushdown automaton</span> Type of automaton

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

Flex is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers . It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems, or together with GNU bison in *BSD ports and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, although a manual for Flex was produced and published by the Free Software Foundation.

<span class="mw-page-title-main">Deterministic finite automaton</span> Finite-state machine

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if

In the theory of computation, a generalized nondeterministic finite automaton (GNFA), also known as an expression automaton or a generalized nondeterministic finite state machine, is a variation of a nondeterministic finite automaton (NFA) where each transition is labeled with any regular expression. The GNFA reads blocks of symbols from the input which constitute a string as defined by the regular expression on the transition. There are several differences between a standard finite state machine and a generalized nondeterministic finite state machine. A GNFA must have only one start state and one accept state, and these cannot be the same state, whereas an NFA or DFA both may have several accept states, and the start state can be an accept state. A GNFA must have only one transition between any two states, whereas a NFA or DFA both allow for numerous transitions between states. In a GNFA, a state has a single transition to every state in the machine, although often it is a convention to ignore the transitions that are labelled with the empty set when drawing generalized nondeterministic finite state machines.

In automata theory, an alternating finite automaton (AFA) is a nondeterministic finite automaton whose transitions are divided into existential and universal transitions. For example, let A be an alternating automaton.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the same formal language. It is important in theory because it establishes that NFAs, despite their additional flexibility, are unable to recognize any language that cannot be recognized by some DFA. It is also important in practice for converting easier-to-construct NFAs into more efficiently executable DFAs. However, if the NFA has n states, the resulting DFA may have up to 2n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs.

In computer science, in particular in automata theory, a two-way finite automaton is a finite automaton that is allowed to re-read its input.

In mathematics and computer science, the probabilistic automaton (PA) is a generalization of the nondeterministic finite automaton; it includes the probability of a given transition into the transition function, turning it into a transition matrix. Thus, the probabilistic automaton also generalizes the concepts of a Markov chain and of a subshift of finite type. The languages recognized by probabilistic automata are called stochastic languages; these include the regular languages as a subset. The number of stochastic languages is uncountable.

<span class="mw-page-title-main">DFA minimization</span>

In automata theory, DFA minimization is the task of transforming a given deterministic finite automaton (DFA) into an equivalent DFA that has a minimum number of states. Here, two DFAs are called equivalent if they recognize the same regular language. Several different algorithms accomplishing this task are known and described in standard textbooks on automata theory.

In computer science, Thompson's construction algorithm, also called the McNaughton–Yamada–Thompson algorithm, is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). This NFA can be used to match strings against the regular expression. This algorithm is credited to Ken Thompson.

In computational learning theory, induction of regular languages refers to the task of learning a formal description of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way, approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction.

In theoretical computer science, in particular in formal language theory, Kleene's algorithm transforms a given nondeterministic finite automaton (NFA) into a regular expression. Together with other conversion algorithms, it establishes the equivalence of several description formats for regular languages. Alternative presentations of the same method include the "elimination method" attributed to Brzozowski and McCluskey, the algorithm of McNaughton and Yamada, and the use of Arden's lemma.

In automata theory, an unambiguous finite automaton (UFA) is a nondeterministic finite automaton (NFA) such that each word has at most one accepting path. Each deterministic finite automaton (DFA) is an UFA, but not vice versa. DFA, UFA, and NFA recognize exactly the same class of formal languages. On the one hand, an NFA can be exponentially smaller than an equivalent DFA. On the other hand, some problems are easily solved on DFAs and not on UFAs. For example, given an automaton A, an automaton A which accepts the complement of A can be computed in linear time when A is a DFA, whereas it is known that this cannot be done in polynomial time for UFAs. Hence UFAs are a mix of the worlds of DFA and of NFA; in some cases, they lead to smaller automata than DFA and quicker algorithms than NFA.

State complexity is an area of theoretical computer science dealing with the size of abstract automata, such as different kinds of finite automata. The classical result in the area is that simulating an -state nondeterministic finite automaton by a deterministic finite automaton requires exactly states in the worst case.

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.

In the automata theory, a tagged deterministic finite automaton (TDFA) is an extension of deterministic finite automaton (DFA). In addition to solving the recognition problem for regular languages, TDFA is also capable of submatch extraction and parsing. While canonical DFA can find out if a string belongs to the language defined by a regular expression, TDFA can also extract substrings that match specific subexpressions. More generally, TDFA can identify positions in the input string that match tagged positions in a regular expression.

References

  1. OWASP (2010-02-10). "Regex Denial of Service" . Retrieved 2010-04-16.
  2. Davis, James; Louis, Michael; Coghlan, Christy; Servant, Francisco; Lee, Dongyoon (2019). "Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions" (PDF). The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering: 443–454.
  3. RiverStar Software (2010-01-18). "Security Bulletin: Caution Using Regular Expressions". Archived from the original on 2011-07-15. Retrieved 2010-04-16.
  4. Ristic, Ivan (2010-03-15). ModSecurity Handbook. London, UK: Feisty Duck Ltd. p. 173. ISBN   978-1-907117-02-2. Archived from the original on 2016-08-08. Retrieved 2010-04-16.
  5. Crosby and Wallach, Usenix Security (2003). "Regular Expression Denial Of Service". Archived from the original on 2005-03-01. Retrieved 2010-01-13.
  6. Bryan Sullivan, MSDN Magazine (2010-05-03). "Regular Expression Denial of Service Attacks and Defenses" . Retrieved 2010-05-06.{{cite web}}: External link in |author= (help)CS1 maint: numeric names: authors list (link)
  7. Kirrage, J.; Rathnayake, A.; Thielecke, H. (2013). "Static Analysis for Regular Expression Denial-of-Service Attacks". Network and System Security. Madrid, Spain: Springer. pp. 135–148. arXiv: 1301.0849 . doi:10.1007/978-3-642-38631-2_11.
  8. Lazy computation of the DFA can usually reach the speed of deterministic automatons while keeping worst case behavior similar to simulation of an NFA. However, it is considerably more complex to implement and can use more memory.
  9. Jim Manico and Adar Weidman (2009-12-07). "OWASP Podcast 56 (ReDoS)" . Retrieved 2010-04-02.
  10. Barlas, Efe; Du, Xin; Davis, James (2022). "Exploiting Input Sanitization for Regex Denial of Service" (PDF). ACM/IEEE International Conference on Software Engineering: 1–14. arXiv: 2303.01996 .