Leaning toothpick syndrome

In computer programming, leaning toothpick syndrome (LTS) is the situation in which a quoted expression becomes unreadable because it contains a large number of escape characters, usually backslashes ("\"), to avoid delimiter collision. [1] [2]

The official Perl documentation [3] brought the term into wider use; there, the phrase describes regular expressions that match Unix-style paths, whose elements are separated by slashes (/). The slash is also the default regular expression delimiter, so to be used literally in the expression it must be escaped with a backslash (\), leading to frequent escaped slashes written as \/. If doubled, as in URLs, this yields \/\/ for an escaped //. A similar phenomenon occurs for DOS/Windows paths, where the backslash is the path separator and must be doubled (\\) in an escaped string; if that string in turn holds a regular expression, the pair must be escaped again, so \\\\ is needed to match a single backslash. In the extreme case of a regular expression in an escaped string that matches a Uniform Naming Convention path (which begins with \\), 8 backslashes (\\\\\\\\) are required, because each of the 2 leading backslashes is escaped twice.
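
The two escaping levels can be counted in a short Python sketch. Python is used here only for illustration; the path and the variable names are hypothetical and do not come from the Perl documentation.

```python
import re

# Hypothetical UNC path, used only for illustration.
unc_path = r"\\server\share\file.txt"

# Level 0: the text to match starts with 2 backslashes.
# Level 1: a regex must escape each of them, giving 4 backslashes.
# Level 2: an escaped (non-raw) string literal doubles them again, giving 8.
escaped_pattern = "^\\\\\\\\"
raw_pattern = r"^\\\\"  # the same regex without the string-level escaping

same_pattern = escaped_pattern == raw_pattern
leading = re.match(escaped_pattern, unc_path).group(0)

print(same_pattern, len(leading))  # prints: True 2
```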

LTS appears in many programming languages and in many situations, including in patterns that match Uniform Resource Identifiers (URIs) and in programs that output quoted text. Many quines fall into the latter category.

Pattern example

Consider the following Perl regular expression intended to match URIs that identify files under the pub directory of an FTP site:

m/ftp:\/\/[^\/]*\/pub\//

Perl, like sed before it, solves this problem by allowing many other characters to be delimiters for a regular expression. For example, the following three examples are equivalent to the expression given above:

m{ftp://[^/]*/pub/}
m#ftp://[^/]*/pub/#
m!ftp://[^/]*/pub/!

Or this common translation to convert backslashes to forward slashes:

tr/\\/\//

may be easier to understand when written like this:

tr{\\}{/}
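
For comparison, a language whose regular expressions are ordinary strings has no slash delimiter to collide with. The following Python sketch shows the same pattern and the same backslash-to-slash translation; the URL and variable names are assumptions made up for the example.

```python
import re

# No delimiter surrounds the pattern, so "/" needs no escaping.
pub_uri = re.compile(r"ftp://[^/]*/pub/")

url = "ftp://ftp.example.org/pub/readme.txt"  # hypothetical URL
found = pub_uri.match(url) is not None

# Rough equivalent of tr{\\}{/}: only the backslash itself is escaped.
windows_path = r"C:\Foo\Bar.txt"
unix_style = windows_path.replace("\\", "/")
```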

Quoted-text example

A Perl program to print an HTML link tag, where the URL and link text are stored in variables $url and $text respectively, might look like this. Notice the use of backslashes to escape the quoted double-quote characters:

print "<a href=\"$url\">$text</a>";

Using single quotes to delimit the string is not feasible, as Perl does not expand variables inside single-quoted strings. The code below, for example, would not work as intended:

print '<a href="$url">$text</a>'

Using the printf function is a viable solution in many languages (Perl, C, PHP):

printf('<a href="%s">%s</a>', $url, $text);
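
The same trade-off can be sketched in Python, where either quote character may delimit a string. The values of url and text below are placeholders chosen for the example.

```python
url = "https://example.org/"  # hypothetical values
text = "Example"

# Double-quoted literal: every embedded double quote needs a backslash.
escaped = "<a href=\"%s\">%s</a>" % (url, text)

# Single-quoted literal with printf-style placeholders: no escaping needed.
formatted = '<a href="%s">%s</a>' % (url, text)

# Single-quoted f-string: interpolation still works, unlike Perl's single quotes.
interpolated = f'<a href="{url}">{text}</a>'
```

All three expressions build the same string; only the first one needs backslashes.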

The qq operator in Perl allows for any delimiter:

print qq{<a href="$url">$text</a>};
print qq|<a href="$url">$text</a>|;
print qq(<a href="$url">$text</a>);

Here documents are especially well suited for multi-line strings; however, Perl here documents did not allow proper indentation before v5.26. [4] This example shows the Perl syntax:

print <<HERE_IT_ENDS;
<a href="$url">$text</a>
HERE_IT_ENDS
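
A comparable multi-line form in Python is the triple-quoted string. The sketch below assumes that textwrap.dedent is an acceptable way to keep the literal indented in the source; link_block is a hypothetical helper name.

```python
import textwrap


def link_block(url: str, text: str) -> str:
    """Build a multi-line link; embedded double quotes need no escaping."""
    # The backslash after the opening quotes suppresses the first newline.
    return textwrap.dedent(f"""\
        <a href="{url}">
          {text}
        </a>
        """)


print(link_block("https://example.org/", "Example"), end="")
```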

Other languages

C#

The C# programming language handles LTS by the use of the @ symbol at the start of string literals, before the initial quotation marks, e.g.

string filePath = @"C:\Foo\Bar.txt";

rather than otherwise requiring:

string filePath = "C:\\Foo\\Bar.txt";

C++

The C++11 standard adds raw strings:

std::string filePath = R"(C:\Foo\Bar.txt)";

If the string contains the characters )", an optional delimiter can be used, such as d in the following example:

std::regex re{R"d(s/"\([^"]*\)"/'\1'/g)d"};

Go

Go indicates that a string is raw by using the backtick as a delimiter:

s := `C:\Foo\Bar.txt`

Raw strings may contain any character except backticks; there is no escape code for a backtick in a raw string. Raw strings may also span multiple lines, as in this example, where the strings s and t are equivalent:

s := `A string that
spans multiple
lines.`
t := "A string that\nspans multiple\nlines."

Python

Python has a similar construct using r:

filePath = r"C:\Foo\Bar.txt"

One can also use them together with triple quotes:

example = r"""First line : "C:\Foo\Bar.txt"
Second line : nothing"""
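
A slightly fuller sketch shows that the raw and escaped spellings denote the same string, and illustrates one caveat of Python raw literals: they cannot end in an odd number of backslashes. The variable names are chosen only for this example.

```python
import re

escaped = "C:\\Foo\\Bar.txt"
raw = r"C:\Foo\Bar.txt"
same_path = escaped == raw  # both literals denote the same 14 characters

# In a regex a literal backslash is written as two backslashes;
# the raw literal keeps the pattern at exactly those two characters.
parts = re.split(r"\\", raw)

# Caveat: r"C:\Foo\" is a syntax error, because the final backslash
# would escape the closing quote. One workaround is to concatenate
# an ordinary escaped literal for the trailing backslash.
folder = r"C:\Foo" "\\"
```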

Ruby

Ruby uses single quotes to indicate a raw string (only \\ and \' are still treated as escapes):

filePath = 'C:\Foo\Bar.txt'

It also has regex percent literals with choice of delimiter like Perl:

%r{ftp://[^/]*/pub/}
%r#ftp://[^/]*/pub/#
%r!ftp://[^/]*/pub/!

Rust

Rust uses a variant of the r prefix: [5]

"\x52";               // R
r"\x52";              // \x52
r#""foo""#;           // "foo"
r##"foo #"# bar"##;   // foo #"# bar

The literal starts with r followed by any number of #, followed by one ". Further " contained in the literal are considered part of the literal, unless followed by at least as many # as used after the opening r. As such, a string literal opened with r#" cannot have "# in its content.
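
The rule above implies a simple way to pick the number of # characters for arbitrary content: use one more than the longest run of # that directly follows any " in the text. The Python sketch below applies that idea; rust_raw_literal is a hypothetical helper, not part of any Rust tooling, and it ignores other restrictions Rust may place on raw literals.

```python
import re


def rust_raw_literal(content: str) -> str:
    """Wrap content in a Rust raw string literal with just enough '#' marks."""
    # Length of the run of '#' following each double quote in the content.
    runs = [len(m.group(1)) for m in re.finditer(r'"(#*)', content)]
    # No quotes at all: plain r"..." suffices. Otherwise use one more '#'
    # than the longest run, so no inner quote can close the literal early.
    hashes = "#" * (max(runs) + 1) if runs else ""
    return f'r{hashes}"{content}"{hashes}'


print(rust_raw_literal('foo #"# bar'))  # prints: r##"foo #"# bar"##
```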

Scala

Scala allows usage of triple quotes in order to prevent escaping confusion:

val filePath = """C:\Foo\Bar.txt"""
val pubPattern = """ftp://[^/]*/pub/""".r

The triple quotes also allow for multiline strings, as shown here:

val text = """First line,
second line."""

Sed

Sed regular expressions, particularly those using the "s" operator, are very similar to Perl's (sed is a predecessor of Perl). The default delimiter is "/", but any delimiter can be used: the usual form is s/regexp/replacement/, but s:regexp:replacement: is equally valid. For example, to match a "pub" directory (as in the Perl example) and replace it with "foo", the default form (escaping the slashes) is

s/ftp:\/\/[^\/]*\/pub\//foo/

Using an exclamation point ("!") as delimiter instead yields

s!ftp://[^/]*/pub/!foo!
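
A script that generates sed commands can apply the same idea mechanically: try a few candidate delimiters and fall back to escaping only if all of them collide. The Python sketch below is a simplified illustration; sed_substitute and its candidate list are assumptions, and the fallback naively escapes every occurrence of the delimiter.

```python
def sed_substitute(pattern: str, replacement: str, candidates: str = "/!#,") -> str:
    """Build an s command, preferring a delimiter that needs no escaping."""
    for delim in candidates:
        if delim not in pattern and delim not in replacement:
            return f"s{delim}{pattern}{delim}{replacement}{delim}"
    # Every candidate collides: keep the first one and escape it (leaning toothpicks).
    delim = candidates[0]

    def escape(part: str) -> str:
        return part.replace(delim, "\\" + delim)

    return f"s{delim}{escape(pattern)}{delim}{escape(replacement)}{delim}"


print(sed_substitute("ftp://[^/]*/pub/", "foo"))  # prints: s!ftp://[^/]*/pub/!foo!
```

Restricting candidates to "/" reproduces the escaped form shown first.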

References

  1. Andy Lester, Richard Foley (2005). Pro Perl Debugging. Andy Lester, Richard Foley. p. 176. ISBN 1-59059-454-1.
  2. Daniel Goldman (February 2013). Definitive Guide to sed. EHDP Press. ISBN 978-1-939824-00-4.
  3. perlop at perldoc.perl.org.
  4. Indented Here documents
  5. raw byte string literals at rust-lang.org.