Raku rules

Raku rules are the regular expression, string matching and general-purpose parsing facility of Raku, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.

Raku provides a superset of Perl 5 features with respect to regexes, folding them into a larger framework called rules, which provide the capabilities of a parsing expression grammar, as well as acting as a closure with respect to their lexical scope. [1] Rules are introduced with the rule keyword, which has a usage quite similar to subroutine definitions. Anonymous rules can be introduced with the regex (or rx) keyword, or simply be used inline as regexes were in Perl 5 via the m (matching) or s (substitution) operators.
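For illustration, here is a minimal sketch of these forms; the names greeting, vowel, $pattern, and $text are illustrative only:

my rule  greeting { hello \w+ }     # a named rule, declared much like a subroutine
my regex vowel    { <[aeiou]> }     # a named regex; whitespace in the pattern is ignored

my $pattern = rx/ \d+ /;            # an anonymous regex object created with rx

my $text = "hello world, 42 apples";
say "rule matched"  if $text ~~ / <greeting> /;   # call the named rule inside a match
say "regex matched" if $text ~~ m/ <vowel> /;     # inline match with the m operator
$text ~~ s/ \d+ /many/;                           # inline substitution with the s operator
say $text;                                        # hello world, many apples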

History

In Apocalypse 5, a document outlining the preliminary design decisions for Raku pattern matching, Larry Wall enumerated 20 problems with the "current regex culture". Among these were that Perl's regexes were "too compact and 'cute'", had "too much reliance on too few metacharacters", "little support for named captures", "little support for grammars", and "poor integration with 'real' language". [2]

Between late 2004 and mid-2005, a compiler for Raku-style rules was developed for the Parrot virtual machine, called the Parrot Grammar Engine (PGE), which was later renamed to the more generic Parser Grammar Engine. PGE combines a runtime and a compiler for Raku-style grammars, allowing any Parrot-based compiler to use these tools for parsing and also to provide rules to their runtimes.

Among other Raku features, support for named captures was added to Perl 5.10 in 2007. [3]

In May 2012, the reference implementation of Raku, Rakudo, shipped its Rakudo Star monthly snapshot with a working JSON parser built entirely in Raku rules. [4]

Changes from Perl 5

There are only six unchanged features from Perl 5's regexes:

  - Literals: word characters (letters, digits, and the underscore) match literally
  - Capturing: (...)
  - Alternatives: |
  - Backslash escape: \
  - Repetition quantifiers: *, +, and ? (but not {m,n})
  - Minimal matching suffixes: *?, +?, ??

A few of the most powerful additions include:

  - The ability to reference other named rules with <rulename>, allowing entire grammars to be built up
  - A handful of backtracking-control ("commit") operators that let the programmer manage backtracking during matching

The following changes greatly improve the readability of regexes:

  - Simplified non-capturing groups: [...], equivalent to Perl 5's (?:...)
  - Simplified code assertions: <{...}>
  - Whitespace in a pattern is insignificant by default, allowing multi-line regexes with embedded comments
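For instance, the following sketch combines non-capturing square-bracket groups with named rule references (the names ident and funcall are hypothetical):

my token ident   { <[A..Za..z_]> \w* }                                # a reusable named token
my token funcall { <ident> '(' [ <ident> [ ',' <ident> ]* ]? ')' }   # [...] groups without capturing

say "print(x,y)" ~~ / <funcall> /;   # matches the whole call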

Implicit changes

Some of the features of Perl 5 regular expressions are more powerful in Raku because of their ability to encapsulate the expanded features of Raku rules. For example, in Perl 5, there were positive and negative lookahead operators (?=...) and (?!...). In Raku these same features exist, but are called <before ...> and <!before ...>.

However, because before can encapsulate arbitrary rules, it can be used to express lookahead as a syntactic predicate for a grammar. For example, the following parsing expression grammar describes the classic non-context-free language { a^n b^n c^n : n ≥ 1 }:

S ← &(A !b) a+ B
A ← a A? b
B ← b B? c

In Raku rules that would be:

rule S { <before <A> <!before b>> a+ <B> }
rule A { a <A>? b }
rule B { b <B>? c }

Of course, given the ability to mix rules and regular code, that can be simplified even further:

rule S { (a+) (b+) (c+) <{$0.elems == $1.elems == $2.elems}> } 

However, this makes use of an assertion, a subtly different concept in Raku rules and a more substantially different one in parsing theory, making this a semantic rather than a syntactic predicate. The most important difference in practice is performance: the rule engine has no way of knowing what conditions the assertion may match, so it cannot optimize this process.

Integration with Perl

In many languages, regular expressions are entered as strings, which are then passed to library routines that parse and compile them into an internal state. In Perl 5, regular expressions shared some of the lexical analysis with Perl's scanner. This simplified many aspects of regular expression usage, though it added a great deal of complexity to the scanner. In Raku, rules are part of the grammar of the language. No separate parser exists for rules, as there was in Perl 5. This means that code embedded in rules is parsed at the same time as the rule itself and its surrounding code. For example, it is possible to nest rules and code without re-invoking the parser:

rule ab {
    (a.)    # match "a" followed by any character
    # Then check to see if that character was "b"
    # If so, print a message.
    { $0 ~~ /b {say "found the b"}/ }
}

The above is a single block of Raku code that contains an outer rule definition, an inner block of assertion code, and inside of that a regex that contains one more level of assertion.

Implementation

Keywords

There are several keywords used in conjunction with Raku rules:

regex
A named or anonymous regex that ignores whitespace within the regex by default.
token
A named or anonymous regex that implies the :ratchet modifier.
rule
A named or anonymous regex that implies the :ratchet and :sigspace modifiers.
rx
An anonymous regex that takes arbitrary delimiters such as // where regex only takes braces.
m
An operator form of anonymous regex that performs matches with arbitrary delimiters.
mm
Shorthand for m with the :sigspace modifier.
s
An operator form of anonymous regex that performs substitution with arbitrary delimiters.
ss
Shorthand for s with the :sigspace modifier.
/.../
Simply placing a regex between slashes is shorthand for rx/.../.

Here is an example of typical use:

token word   { \w+ }
rule  phrase { <word> [ \, <word> ]* \. }

if $string ~~ / <phrase> \n / {
    ...
}
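As a complementary sketch (the variable names are illustrative), the operator forms can be used like this:

my $line = "foo, bar, baz.";

my $sep = rx/ \, \s* /;                      # rx builds a regex object for later reuse
say $line.split($sep);                       # (foo bar baz.)

say "found bar" if $line ~~ m/ bar /;        # m performs an immediate match
say "sigspace"  if $line ~~ mm/ baz \. /;    # mm is m with :sigspace implied

my $copy = $line;
$copy ~~ s/ baz /qux/;                       # s substitutes in place
say $copy;                                   # foo, bar, qux.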

Modifiers

Modifiers may be placed after any of the regex keywords, and before the delimiter. If a regex is named, the modifier comes after the name. Modifiers control the way regexes are parsed and how they behave. They are always introduced with a leading : character.

Some of the more important modifiers include:

  - :i (:ignorecase) – match letters without regard to case
  - :g (:global) – find every match rather than stopping at the first
  - :s (:sigspace) – treat whitespace in the pattern as significant (it matches the <.ws> rule)
  - :ratchet – never backtrack into anything the regex has already matched
  - :Perl5 (:P5) – interpret the pattern using Perl 5 regex syntax
  - :ov (:overlap) – return all matches, including overlapping ones
  - :ex (:exhaustive) – return every possible match at every position

For example:

regex addition { :ratchet :sigspace <term> \+ <expr> } 
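For instance, a minimal sketch of a few common modifiers in use ($text and $sum are illustrative names):

my $text = "Perl and PERL and perl";

say $text ~~ m:i/ perl /;                    # :i / :ignorecase – case-insensitive, matches ｢Perl｣
say ($text ~~ m:g:i/ perl /).elems;          # :g / :global – all three matches

my $sum = "1  +  2";
$sum ~~ s:s/ (\d+) \+ (\d+) /$0 plus $1/;    # :s / :sigspace – whitespace in the pattern matches <.ws>
say $sum;                                    # 1 plus 2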

Grammars

A grammar may be defined using the grammar keyword. A grammar is essentially just a namespace for rules:

grammar Str::SprintfFormat {
    regex format_token { \%: <index>? <precision>? <modifier>? <directive> }
    token index { \d+ \$ }
    token precision { <flags>? <vector>? <precision_count> }
    token flags { <[\ +0\#\-]>+ }
    token precision_count { [ <[1-9]>\d* | \* ]? [ \. [ \d* | \* ] ]? }
    token vector { \*? v }
    token modifier { ll | <[lhmVqL]> }
    token directive { <[\%csduoxefgXEGbpniDUOF]> }
}

This is the grammar used to define Perl's sprintf string formatting notation.

Outside of this namespace, you could use these rules like so:

if / <Str::SprintfFormat::format_token> / { ... } 

A rule used in this way is actually identical to the invocation of a subroutine with the extra semantics and side-effects of pattern matching (e.g., rule invocations can be backtracked).
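As a minimal sketch (the grammar Greeting and its rules are hypothetical), a grammar's rules are more commonly invoked through its parse method:

grammar Greeting {
    rule  TOP   { <hello> <name> }           # TOP is the default entry point for .parse
    token hello { :i hello }
    token name  { \w+ }
}

my $match = Greeting.parse("Hello World");
say $match<name>;                             # ｢World｣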

Examples

Here are some example rules in Raku:

rx{ a[b|c](d|e)f:g }
rx{ (ab*)<{$1.size % 2 == 0}> }

That last is identical to:

rx{(ab[bb]*)}

References

  1. Wall, Larry (June 24, 2002). "Synopsis 5: Regexes and Rules".
  2. Wall, Larry (June 4, 2002). "Apocalypse 5: Pattern Matching".
  3. "Perl 5.10 now available". Perl Buzz. Archived 2008-01-09 at the Wayback Machine.
  4. moritz (May 5, 2012). "Rakudo Star 2012.05 released".