Lexical grammar

Last updated

In computer science, a lexical grammar is a formal grammar defining the syntax of tokens. The program is written using characters that are defined by the lexical structure of the language used. The character set is equivalent to the alphabet used by any written language. The lexical grammar lays down the rules governing how a character sequence is divided up into subsequences of characters, each part of which represents an individual token. This is frequently defined in terms of regular expressions. [1]

Contents

For instance, the lexical grammar for many programming languages specifies that a string literal starts with a " character and continues until a matching " is found (escaping makes this more complicated), that an identifier is an alphanumeric sequence (letters and digits, usually also allowing underscores, and disallowing initial digits), and that an integer literal is a sequence of digits. So in the following character sequence "abc" xyz1 23 the tokens are string, identifier and number (plus whitespace tokens) because the space character terminates the sequence of characters forming the identifier. Further, certain sequences are categorized as keywords – these generally have the same form as identifiers (usually alphabetical words), but are categorized separately; formally they have a different token type. [2]

Examples

Regular expressions for common lexical rules follow (for example, C).

Unescaped string literal (quote, followed by non-quotes, ending in a quote):

"[^"]*"

Escaped string literal (quote, followed by escaped characters or non-quotes, ending in a quote):

"(\.|[^\"])*"

Integer literal:

[0-9]+

Decimal integer literal (no leading zero):

[1-9][0-9]*|0

Hexadecimal integer literal:

0[Xx][0-9A-Fa-f]+

Octal integer literal:

0[0-7]+

Identifier:

[A-Za-z_$][A-Za-z0-9_$]*

See also

Related Research Articles

Regular expression Sequence of characters that forms a search pattern

A regular expression is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

In computing and telecommunication, an escape character is a character that invokes an alternative interpretation on the following characters in a character sequence. An escape character is a particular case of metacharacters. Generally, the judgement of whether something is an escape character or not depends on the context.

In a computer language, a reserved word is a word that cannot be used as an identifier, such as the name of a variable, function, or label – it is "reserved from use". This is a syntactic definition, and a reserved word may have no meaning.

S-expression data serialization format

In computer programming, S-expressions are a notation for nested list (tree-structured) data, invented for and popularized by the programming language Lisp, which uses them for source code as well as data. In the usual parenthesized syntax of Lisp, an S-expression is classically defined as

  1. an atom, or
  2. an expression of the form (x. y) where x and y are S-expressions.

In computer science, Backus–Naur form or Backus normal form (BNF) is a notation technique for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets and communication protocols. They are applied wherever exact descriptions of languages are needed: for instance, in official language specifications, in manuals, and in textbooks on programming language theory.

In computer science, extended Backus–Naur form (EBNF) is a family of metasyntax notations, any of which can be used to express a context-free grammar. EBNF is used to make a formal description of a formal language such as a computer programming language. They are extensions of the basic Backus–Naur form (BNF) metasyntax notation.

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters into a sequence of tokens. A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

Lex is a computer program that generates lexical analyzers.

A string literal or anonymous string is a type of literal in programming for the representation of a string value within the source code of a computer program. Most often in modern languages this is a quoted sequence of characters, as in x = "foo", where "foo" is a string literal with value foo – the quotes are not part of the value, and one must use a method such as escape sequences to avoid the problem of delimiter collision and allow the delimiters themselves to be embedded in a string. However, there are numerous alternate notations for specifying string literals, particularly more complicated cases, and the exact notation depends on the individual programming language in question. Nevertheless, there are some general guidelines that most modern programming languages follow.

In computer science, primitive data type is either of the following:

The syntax of the C programming language is the set of rules governing writing of software in the language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

In computing, a here document is a file literal or input stream literal: it is a section of a source code file that is treated as if it were a separate file. The term is also used for a form of multiline string literals that use similar syntax, preserving line breaks and other whitespace in the text.

In computer science, a literal is a notation for representing a fixed value in source code. Almost all programming languages have notations for atomic values such as integers, floating-point numbers, and strings, and usually for booleans and characters; some also have notations for elements of enumerated types and compound values such as arrays, records, and objects. An anonymous function is a literal for the function type.

Syntax (programming languages) Rules of how a program is written in a given programming language

In computer science, the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

In computer language design, stropping is a method of explicitly marking letter sequences as having a special property, such as being a keyword, or a certain type of variable or storage location, and thus inhabiting a different namespace from ordinary names ("identifiers"), in order to avoid clashes. Stropping is not used in most modern languages – instead, keywords are reserved words and cannot be used as identifiers. Stropping allows the same letter sequence to be used both as a keyword and as an identifier, and simplifies parsing in that case – for example allowing a variable named if without clashing with the keyword if.

The syntax of JavaScript is the set of rules that define a correctly structured JavaScript program.

Wirth syntax notation (WSN) is a metasyntax, that is, a formal way to describe formal languages. Originally proposed by Niklaus Wirth in 1977 as an alternative to Backus–Naur form (BNF). It has several advantages over BNF in that it contains an explicit iteration construct, and it avoids the use of an explicit symbol for the empty string.

Escape sequences are used in the programming languages C and C++, and their design was copied in many other languages such as Java and C#. An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly.

In computer science, an integer literal is a kind of literal for an integer whose value is directly represented in source code. For example, in the assignment statement x = 1, the string 1 is an integer literal indicating the value 1, while in the statement x = 0x10 the string 0x10 is an integer literal indicating the value 16, which is represented by 10 in hexadecimal.

In computer languages, identifiers are tokens which name the language entities. Some of the kinds of entities an identifier might denote include variables, types, labels, subroutines, and packages.

References

  1. Buyya (2009). Object-oriented Programming with Java: Essentials and Applications. Tata McGraw-Hill Education. pp. 57–. ISBN   978-0-07-066908-6.
  2. James Gosling (2000). The Java Language Specification. Addison-Wesley Professional. pp. 9–. ISBN   978-0-201-31008-5.