Stropping (syntax)

In computer language design, stropping is a method of explicitly marking letter sequences as having a special property, such as being a keyword, or a certain type of variable or storage location, and thus inhabiting a different namespace from ordinary names ("identifiers"), in order to avoid clashes. Stropping is not used in most modern languages – instead, keywords are reserved words and cannot be used as identifiers. Stropping allows the same letter sequence to be used both as a keyword and as an identifier, and simplifies parsing in that case – for example allowing a variable named if without clashing with the keyword if.

Stropping is primarily associated with ALGOL and related languages in the 1960s. Though it finds some modern use, it is easily confused with other techniques that are superficially similar.

History

The method of stropping and the term "stropping" arose in the development of ALGOL in the 1960s, where it was used to represent typographical distinctions (boldface and underline) found in the publication language which could not be represented directly in the hardware language: a typewriter could produce bold characters, but the encoding used on punch cards had none. The term "stropping" arose in ALGOL 60, from "apostrophe", as some implementations of ALGOL 60 used apostrophes around text to indicate boldface, [1] such as 'if' to represent the keyword if. Stropping is also important in ALGOL 68, where multiple methods of stropping, known as "stropping regimes", are used; the original matched apostrophes from ALGOL 60 were not widely used, with a leading period or uppercase being more common, [2] as in .IF or IF, and the term "stropping" was applied to all of these.

Syntaxes

A range of different syntaxes for stropping have been used:

Several stropping conventions were often in use within a single language. For example, in ALGOL 68, the choice of stropping convention can be specified by a compiler directive (in ALGOL terminology, a "pragmat"), namely POINT, UPPER, QUOTE, or RES.

The various stropping regimes are lexical specifications for stropped characters, though in some cases these have simple interpretations: in the single apostrophe and dot regimes, the first character functions as an escape character, while in the matched apostrophes regime the apostrophes function as delimiters, as in string literals.
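As a rough illustration of the escape-versus-delimiter distinction, the following Python sketch tokenizes words under three of the regimes. The names tokenize and PATTERNS, the category labels, and the regular expressions are assumptions made for this example; it handles only alphabetic words and is not the lexical grammar of any ALGOL report.

```python
import re

# Hypothetical, simplified patterns: one per stropping regime.
PATTERNS = {
    # Matched apostrophes act as delimiters around the bold word.
    "QUOTE": r"'(?P<strop>[A-Za-z]+)'|(?P<plain>[A-Za-z]+)",
    # A leading dot acts as an escape character for the word that follows.
    "POINT": r"\.(?P<strop>[A-Za-z]+)|(?P<plain>[A-Za-z]+)",
    # Letter case carries the distinction: uppercase words are bold.
    "UPPER": r"(?P<strop>[A-Z]+)|(?P<plain>[a-z]+)",
}


def tokenize(source, regime):
    """Return (category, value) pairs for the words in `source`.

    Operators, digits, and punctuation are ignored in this sketch.
    """
    tokens = []
    for match in re.finditer(PATTERNS[regime], source):
        if match.group("strop") is not None:
            # Unstrop: keep only the letters, normalized to lowercase.
            tokens.append(("bold", match.group("strop").lower()))
        else:
            tokens.append(("identifier", match.group("plain").lower()))
    return tokens


print(tokenize("'if' if", "QUOTE"))
print(tokenize(".IF IF", "POINT"))
print(tokenize("IF if", "UPPER"))
```

Under these assumptions all three calls yield the same pair of tokens, a bold word and an identifier that share the value if, which is the point of stropping: one spelling, two namespaces.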

Other examples:

Examples of different ALGOL 68 styles

Note the leading pr (abbreviation of pragmat) directive, which is itself stropped in POINT or quote style, and the ¢ for comment (from "") – see ALGOL 68: pr & co: Pragmats and Comments for details.

ALGOL 68 "strict", as typically published (bold is shown here between ''' marks and italic between '' marks):

    ''¢ underline or''
       ''bold typeface ¢''
    '''mode''' '''xint''' = '''int''';
    '''xint''' sum sq:=0;
    '''for''' i '''while'''
      sum sq≠70×70
    '''do'''
      sum sq+:=i↑2
    '''od'''

Quote stropping (like wikitext):

    'pr' quote 'pr'
    'mode' 'xint' = 'int';
    'xint' sum sq:=0;
    'for' i 'while'
      sum sq≠70×70
    'do'
      sum sq+:=i↑2
    'od'

For a 7-bit character code compiler:

    .PR UPPER .PR
    MODE XINT = INT;
    XINT sum sq:=0;
    FOR i WHILE
      sum sq/=70*70
    DO
      sum sq+:=i**2
    OD

For a 6-bit character code compiler:

    .PR POINT .PR
    .MODE .XINT = .INT;
    .XINT SUM SQ:=0;
    .FOR I .WHILE
      SUM SQ .NE 70*70
    .DO
      SUM SQ .PLUSAB I .UP 2
    .OD

ALGOL 68 using res stropping (reserved word):

    .PR RES .PR
    mode .xint = int;
    .xint sum sq:=0;
    for i while
      sum sq≠70×70
    do
      sum sq+:=i↑2
    od

Other languages

Fortran 77 delimits its logical constants and its relational and logical operators with periods: .TRUE., .FALSE., .EQ., .NE., .LT., .LE., .GT., .GE., .EQV., .NEQV., .OR., .AND., .NOT. [5]

.AND., .OR. and .XOR. are also used in combined tests in IF and IFF statements in batch files run under JP Software's command line processors like 4DOS, [6] 4OS2, and 4NT / Take Command.


Modern use

To indicate identifiers

Most modern computer languages do not use stropping. However, some languages support optional stropping to specify identifiers that would otherwise collide with reserved words or which contain non-alphanumeric characters.

For example, because many languages target Microsoft's .NET Common Language Infrastructure (CLI), a program needs a way to refer to names defined in another language that happen to be keywords in the calling language. This is sometimes done with a prefix, such as @ in C#, or by enclosing the identifier in square brackets, as in Visual Basic .NET.

A second major example is in many implementations of Structured Query Language. In those languages reserved words can be used as column, table, or variable names by lexically delimiting them. The standard specifies enclosing reserved words in double quotes, but in practice the exact mechanism varies by implementation; MySQL, for example, allows reserved words to be used in other contexts by enclosing them in backticks, and Microsoft SQL Server uses square brackets.
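The standard double-quote form can be tried with SQLite through Python's built-in sqlite3 module. SQLite is not one of the systems named above and is used here only because it ships with Python; the function name quoted_identifier_demo and the table and column names are made up for the example.

```python
import sqlite3


def quoted_identifier_demo():
    """Create and query a table whose names are SQL reserved words."""
    conn = sqlite3.connect(":memory:")
    try:
        # "select" and "from" are reserved words; the double quotes mark
        # them as identifiers rather than keywords.
        conn.execute('CREATE TABLE "select" ("from" INTEGER)')
        conn.execute('INSERT INTO "select" ("from") VALUES (42)')
        row = conn.execute('SELECT "from" FROM "select"').fetchone()
        return row[0]
    finally:
        conn.close()


print(quoted_identifier_demo())
```

Without the quotes the first statement is a syntax error, since the parser would read select and from as keywords.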

In several languages, including Nim, R, [7] and Scala, [8] a reserved word or non-alphanumeric name can be used as an identifier by enclosing it in backticks.

There are other, more minor examples. For example, Web IDL uses a leading underscore _ to strop identifiers that otherwise collide with reserved words: the value of the identifier strips this leading underscore, making this stropping, rather than a naming convention. [9]

Other purposes

In Haskell, surrounding a function name by backticks causes it to be parsed as an infix operator.

Unstropping by the compiler

In a compiler front end, unstropping originally occurred during an initial line reconstruction phase, which also eliminated whitespace. This was then followed by scannerless parsing (no tokenization); this was standard in the 1960s, notably for ALGOL. In modern use, unstropping is generally done as part of lexical analysis. This is clear if one divides the lexer into two phases, scanner and evaluator: the scanner assigns the stropped sequence to the correct category, and then the evaluator unstrops it when calculating the value. For example, in a language where an initial underscore is used to strop identifiers to avoid collisions with reserved words, the sequence _if would be categorized as an identifier (not as the reserved word if) by the scanner, and then the evaluator would give this the value if, yielding (Identifier, if) as the token type and value.
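The two-phase view can be sketched in a few lines of Python. The reserved-word set and the function names scan, evaluate, and lex are hypothetical, and the sketch works on a single already-isolated lexeme rather than a character stream.

```python
RESERVED = {"if", "then", "else"}  # hypothetical reserved words


def scan(lexeme):
    """Scanner phase: classify the raw lexeme without altering it."""
    return "Keyword" if lexeme in RESERVED else "Identifier"


def evaluate(category, lexeme):
    """Evaluator phase: compute the token value, removing the strop."""
    if category == "Identifier" and lexeme.startswith("_"):
        return lexeme[1:]  # drop a single leading underscore
    return lexeme


def lex(lexeme):
    """Combine both phases into a (category, value) token."""
    category = scan(lexeme)
    return (category, evaluate(category, lexeme))


print(lex("_if"))  # ('Identifier', 'if')
print(lex("if"))   # ('Keyword', 'if')
```

Because the scanner sees _if rather than if, the reserved-word check never fires; the underscore is removed only afterwards, when the value is computed.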

Similar techniques

A number of similar techniques exist, generally prefixing or suffixing an identifier to indicate different treatment, but the semantics are varied. Strictly speaking, stropping consists of different representations of the same name (value) in different namespaces, and occurs at the tokenization stage. For example, in ALGOL 60 with matched apostrophe stropping, 'if' is tokenized as (Keyword, if), while if is tokenized as (Identifier, if) – same value in different token classes.

Using uppercase for keywords remains in use as a convention for writing grammars for lexing and parsing – tokenizing the reserved word if as the token class IF, and then representing an if-then-else clause by the phrase IF Expression THEN Statement ELSE Statement where uppercase terms are keywords and capitalized terms are nonterminal symbols in a production rule (terminal symbols are denoted by lowercase terms, such as identifier or integer, for an integer literal).

Naming conventions

Most loosely, one may use naming conventions to avoid clashes, commonly prefixing or suffixing with an underscore, as in if_ or _then. A leading underscore is often used to indicate private members in object-oriented programming.

These names may be interpreted by the compiler and have some effect, though this is generally done at the semantic analysis phase, not the tokenization phase. For example, in Python, a single leading underscore is a weak private indicator, and affects which identifiers are imported on module import, while a double leading underscore (and no more than one trailing underscore) on a class attribute invokes name mangling. [10]
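The Python behaviour can be observed directly. In this small sketch the class name Account and its attribute names are invented for illustration.

```python
class Account:
    def __init__(self):
        self._hint = "weak private"  # single underscore: convention only
        self.__secret = "mangled"    # double underscore: name mangling applies


acct = Account()
print(acct._hint)                 # accessible exactly as written
print(acct._Account__secret)      # stored under the mangled name
print(hasattr(acct, "__secret"))  # False: no attribute has the unmangled name
```

The tokenizer treats _hint and __secret as ordinary identifiers; the rewriting to _Account__secret happens later, when the class body is compiled, which is why this is not stropping.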

Reserved words

While modern languages generally use reserved words rather than stropping to distinguish keywords from identifiers – e.g., making if reserved – they also frequently reserve a syntactic class of identifiers as keywords, yielding representations which can be interpreted as a stropping regime, but instead have the semantics of reserved words.

This is most notable in C, where identifiers that begin with an underscore are reserved, though the precise details of which identifiers are reserved at which scope are intricate, and identifiers with leading double underscores are reserved for any use; [11] similarly in C++ any identifier that contains a double underscore is reserved for any use, while an identifier that begins with an underscore is reserved in the global namespace. [nb 1] Thus one can add a new keyword foo using the reserved word __foo. While this is superficially similar to stropping, the semantics are different. As a reserved word, the string __foo represents the identifier __foo in the common identifier namespace. In stropping (by prefixing keywords with __), the string __foo represents the keyword foo in a separate keyword namespace. Thus using reserved words, the tokens for __foo and foo are (identifier, __foo) and (identifier, foo) – different values in the same category – while in stropping the tokens for __foo and foo are (keyword, foo) and (identifier, foo) – same values in different categories. These solve the same problem of namespace clashes in a way that looks the same to a programmer, but which differs in terms of formal grammar and implementation.
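The contrast between the two schemes can be made concrete with a pair of toy tokenizers in Python. Both functions are hypothetical and model only the token pairs described above, not how any C or C++ compiler is written.

```python
def tokenize_reserved(word):
    """Reserved-word scheme: __foo stays an identifier with its full spelling."""
    return ("identifier", word)


def tokenize_stropped(word):
    """Stropping scheme: a leading __ moves the word into the keyword namespace."""
    if word.startswith("__"):
        return ("keyword", word[2:])  # the strop is not part of the value
    return ("identifier", word)


for word in ("__foo", "foo"):
    print(word, tokenize_reserved(word), tokenize_stropped(word))
```

Under the reserved-word scheme the two inputs differ in value but share a category; under stropping they share a value but differ in category.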

Name mangling

Name mangling also addresses name clashes by renaming identifiers, but does this much later in compilation, during semantic analysis, not during tokenization. This consists of creating names that include scope and type information, primarily for use by linkers, both to avoid clashes and to include necessary semantic information in the name itself. In these cases the original identifiers may be identical, but the context is different, as in the functions foo(int x) versus foo(char x), in both cases having the same identifier foo, but different signature. These names might be mangled to foo_i and foo_c, for instance, to include the type information.
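A toy mangler that reproduces the foo_i and foo_c names might look like the following Python sketch. The type-code table and the function name mangle are assumptions chosen to match the example; real compilers use considerably more elaborate encodings.

```python
# Hypothetical one-letter codes, chosen to match the foo_i / foo_c example.
TYPE_CODES = {"int": "i", "char": "c"}


def mangle(name, param_types):
    """Append one code letter per parameter type to the function name."""
    suffix = "".join(TYPE_CODES[t] for t in param_types)
    return f"{name}_{suffix}"


print(mangle("foo", ["int"]))   # foo_i
print(mangle("foo", ["char"]))  # foo_c
```

The input here is a name plus type information, which only exists after parsing and type analysis; a tokenizer could not perform this renaming.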

Sigils

Sigils are a syntactically similar but semantically different phenomenon: they indicate properties of variables. They are common in BASIC, Perl, Ruby, and various other languages to identify characteristics of variables and constants: BASIC and Perl use them to designate the type of a variable, and Ruby uses them both to distinguish variables from constants and to indicate scope. Note that this affects the semantics of the variable, not the syntactic question of whether it is an identifier or a keyword.

Parallels in human language

Stropping is used in computer programming languages to make the compiler's (or more strictly, the parser's) job easier, that is, to keep it within the capability of the relatively small and slow computers available in the early days of computing in the 20th century. However, similar techniques have commonly been used to aid reading comprehension for people too. Some examples are:

Notes

  1. There are other restrictions, such as an identifier that begins with an underscore, followed by an uppercase letter.

References

  1. King, Peter R., ed. (1974-06-18). "(unknown)". Proceedings of an International Conference on ALGOL 68 Implementation. Department of Computer Science, University of Manitoba, Winnipeg: University of Manitoba, Department of Computer Science: 148. ISBN 9780919628113. More serious problems are posed by "stropping", the technique used to distinguish boldface text from roman text. Some implementations demand apostrophes around boldface (whence the name stropping); others require backspacing and underlining; [...]
  2. van Wijngaarden, Adriaan; Mailloux, Barry James; Peck, John Edward Lancelot; Koster, Cornelis Hermanus Antonius; Sintzoff, Michel; Lindsey, Charles Hodgson; Meertens, Lambert Guillaume Louis Théodore; Fisker, Richard G., eds. (1976). "Section 9.3 Representations" (PDF). Revised Report on the Algorithmic Language ALGOL 68. Springer-Verlag. pp. 94, 123. ISBN 978-0-387-07592-1. OCLC 1991170. Archived (PDF) from the original on 2019-04-19. Retrieved 2019-05-11.
  3. http://www.fh-jena.de/~kleine/history/languages/Algol68-RR-HardwareRepresentation.pdf [dead link]
  4. Lindsey, Charles Hodgson; van der Meulen, Sietse G. (1977). Informal Introduction to ALGOL 68. North-Holland. pp. 348–349. ISBN 978-0-7204-0726-6. OCLC 230034877.
  5. "Logical Structures".
  6. Brothers, Hardin; Rawson, Tom; Conn, Rex C.; Paul, Matthias R.; Dye, Charles E.; Georgiev, Luchezar I. (2002-02-27). 4DOS 8.00 online help.
  7. R Core Team, Quotes: Quotes, R Foundation for Statistical Computing.
  8. Odersky, Martin (2011-05-24), The Scala Language Specification Version 2.9
  9. Web IDL, "3.1. Names". [...] For all of these constructs, the identifier is the value of the identifier token with any single leading U+005F LOW LINE ("_") character (underscore) removed. [...] Note [...] A leading "_" is used to escape an identifier from looking like a reserved word so that, for example, an interface named "interface" can be defined. The leading "_" is dropped to unescape the identifier. [...]
  10. PEP 008: Descriptive: Naming Styles
  11. C99 standard, 7.1.3 Reserved identifiers
  12. Twyman, Michael. "The Bold Idea: The Use of Bold-looking Types in the Nineteenth Century". Journal of the Printing Historical Society. 22 (107–143).
  13. Truss, Lynne (2004), Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation, New York: Gotham Books, p. 146, ISBN 978-1-59240-087-4
  14. "Styles of Handwriting". Rigsarkivet. The Danish National Archives. Retrieved 2017-03-26.
  15. "How to Write Scientific Names of Organisms" (PDF), Competition Science Vision, retrieved 2011-06-20.
  16. A Selection of Legal Maxims, classified and illustrated at Google Books
  17. Dual 大辞林
    「平」とは平凡な、やさしいという意で、当時普通に使用する文字体系であったことを意味する。 漢字は書簡文や重要な文章などを書く場合に用いる公的な文字であるのに対して、 平仮名は漢字の知識に乏しい人々などが用いる私的な性格のものであった。
    Translation: 平 [the "hira" part of "hiragana"] means "ordinary" or "simple" since at that time [the time that the name was given] it was a writing system for everyday use. While kanji was the official system used for letter-writing and important texts, hiragana was for personal use by people who had limited knowledge of kanji.
  18. "Japanese calligraphy". Encyclopedia Britannica. Retrieved 2017-06-22.
  19. "Hiragana, Katakana & Kanji". Japanese Word Characters. 2010-09-08. Retrieved 2011-10-15.
