Programming language specification

Last updated

In computer programming, a programming language specification (or standard or definition) is a documentation artifact that defines a programming language so that users and implementors can agree on what programs in that language mean. Specifications are typically detailed and formal, and primarily used by implementors, with users referring to them in case of ambiguity; the C++ specification is frequently cited by users, for instance, due to the complexity. Related documentation includes a programming language reference, which is intended expressly for users, and a programming language rationale, which explains why the specification is written as it is; these are typically more informal than a specification.

Contents

Standardization

Not all major programming languages have specifications, and languages can exist and be popular for decades without a specification. A language may have one or more implementations, whose behavior acts as a de facto standard, without this behavior being documented in a specification. Perl (through Perl 5) is a notable example of a language without a specification, while PHP was only specified in 2014, after being in use for 20 years. [1] A language may be implemented and then specified, or specified and then implemented, or these may develop together, which is usual practice today. This is because implementations and specifications provide checks on each other: writing a specification requires precisely stating the behavior of an implementation, and implementation checks that a specification is possible, practical, and consistent. Writing a specification before an implementation has largely been avoided since ALGOL 68 (1968), due to unexpected difficulties in implementation when implementation is deferred. However, languages are still occasionally implemented and gain popularity without a formal specification: an implementation is essential for use, while a specification is desirable but not essential (informally, "code talks").

ALGOL 68 was the first (and possibly one of the last) major language for which a full formal definition was made before it was implemented.

Forms

A programming language specification can take several forms, including the following:

Syntax

The syntax of a programming language represents the definition of acceptable words, i.e. formal parameters and rules upon which to decide whether a given code is valid in respect to the language. On that note, the language syntax usually consists of a combination of the following three construction components:

Syntax specification generally supposes a natural language description in order to provide modest comprehensibility. However, the formal representation of the above outlined components is usually part of the section as it favors the implementation and approval of the language and its concepts.

Semantics

Formulating a rigorous semantics of a large, complex, practical programming language is a daunting task even for experienced specialists, and the resulting specification can be difficult for anyone but experts to understand. The following are some of the ways in which programming language semantics can be described; all languages use at least one of these description methods, and some languages combine more than one [5]

Natural language

Most widely used languages are specified using natural language descriptions of their semantics. This description usually takes the form of a reference manual for the language. These manuals can run to hundreds of pages, e.g., the print version of The Java Language Specification, 3rd Ed. is 596 pages long.

The imprecision of natural language as a vehicle for describing programming language semantics can lead to problems with interpreting the specification. For example, the semantics of Java threads were specified in English, and it was later discovered that the specification did not provide adequate guidance for implementors. [6]

Formal semantics

Formal semantics are grounded in mathematics. As a result, they can be more precise and less ambiguous than semantics given in natural language. However, supplemental natural language descriptions of the semantics are often included to aid understanding of the formal definitions. For example, The ISO Standard for Modula-2 contains both a formal and a natural language definition on opposing pages.

Programming languages whose semantics are described formally can reap many benefits. For example:

Automatic tool support can help to realize some of these benefits. For example, an automated theorem prover or theorem checker can increase a programmer's (or language designer's) confidence in the correctness of proofs about programs (or the language itself). The power and scalability of these tools varies widely: full formal verification is computationally intensive, rarely scales beyond programs containing a few hundred lines[ citation needed ] and may require considerable manual assistance from a programmer; more lightweight tools such as model checkers require fewer resources and have been used on programs containing tens of thousands of lines; many compilers apply static type checks to any program they compile.

Reference implementation

A reference implementation is a single implementation of a programming language that is designated as authoritative. The behavior of this implementation is held to define the proper behavior of a program written in the language. This approach has several attractive properties. First, it is precise, and requires no human interpretation: disputes as to the meaning of a program can be settled simply by executing the program on the reference implementation (provided that the implementation behaves deterministically for that program).

On the other hand, defining language semantics through a reference implementation also has several potential drawbacks. Chief among them is that it conflates limitations of the reference implementation with properties of the language. For example, if the reference implementation has a bug, then that bug must be considered to be an authoritative behavior. Another drawback is that programs written in this language may rely on quirks in the reference implementation, hindering portability across different implementations.

Nevertheless, several languages have successfully used the reference implementation approach. For example, the Perl interpreter is considered to define the authoritative behavior of Perl programs. In the case of Perl, the open-source model of software distribution has contributed to the fact that nobody has ever produced another implementation of the language, so the issues involved in using a reference implementation to define the language semantics are moot.

Test suite

Defining the semantics of a programming language in terms of a test suite involves writing a number of example programs in the language, and then describing how those programs ought to behave perhaps by writing down their correct outputs. The programs, plus their outputs, are called the "test suite" of the language. Any correct language implementation must then produce exactly the correct outputs on the test suite programs.

The chief advantage of this approach to semantic description is that it is easy to determine whether a language implementation passes a test suite. The user can simply execute all the programs in the test suite, and compare the outputs to the desired outputs. However, when used by itself, the test suite approach has major drawbacks as well. For example, users want to run their own programs, which are not part of the test suite; indeed, a language implementation that could only run the programs in its test suite would be largely useless. But a test suite does not, by itself, describe how the language implementation should behave on any program not in the test suite; determining that behavior requires some extrapolation on the implementor's part, and different implementors may disagree. In addition, it is difficult to use a test suite to test behavior that is intended or allowed to be nondeterministic.

Therefore, in common practice, test suites are used only in combination with one of the other language specification techniques, such as a natural language description or a reference implementation.

See also

Language specifications

A few examples of official or draft language specifications:

Notes

  1. Announcing a specification for PHP, July 30, 2014, Joel Marcey
  2. "A Shorter History of Algol68". Archived from the original on August 10, 2006. Retrieved September 15, 2006.
  3. Milner, R.; M. Tofte; R. Harper; D. MacQueen (1997). The Definition of Standard ML (Revised). MIT Press. ISBN   0-262-63181-4.
  4. Kelsey, Richard; William Clinger; Jonathan Rees (February 1998). "Section 7.2 Formal semantics". Revised5 Report on the Algorithmic Language Scheme. Retrieved 2006-06-09.
  5. Jones, D. (2008). Forms of language specification (PDF). Retrieved 2012-06-23.
  6. William Pugh. The Java Memory Model is Fatally Flawed. Concurrency: Practice and Experience 12(6):445-455, August 2000

Related Research Articles

In computer science, an abstract data type (ADT) is a mathematical model for data types, defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This mathematical model contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user. For example, a stack has push/pop operations that follow a Last-In-First-Out rule, and can be concretely implemented using either a list or an array. Another example is a set which stores values, without any particular order, and no repeated values. Values themselves are not retrieved from sets, rather one tests a value for membership to obtain a Boolean "in" or "not in".

<span class="mw-page-title-main">Programming language</span> Language for communicating instructions to a machine

A programming language is a system of notation for writing computer programs.

In computer science, pseudocode is a description of the steps in an algorithm using a mix of conventions of programming languages with informal, usually self-explanatory, notation of actions and conditions. Although pseudocode shares features with regular programming languages, it is intended for human reading rather than machine control. Pseudocode typically omits details that are essential for machine implementation of the algorithm. The programming language is augmented with natural language description details, where convenient, or with compact mathematical notation. The purpose of using pseudocode is that it is easier for people to understand than conventional programming language code, and that it is an efficient and environment-independent description of the key principles of an algorithm. It is commonly used in textbooks and scientific publications to document algorithms and in planning of software and other algorithms.

In computer science, Backus–Naur form is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.

In computer engineering, a hardware description language (HDL) is a specialized computer language used to describe the structure and behavior of electronic circuits, most commonly to design ASICs and program FPGAs.

Operational semantics is a category of formal programming language semantics in which certain desired properties of a program, such as correctness, safety or security, are verified by constructing proofs from logical statements about its execution and procedures, rather than by attaching mathematical meanings to its terms. Operational semantics are classified in two categories: structural operational semantics formally describe how the individual steps of a computation take place in a computer-based system; by opposition natural semantics describe how the overall results of the executions are obtained. Other approaches to providing a formal semantics of programming languages include axiomatic semantics and denotational semantics.

The printf family of functions in the C programming language are a set of functions that take a format string as input among a variable sized list of other values and produce as output a string that corresponds to the format specifier and given input values. The string is written in a simple template language: characters are usually copied literally into the function's output, but format specifiers, which start with a % character, indicate the location and method to translate a piece of data to characters. The design has been copied to expose similar functionality in other programming languages.

A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging from widely used languages for common domains, such as HTML for web pages, down to languages used by only one or a few pieces of software, such as MUSH soft code. DSLs can be further subdivided by the kind of language, and include domain-specific markup languages, domain-specific modeling languages, and domain-specific programming languages. Special-purpose computer languages have always existed in the computer age, but the term "domain-specific language" has become more popular due to the rise of domain-specific modeling. Simpler DSLs, particularly ones used by a single application, are sometimes informally called mini-languages.

ALGOL 60 is a member of the ALGOL family of computer programming languages. It followed on from ALGOL 58 which had introduced code blocks and the begin and end pairs for delimiting them, representing a key advance in the rise of structured programming. ALGOL 60 was one of the first languages implementing function definitions. ALGOL 60 function definitions could be nested within one another, with lexical scope. It gave rise to many other languages, including CPL, PL/I, Simula, BCPL, B, Pascal, and C. Practically every computer of the era had a systems programming language based on ALGOL 60 concepts.

<span class="mw-page-title-main">ALGOL 68</span> Programming language

ALGOL 68 is an imperative programming language that was conceived as a successor to the ALGOL 60 programming language, designed with the goal of a much wider scope of application and more rigorously defined syntax and semantics.

In computer programming, operators are constructs defined within programming languages which behave generally like functions, but which differ syntactically or semantically.

In computer programming, a directive or pragma is a language construct that specifies how a compiler should process its input. Depending on the programming language, directives may or may not be part of the grammar of the language and may vary from compiler to compiler. They can be processed by a preprocessor to specify compiler behavior, or function as a form of in-band parameterization.

In computer programming, a statement is a syntactic unit of an imperative programming language that expresses some action to be carried out. A program written in such a language is formed by a sequence of one or more statements. A statement may have internal components.

In computer science, a Van Wijngaarden grammar is a formalism for defining formal languages. The name derives from the formalism invented by Adriaan van Wijngaarden for the purpose of defining the ALGOL 68 programming language. The resulting specification remains its most notable application.

In computer science, formal specifications are mathematically based techniques whose purpose are to help with the implementation of systems and software. They are used to describe a system, to analyze its behavior, and to aid in its design by verifying key properties of interest through rigorous and effective reasoning tools. These specifications are formal in the sense that they have a syntax, their semantics fall within one domain, and they are able to be used to infer useful information.

In software engineering, behavior-driven development (BDD) is a software development process that encourages collaboration among developers, quality assurance experts, and customer representatives in a software project. It encourages teams to use conversation and concrete examples to formalize a shared understanding of how the application should behave. It emerged from test-driven development (TDD) and can work alongside an agile software development process. Behavior-driven development combines the general techniques and principles of TDD with ideas from domain-driven design and object-oriented analysis and design to provide software development and management teams with shared tools and a shared process to collaborate on software development.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

Programming languages are used for controlling the behavior of a machine. Like natural languages, programming languages follow rules for syntax and semantics.

The Test Anything Protocol (TAP) is a protocol to allow communication between unit tests and a test harness. It allows individual tests to communicate test results to the testing harness in a language-agnostic way. Originally developed for unit testing of the Perl interpreter in 1987, producers and parsers are now available for many development platforms.

The JAUS Tool Set (JTS) is a software engineering tool for the design of software services used in a distributed computing environment. JTS provides a Graphical User Interface (GUI) and supporting tools for the rapid design, documentation, and implementation of service interfaces that adhere to the Society of Automotive Engineers' standard AS5684A, the JAUS Service Interface Design Language (JSIDL). JTS is designed to support the modeling, analysis, implementation, and testing of the protocol for an entire distributed system.