Integer literal

In computer science, an integer literal is a kind of literal for an integer whose value is directly represented in source code. For example, in the assignment statement x = 1, the string 1 is an integer literal indicating the value 1, while in the statement x = 0x10 the string 0x10 is an integer literal indicating the value 16, which is represented by 10 in hexadecimal (indicated by the 0x prefix).

By contrast, in x = cos(0), the expression cos(0) evaluates to 1 (as the cosine of 0), but the value 1 is not literally included in the source code. More simply, in x = 2 + 2, the expression 2 + 2 evaluates to 4, but the value 4 is not literally included. Further, in x = "1" the "1" is a string literal, not an integer literal, because it is in quotes. The value of the string is 1, which happens to be an integer string, but this is semantic analysis of the string literal – at the syntactic level "1" is simply a string, no different from "foo".
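The distinction between integer literals, string literals, and computed values can be observed with a tokenizer. The following is a minimal sketch using Python's standard tokenize module; the helper name token_kinds is hypothetical and chosen only for illustration.

```python
import io
import tokenize


def token_kinds(source: str) -> list[tuple[str, str]]:
    """Return (token class, token text) for each numeric or string literal."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokens
        if tok.type in (tokenize.NUMBER, tokenize.STRING)
    ]


print(token_kinds("x = 0x10\n"))    # [('NUMBER', '0x10')]
print(token_kinds('x = "1"\n'))     # [('STRING', '"1"')]
print(token_kinds("x = 2 + 2\n"))   # [('NUMBER', '2'), ('NUMBER', '2')]
print(token_kinds("x = cos(0)\n"))  # [('NUMBER', '0')]
```

In the last two cases the resulting values 4 and 1 never appear as tokens; only the literals 2 and 0 do.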

Parsing

Recognizing a string (sequence of characters in the source code) as an integer literal is part of the lexical analysis (lexing) phase, while evaluating the literal to its value is part of the semantic analysis phase. Within the lexer and phrase grammar, the token class is often denoted integer, with the lowercase indicating a lexical-level token class, as opposed to a phrase-level production rule (such as ListOfIntegers). Once a string has been lexed (tokenized) as an integer literal, its value cannot be determined syntactically (to the parser it is just an integer), and evaluation of its value becomes a semantic question.

Integer literals are generally lexed with regular expressions, as in Python. [1]
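As an illustration, a regular expression for integer literals might look like the sketch below. The pattern is a simplified approximation of Python 3 rules (base prefixes, single underscores between digits); it is not the expression used by any particular implementation, and the names INTEGER and is_integer_literal are assumptions made for this example.

```python
import re

# Simplified pattern: one alternative per base, with optional single
# underscores between digits (and directly after a base prefix).
INTEGER = re.compile(
    r"""
      0[xX](?:_?[0-9a-fA-F])+   # hexadecimal, e.g. 0x10
    | 0[oO](?:_?[0-7])+         # octal, e.g. 0o17
    | 0[bB](?:_?[01])+          # binary, e.g. 0b0100_1100
    | [1-9](?:_?[0-9])*         # nonzero decimal, e.g. 1_000_000
    | 0(?:_?0)*                 # zero
    """,
    re.VERBOSE,
)


def is_integer_literal(text: str) -> bool:
    """Return True if the whole string lexes as one integer literal."""
    return INTEGER.fullmatch(text) is not None


print(is_integer_literal("0x10"))       # True
print(is_integer_literal("1_000_000"))  # True
print(is_integer_literal('"1"'))        # False: a string literal
print(is_integer_literal("0x"))         # False: prefix without digits
```

The pattern only classifies the text; it says nothing about the value, which is left to a later evaluation step.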

Evaluation

As with other literals, integer literals are generally evaluated at compile time, as part of the semantic analysis phase. In some cases this semantic analysis is done in the lexer, immediately on recognition of an integer literal, while in other cases this is deferred until the parsing stage, or until after the parse tree has been completely constructed. For example, on recognizing the string 0x10 the lexer could immediately evaluate this to 16 and store that (a token of type integer and value 16), or defer evaluation and instead record a token of type integer and value 0x10.
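The two strategies can be sketched as follows. The Token class and the functions lex_eager, lex_deferred and evaluate are hypothetical names introduced for this example; the base is inferred from the prefix by Python's built-in int with base 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str      # token class, e.g. "integer"
    value: object  # either the evaluated int or the raw source text


def lex_eager(text: str) -> Token:
    """Evaluate the literal immediately, inside the lexer."""
    return Token("integer", int(text, 0))  # base 0: infer base from prefix


def lex_deferred(text: str) -> Token:
    """Keep the raw text; a later phase will evaluate it."""
    return Token("integer", text)


def evaluate(token: Token) -> int:
    """Later semantic phase: produce the value of an integer token."""
    if isinstance(token.value, int):
        return token.value
    return int(token.value, 0)


print(lex_eager("0x10"))               # Token(kind='integer', value=16)
print(lex_deferred("0x10"))            # Token(kind='integer', value='0x10')
print(evaluate(lex_deferred("0x10")))  # 16
```

Deferring evaluation keeps the original spelling available, which can be useful for diagnostics or source-to-source tools, at the cost of an extra pass.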

Once literals have been evaluated, further semantic analysis in the form of constant folding is possible, meaning that expressions involving only literal values can be evaluated at compile time. For example, in the statement x = 2 + 2, after the literals have been evaluated and the expression 2 + 2 has been parsed, the expression can be evaluated to 4, though the value 4 does not itself appear as a literal.
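Constant folding over a parse tree can be sketched with Python's standard ast module (ast.unparse requires Python 3.9 or later). The class Folder and function fold are hypothetical names, and only integer addition, subtraction and multiplication are handled; a real compiler would cover far more cases.

```python
import ast
import operator

# Operators this sketch knows how to fold.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class Folder(ast.NodeTransformer):
    """Replace integer-only binary operations with their computed value."""

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)  # fold the operands first (bottom-up)
        op = _OPS.get(type(node.op))
        left, right = node.left, node.right
        if (
            op is not None
            and isinstance(left, ast.Constant)
            and isinstance(right, ast.Constant)
            and isinstance(left.value, int)
            and isinstance(right.value, int)
        ):
            folded = ast.Constant(value=op(left.value, right.value))
            return ast.copy_location(folded, node)
        return node


def fold(source: str) -> str:
    """Parse source, fold constant integer expressions, return new source."""
    tree = Folder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(fold("x = 2 + 2"))      # x = 4
print(fold("x = 2 + 3 * 4"))  # x = 14
print(fold("x = y + 2"))      # x = y + 2
```

The last case is left unchanged because y is not a literal and its value is not known at this stage.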

Affixes

Integer literals frequently have prefixes indicating base, and less frequently suffixes indicating type. [1] For example, in C++ 0x10ULL indicates the value 16 (because the 0x prefix marks it as hexadecimal) as an unsigned long long integer.

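Common prefixes include 0x or 0X for hexadecimal (base 16); 0, 0o or 0O for octal (base 8); and 0b or 0B for binary (base 2).

Common suffixes include l or L for long integer, ll or LL for long long integer, and u or U for unsigned integer.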

These affixes are somewhat similar to sigils, though sigils attach to identifiers (names), not literals.
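How a lexer might separate the affixes from the digits can be sketched as below. The function parse_c_integer and the pattern _C_INTEGER are hypothetical and cover only a simplified subset of C-family syntax (no digit separators, and only the u, l and ll suffix families); they are not taken from any compiler.

```python
import re

# prefix: 0x / 0b, or a bare leading 0 (octal) when more digits follow
# digits: validated against the base afterwards by int()
# suffix: u, l, ll and their combinations, in either order
_C_INTEGER = re.compile(
    r"(?P<prefix>0[xX]|0[bB]|0(?=[0-9]))?"
    r"(?P<digits>[0-9a-fA-F]+)"
    r"(?P<suffix>[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?"
)

_BASES = {"0x": 16, "0b": 2, "0": 8, "": 10}


def parse_c_integer(text: str) -> tuple[int, str]:
    """Return (value, suffix) for a simplified C-style integer literal."""
    match = _C_INTEGER.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    prefix = (match["prefix"] or "").lower()
    # int() raises ValueError if a digit is out of range for the base.
    value = int(match["digits"], _BASES[prefix])
    return value, match["suffix"] or ""


print(parse_c_integer("0x10ULL"))  # (16, 'ULL')
print(parse_c_integer("010"))      # (8, '')
print(parse_c_integer("10l"))      # (10, 'l')
```

The prefix determines how the digits are interpreted, while the suffix is passed along unchanged for a later phase to map to a type.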

Digit separators

In some languages, integer literals may contain digit separators to allow digit grouping into more legible forms. If this is available, it can usually be done for floating point literals as well. This is particularly useful for bit fields, and it makes the size of large numbers (such as a million) easier to see at a glance by subitizing rather than counting digits. It is also useful for numbers that are typically grouped, such as credit card numbers or social security numbers. [note 1] In languages that permit consecutive separators, very long numbers can be further grouped by doubling up separators.

Typically decimal numbers (base-10) are grouped in three-digit groups (each representing one of 1000 possible values), binary numbers (base-2) in four-digit groups (one nibble, representing one of 16 possible values), and hexadecimal numbers (base-16) in two-digit groups (each digit is one nibble, so two digits are one byte, representing one of 256 possible values). Numbers from other systems (such as id numbers) are grouped following whatever convention is in use.
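These grouping conventions can be sketched with a small helper that inserts a separator every fixed number of digits, counting from the least significant end. The function group_digits is a hypothetical name for this example, and the underscore default is an assumption; any separator character can be passed.

```python
def group_digits(digits: str, size: int, sep: str = "_") -> str:
    """Group a digit string into chunks of `size`, counted from the right."""
    head = len(digits) % size  # length of the short leading group, if any
    parts = [digits[:head]] if head else []
    parts += [digits[i:i + size] for i in range(head, len(digits), size)]
    return sep.join(parts)


print(group_digits("1000000", 3))       # 1_000_000
print(group_digits("010011000110", 4))  # 0100_1100_0110
print(group_digits("DEADBEEF", 2))      # DE_AD_BE_EF
print(group_digits("1000000", 3, "'"))  # 1'000'000
```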

Examples

In Ada, [2] [3] C# (from version 7.0), D, Eiffel, Go (from version 1.13), [4] Haskell (from GHC version 8.6.1), [5] Java (from version 7), [6] Julia, Perl, Python (from version 3.6), [7] Ruby, Rust [8] and Swift, [9] the digits of integer literals and float literals can be separated with an underscore (_). There can be some restrictions on placement; for example, in Java underscores cannot appear at the start or end of the literal, nor next to a decimal point. While the period, comma, and (thin) spaces are used in normal writing for digit separation, these conflict with their existing use in programming languages as radix point, list separator (and in C/C++, the comma operator), and token separator.

Examples include:

    int oneMillion = 1_000_000;
    int creditCardNumber = 1234_5678_9012_3456;
    int socialSecurityNumber = 123_45_6789;
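Placement restrictions differ between languages. As one illustration, the sketch below probes Python's rules (version 3.6 or later) by calling the built-in int with base 0, which interprets the text as a source-code literal; the helper name accepts is hypothetical. Note that Python, unlike Java, rejects consecutive underscores.

```python
def accepts(text: str) -> bool:
    """Return True if Python accepts `text` as an integer literal."""
    try:
        int(text, 0)  # base 0: interpret exactly as a code literal
    except ValueError:
        return False
    return True


print(accepts("1_000_000"))            # True
print(accepts("1234_5678_9012_3456"))  # True
print(accepts("_1"))                   # False: leading underscore
print(accepts("1_"))                   # False: trailing underscore
print(accepts("1__000"))               # False: doubled underscore
```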

In C++14 (2014) and in C23, the apostrophe character may be used to separate digits arbitrarily in numeric literals. [10] [11] The underscore was proposed first, in 1993, [12] and again for C++11, [13] following other languages. However, the underscore conflicted with user-defined literals, so the apostrophe was proposed instead, as an "upper comma" (which is used in some other contexts). [14] [15]

    auto integer_literal = 1'000'000;
    auto binary_literal = 0b0100'1100'0110;
    auto very_long_binary_literal = 0b0000'0001'0010'0011'0100'0101'0110'0111;

Notes

  1. Typically sensitive numbers such as these would not be included as literals, however.

References

  1. "2.4.4. Integer and long integer literals".
  2. "Ada '83 Language Reference Manual: 2.4. Numeric Literals".
  3. "Rationale for the Design of the Ada® Programming Language: 2.1 Lexical Structure".
  4. "Go 1.13 Release Notes - Changes to the language". Retrieved 2020-11-05.
  5. "Glasgow Haskell Compiler User's Guide: 11.3.7. Numeric underscores". Retrieved 2019-01-31.
  6. "Underscores in Numeric Literals". Retrieved 2015-08-12.
  7. "What's New In Python 3.6".
  8. "Literals and operators". Retrieved 2019-11-15.
  9. "The Swift Programming Language: Lexical Structure".
  10. Crowl, Lawrence; Smith, Richard; Snyder, Jeff; Vandevoorde, Daveed (2013-09-25). "N3781 Single-Quotation-Mark as a Digit Separator" (PDF).
  11. Ballman, Aaron (2020-12-15). "N2626: Digit separators" (PDF).
  12. Skaller, John Max (1993-03-26). "N0259: A Proposal to allow Binary Literals, and some other small changes to Chapter 2: Lexical Conventions" (PDF).
  13. Crowl, Lawrence (2007-05-02). "N2281: Digit Separators".
  14. Vandevoorde, Daveed (2012-09-21). "N3448: Painless Digit Separation" (PDF).
  15. Crowl, Lawrence (2012-12-19). "N3499: Digit Separators".