Symbol table

Last updated

In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier (or symbol), constant, procedure and function in a program's source code is associated with information relating to its declaration or appearance in the source. In other words, the entries of a symbol table store the information related to the entry's corresponding symbol. [1]

Contents

Background

A symbol table may only exist in memory during the translation process, or it may be embedded in the output of the translation, such as in an ABI object file for later use. For example, it might be used during an interactive debugging session, or as a resource for formatting a diagnostic report during or after execution of a program. [2]

Description

The minimum information contained in a symbol table used by a translator and intermediate representation (IR) includes the symbol's name and its location or address. For a compiler targeting a platform with a concept of relocatability, it will also contain relocatability attributes (absolute, relocatable, etc.) and needed relocation information for relocatable symbols. Symbol tables for high-level programming languages may store the symbol's type: string, integer, floating-point, etc., its size, and its dimensions and its bounds. Not all of this information is included in the output file, but may be provided for use in debugging. In many cases, the symbol's cross-reference information is stored with or linked to the symbol table. Most compilers print some or all of this information in symbol table and cross-reference listings at the end of translation. [1]

Implementation

Numerous data structures are available for implementing tables. Trees, linear lists and self-organizing lists can all be used to implement a symbol table. The symbol table is accessed by most phases of a compiler, beginning with lexical analysis, and continuing through optimization.

A compiler may use one large symbol table for all symbols or use separated, or hierarchical symbol tables for different scopes. For example, in a scoped language such as Algol or PL/I a symbol "p" can be declared separately in several procedures, perhaps with different attributes. The scope of each declaration is the section of the program in which references to "p" resolve to that declaration. Each declaration represents a unique identifier "p". The symbol table must have some means of differentiating references to the different "p"s.

A common data structure used to implement symbol tables is the hash table. The time for searching in hash tables is independent of the number of elements stored in the table, so it is efficient for a large number of elements. It also simplifies the classification of literals in tabular format by including the classification in calculation of the hash key. [3]

As the lexical analyser spends a great proportion of its time looking up the symbol table, this activity has a crucial effect on the overall speed of the compiler. A symbol table must be organised in such a way that entries can be found as quickly as possible. Hash tables are usually used to organise a symbol table, where the keyword or identifier is 'hashed' to produce an array subscript. Collisions are inevitable in a hash table, and a common way of handling them is to store the synonym in the next available free space in the table.

Applications

An object file will contain a symbol table of the identifiers it contains that are externally visible. During the linking of different object files, a linker will identify and resolve these symbol references. Usually all undefined external symbols will be searched for in one or more object libraries. If a module is found that defines that symbol it is linked together with the first object file, and any undefined external identifiers are added to the list of identifiers to be looked up. This process continues until all external references have been resolved. It is an error if one or more remains unresolved at the end of the process.

While reverse engineering an executable, many tools refer to the symbol table to check what addresses have been assigned to global variables and known functions. If the symbol table has been stripped or cleaned out before being converted into an executable, tools will find it harder to determine addresses or understand anything about the program.

Example

Consider the following program written in C:

// Declare an external functionexterndoublebar(doublex);// Define a public functiondoublefoo(intcount){doublesum=0.0;// Sum all the values bar(1) to bar(count)for(inti=1;i<=count;i++)sum+=bar((double)i);returnsum;}

A C compiler that parses this code will contain at least the following symbol table entries:

Symbol nameTypeScope
barfunction, doubleextern
xdoublefunction parameter
foofunction, doubleglobal
countintfunction parameter
sumdoubleblock local
iintfor-loop statement

In addition, the symbol table may also contain entries generated by the compiler for intermediate expression values (e.g., the expression that casts the i loop variable into a double, and the return value of the call to function bar()), statement labels, and so forth.

Example: SysV ABI

Example table: SysV ABI
AddressTypeName
00000020aT_BIT
00000040aF_BIT
00000080aI_BIT
20000004tirqvec
20000008tfiqvec
2000000ctInitReset
20000018T_main
20000024tEnd
20000030TAT91F_US3_CfgPIO_useB
2000005ctAT91F_PIO_CfgPeriph
200000b0T main
20000120TAT91F_DBGU_Printk
20000190tAT91F_US_TxReady
200001c0tAT91F_US_PutChar
200001f8TAT91F_SpuriousHandler
20000214TAT91F_DataAbort
20000230TAT91F_FetchAbort
2000024cTAT91F_Undef
20000268TAT91F_UndefHandler
20000284TAT91F_LowLevelInit
200002e0tAT91F_DBGU_CfgPIO
2000030ctAT91F_PIO_CfgPeriph
20000360tAT91F_US_Configure
200003dctAT91F_US_SetBaudrate
2000041ctAT91F_US_Baudrate
200004ectAT91F_US_SetTimeguard
2000051ctAT91F_PDC_Open
2000059ctAT91F_PDC_DisableRx
200005c8tAT91F_PDC_DisableTx
200005f4tAT91F_PDC_SetNextTx
20000638tAT91F_PDC_SetNextRx
2000067ctAT91F_PDC_SetTx
200006c0tAT91F_PDC_SetRx
20000704tAT91F_PDC_EnableRx
20000730tAT91F_PDC_EnableTx
2000075ctAT91F_US_EnableTx
20000788T__aeabi_uidiv
20000788T__udivsi3
20000884T__aeabi_uidivmod
2000089cT__aeabi_idiv0
2000089cT__aeabi_ldiv0
2000089cT__div0
200009a0D_data
200009a0A_etext
200009a4A__bss_end__
200009a4A__bss_start
200009a4A__bss_start__
200009a4A_edata
200009a4A_end

An example of a symbol table can be found in the SysV Application Binary Interface (ABI) specification, which mandates how symbols are to be laid out in a binary file, so that different compilers, linkers and loaders can all consistently find and work with the symbols in a compiled object.

The SysV ABI is implemented in the GNU binutils' nm utility. This format uses a sorted memory address field, a "symbol type" field, and a symbol identifier (called "Name"). [4]

The symbol types in the SysV ABI (and nm's output) indicate the nature of each entry in the symbol table. Each symbol type is represented by a single character. For example, symbol table entries representing initialized data are denoted by the character "d" and symbol table entries for functions have the symbol type "t" (because executable code is located in the text section of an object file). Additionally, the capitalization of the symbol type indicates the type of linkage: lower-case letters indicate the symbol is local and upper-case indicates external (global) linkage.

Example: the Python symbol table

The Python programming language includes extensive support for creating and manipulating symbol tables. [5] Properties that can be queried include whether a given symbol is a free variable or a bound variable, whether it is block scope or global scope, whether it is imported, and what namespace it belongs to.

Example: Dynamic symbol tables

Some programming languages allow the symbol table to be manipulated at run-time, so that symbols can be added at any time. Racket is an example of such a language. [6]

Both the LISP and the Scheme programming languages allow arbitrary, generic properties to be associated with each symbol. [7]

The Prolog programming language is essentially a symbol-table manipulation language; symbols are called atoms, and the relationships between symbols can be reasoned over. Similarly, OpenCog provides a dynamic symbol table, called the atomspace, which is used for knowledge representation.

See also

Related Research Articles

<span class="mw-page-title-main">Common Lisp</span> Programming language standard

Common Lisp (CL) is a dialect of the Lisp programming language, published in American National Standards Institute (ANSI) standard document ANSI INCITS 226-1994 (S20018). The Common Lisp HyperSpec, a hyperlinked HTML version, has been derived from the ANSI Common Lisp standard.

In computer programming, the scope of a name binding is the part of a program where the name binding is valid; that is, where the name can be used to refer to the entity. In other parts of the program, the name may refer to a different entity, or to nothing at all. Scope helps prevent name collisions by allowing the same name to refer to different objects – as long as the names have separate scopes. The scope of a name binding is also known as the visibility of an entity, particularly in older or more technical literature—this is from the perspective of the referenced entity, not the referencing name.

The Common Object File Format (COFF) is a format for executable, object code, and shared library computer files used on Unix systems. It was introduced in Unix System V, replaced the previously used a.out format, and formed the basis for extended specifications such as XCOFF and ECOFF, before being largely replaced by ELF, introduced with SVR4. COFF and its variants continue to be used on some Unix-like systems, on Microsoft Windows, in UEFI environments and in some embedded development systems.

The syntax of the C programming language is the set of rules governing writing of software in the C language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

Mach-O, short for Mach object file format, is a file format for executables, object code, shared libraries, dynamically loaded code, and core dumps. It was developed to replace the a.out format.

In compiler construction, name mangling is a technique used to solve various problems caused by the need to resolve unique names for programming entities in many modern programming languages.

In computer programming, a sigil is a symbol affixed to a variable name, showing the variable's datatype or scope, usually a prefix, as in $foo, where $ is the sigil.

In computer programming, an entry point is the place in a program where the execution of a program begins, and where the program has access to command line arguments.

xHarbour is a free multi-platform extended Clipper compiler, offering multiple graphic terminals (GTs), including console drivers, GUIs, and hybrid console/GUIs. xHarbour is backward-compatible with Clipper and supports many language syntax extensions, greatly extended run-time libraries, and extensive third party support.

The C and C++ programming languages are closely related but have many significant differences. C++ began as a fork of an early, pre-standardized C, and was designed to be mostly source-and-link compatible with C compilers of the time. Due to this, development tools for the two languages are often integrated into a single product, with the programmer able to specify C or C++ as their source language.

Program database (PDB) is a file format for storing debugging information about a program. PDB files commonly have a .pdb extension. A PDB file is typically created from source files during compilation. It stores a list of all symbols in a module with their addresses and possibly the name of the file and the line on which the symbol was declared. This symbol information is not stored in the module itself, because it takes up a lot of space.

A debug symbol is a special kind of symbol that attaches additional information to the symbol table of an object file, such as a shared library or an executable. This information allows a symbolic debugger to gain access to information from the source code of the binary, such as the names of identifiers, including variables and routines.

In the C programming language, an external variable is a variable defined outside any function block. On the other hand, a local (automatic) variable is a variable defined inside a function block.

As an alternative to automatic variables, it is possible to define variables that are external to all functions, that is, variables that can be accessed by name by any function. Because external variables are globally accessible, they can be used instead of argument lists to communicate data between functions. Furthermore, because external variables remain in existence permanently, rather than appearing and disappearing as functions are called and exited, they retain their values even after the functions that set them have returned.

In programming languages, particularly the compiled ones like C, C++, and D, linkage describes how names can or can not refer to the same entity throughout the whole program or one single translation unit.

This article describes the syntax of the C# programming language. The features described are compatible with .NET Framework and Mono.

This comparison of programming languages compares how object-oriented programming languages such as C++, Java, Smalltalk, Object Pascal, Perl, Python, and others manipulate data structures.

In computer programming, a constant is a value that should not be altered by the program during normal execution, i.e., the value is constant. When associated with an identifier, a constant is said to be "named," although the terms "constant" and "named constant" are often used interchangeably. This is contrasted with a variable, which is an identifier with a value that can be changed during normal execution, i.e., the value is variable.

The Perl virtual machine is a stack-based process virtual machine implemented as an opcodes interpreter which runs previously compiled programs written in the Perl language. The opcodes interpreter is a part of the Perl interpreter, which also contains a compiler in one executable file, commonly /usr/bin/perl on various Unix-like systems or perl.exe on Microsoft Windows systems.

The OS/360 Object File Format is the standard object module file format for the IBM DOS/360, OS/360 and VM/370, Univac VS/9, and Fujitsu BS2000 mainframe operating systems. In the 1990s, the format was given an extension with the XSD-type record for the MVS Operating System to support longer module names in the C Programming Language. This format is still in use by the z/VSE operating system. In contrast, it has been superseded by the GOFF file format on the MVS Operating System and on the z/VM Operating System. Since the MVS and z/VM loaders will still handle this older format, some compilers have chosen to continue to produce this format instead of the newer GOFF format.

The GOFF specification was developed for IBM's MVS operating system to supersede the IBM OS/360 Object File Format to compensate for weaknesses in the older format.

References

  1. 1 2 Copper & Torczon 2011, p. 253.
  2. Nguyen, Binh (2004). Linux Dictionary. p. 1482. Retrieved Apr 14, 2018.
  3. Copper & Torczon 2011, p. 254.
  4. "nm". sourceware.org. Retrieved May 30, 2020.
  5. symtable — Python documentation
  6. Symbols - Racket Documentation
  7. Symbols - Guile Documentation

Bibliography