Hollerith constant

Hollerith constants, named in honor of Herman Hollerith, were used in early FORTRAN programs to allow manipulation of character data.

Early FORTRAN had no CHARACTER data type, only numeric types. To manipulate characters, a program had to place them into numeric variables using Hollerith constants. For example, the constant 3HABC specified the three-character string "ABC": the leading integer 3 gives the string length, the letter H marks the constant as Hollerith, and ABC is the string data. These constants were typeless, so there were no type conversion issues. If the constant specified fewer characters than a data item could hold, the characters were stored in the item left-justified and blank-filled.
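The storage rule described above can be modelled with a short Python sketch. It is only an illustration, not how any particular compiler worked: the function names parse_hollerith and pack_word are hypothetical, and the sketch assumes 8-bit ASCII characters, four characters per word, and the leftmost character in the most significant byte. Historical machines used other character codes and word layouts.

```python
def parse_hollerith(constant: str) -> str:
    """Return the character data of a Hollerith constant such as '3HABC'."""
    count_text, sep, data = constant.partition("H")
    if not sep or not (count_text.isascii() and count_text.isdigit()):
        raise ValueError(f"not a Hollerith constant: {constant!r}")
    count = int(count_text)
    if count < 1 or len(data) != count:
        raise ValueError(f"length prefix does not match data: {constant!r}")
    return data


def pack_word(text: str, chars_per_word: int = 4) -> int:
    """Pack text into one numeric word, left-justified and blank-filled.

    Assumption: 8-bit ASCII, leftmost character in the high-order byte.
    """
    if len(text) > chars_per_word:
        raise ValueError("text does not fit in one word")
    padded = text.ljust(chars_per_word)  # blank-fill on the right
    return int.from_bytes(padded.encode("ascii"), "big")


word = pack_word(parse_hollerith("3HABC"))
print(hex(word))  # 0x41424320, i.e. 'A', 'B', 'C', blank
```

Under these assumptions the fourth byte of the word is the blank (0x20) that fills the unused position.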

Mechanics

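By the FORTRAN 66 Standard, Hollerith syntax was allowed in three places: as constants in DATA statements, as constant actual arguments in subroutine CALL statements, and as edit descriptors in FORMAT statements.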

Portability was problematic with Hollerith constants. First, word sizes varied between computer systems, so the number of characters that could be placed in each data item varied as well. Implementations ranged from as few as two to as many as ten characters per word. Second, it was difficult to manipulate individual characters within a word in a portable fashion. This led to a great deal of shifting and masking code that relied on non-standard, vendor-specific features. The fact that character sets varied between machines complicated the issue further.
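The kind of shifting and masking mentioned above can be sketched in Python. This is a hedged illustration rather than period code: the helper names chars_per_word and char_at are hypothetical, and the example word assumes 8-bit ASCII characters packed four to a word with the leftmost character in the high-order bits.

```python
def chars_per_word(word_bits: int, bits_per_char: int) -> int:
    """How many whole characters fit in one machine word."""
    return word_bits // bits_per_char


def char_at(word: int, index: int, n_chars: int = 4, bits_per_char: int = 8) -> str:
    """Extract the character at 0-based position index, counting from the left."""
    if not 0 <= index < n_chars:
        raise IndexError("character position outside the word")
    shift = (n_chars - 1 - index) * bits_per_char  # move the character to the low bits
    mask = (1 << bits_per_char) - 1                # keep only one character's bits
    return chr((word >> shift) & mask)


# Capacity depends on both word size and character size.
print(chars_per_word(16, 8))   # 2
print(chars_per_word(60, 6))   # 10

word = 0x41424320              # 'ABC ' under the stated assumptions
print([char_at(word, i) for i in range(4)])  # ['A', 'B', 'C', ' ']
```

Both the shift distance and the mask depend on the word and character sizes, which is why code of this kind had to be rewritten for each machine.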

Some authors held that, for best portability, only a single character should be stored per data item. However, given the small memory sizes of machines of the day, this technique was considered extremely wasteful.

Technological obsolescence

One of the major features of FORTRAN 77 was the CHARACTER string data type. Use of this data type dramatically simplified character manipulation in Fortran programs, rendering almost all uses of the Hollerith constant technique obsolete.

Hollerith constants were removed from the FORTRAN 77 Standard, though still described in an appendix for those wishing to continue support. Hollerith edit descriptors were allowed through Fortran 90, and were removed from the Fortran 95 Standard.

Examples

The following is a FORTRAN 66 hello world program using Hollerith constants. It assumes that at least four characters per word are supported by the implementation:

      PROGRAM HELLO1
C
      INTEGER IHWSTR(3)
      DATA IHWSTR/4HHELL,4HO WO,3HRLD/
C
      WRITE (6,100) IHWSTR
      STOP
  100 FORMAT (3A4)
      END

Besides DATA statements, Hollerith constants were also allowed as actual arguments in subroutine calls. However, the callee had no way of knowing how many characters were passed in, so the programmer had to pass that information explicitly. On a machine where four characters are stored in a word, the hello world program could be written as follows:

      PROGRAM HELLO2
      CALL WRTOUT (11HHELLO WORLD, 11)
      STOP
      END
C
      SUBROUTINE WRTOUT (IARRAY, NCHRS)
C
C     See note 1.
      INTEGER IARRAY(1)
      INTEGER NCHRS
C
C     See note 2.
      INTEGER ICPW
      DATA ICPW/4/
      INTEGER I, NWRDS
C
      NWRDS = (NCHRS + ICPW - 1) /ICPW
      WRITE (6,100) (IARRAY(I), I=1,NWRDS)
      RETURN
C     See note 3.
  100 FORMAT (100A4)
      END
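The NWRDS expression in WRTOUT is a ceiling division: it rounds the character count up to a whole number of words. A minimal Python sketch of the same arithmetic follows. The function names words_needed and split_into_words are hypothetical, and the four-characters-per-word default simply mirrors the ICPW value in the listing.

```python
def words_needed(nchars: int, chars_per_word: int) -> int:
    """Ceiling division, as in NWRDS = (NCHRS + ICPW - 1) / ICPW."""
    return (nchars + chars_per_word - 1) // chars_per_word


def split_into_words(text: str, chars_per_word: int = 4) -> list[str]:
    """Cut text into word-sized pieces, blank-filling the last one."""
    n = words_needed(len(text), chars_per_word)
    padded = text.ljust(n * chars_per_word)
    return [padded[i * chars_per_word:(i + 1) * chars_per_word] for i in range(n)]


print(words_needed(11, 4))              # 3
print(split_into_words("HELLO WORLD"))  # ['HELL', 'O WO', 'RLD ']
```

The three pieces correspond to the three Hollerith constants used in the DATA statement of the first example.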

Although technically not a Hollerith constant, the same Hollerith syntax was allowed as an edit descriptor in FORMAT statements. The hello world program could also be written as:

      PROGRAM HELLO3
      WRITE (6,100)
      STOP
  100 FORMAT (11HHELLO WORLD)
      END

One of the most surprising features was the behaviour of Hollerith edit descriptors when used for input. At run time, the following program would replace HELLO WORLD with whatever the next eleven characters in the input stream happened to be, and then print that input:

      PROGRAM WHAT1
      READ (5,100)
      WRITE (6,100)
      STOP
  100 FORMAT (11HHELLO WORLD)
      END
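The effect can be imitated with a toy Python model in which the format owns a mutable literal. This is only a sketch of the behaviour described above, not of any real runtime: the class HollerithFormat and its methods are hypothetical, and blank-padding a short input record is an assumption made for the example.

```python
import io


class HollerithFormat:
    """Toy model of a FORMAT statement holding a single nH edit descriptor."""

    def __init__(self, literal: str) -> None:
        self.literal = literal

    def write(self, out) -> None:
        """On output, the literal is copied to the record."""
        out.write(self.literal + "\n")

    def read(self, inp) -> None:
        """On input, the next characters of the record replace the literal."""
        width = len(self.literal)
        record = inp.readline().rstrip("\n")
        # Assumption: a short record is padded with blanks.
        self.literal = record[:width].ljust(width)


fmt = HollerithFormat("HELLO WORLD")
fmt.read(io.StringIO("GOODBYE ALL!\n"))  # stands in for READ (5,100)
out = io.StringIO()
fmt.write(out)                           # stands in for WRITE (6,100)
print(out.getvalue(), end="")            # GOODBYE ALL
```

Only the first eleven characters of the record are taken, matching the count in the 11H descriptor.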

Notes

  1. FORTRAN 66 did not have a way to indicate a variable-sized array. So a '1' was typically used to indicate that the size is unknown.
  2. Four characters per word.
  3. A count of 100 is a 'large enough' value that any reasonable number of characters can be written. Also note that four characters per word is hard-coded here too.
