Printf format string

An example of the printf function

The printf format string is a control parameter used by a class of functions in the input/output libraries of C and many other programming languages. The string is written in a simple template language: characters are usually copied literally into the function's output, but format specifiers, which start with a % character, indicate the location and method to translate a piece of data (such as a number) to characters.

"printf" is the name of one of the main C output functions, and stands for "print formatted". printf format strings are complementary to scanf format strings, which provide formatted input (parsing). Both provide simple functionality and a fixed format compared with more sophisticated and flexible template engines or lexers and parsers, but they are sufficient for many purposes.

Many languages other than C copy the printf format string syntax closely or exactly in their own I/O functions.

Mismatches between the format specifiers and type of the data can cause crashes and other vulnerabilities. The format string itself is very often a string literal, which allows static analysis of the function call. However, it can also be the value of a variable, which allows for dynamic formatting but also a security vulnerability known as an uncontrolled format string exploit.

History

Early programming languages such as Fortran used special statements with completely different syntax from other calculations to build formatting descriptions. In this example, the format is specified on line 601, and the WRITE command refers to it by line number:

      WRITE OUTPUT TAPE 6, 601, IA, IB, IC, AREA
 601  FORMAT (4H A= ,I5,5H  B= ,I5,5H  C= ,I5,
     &       8H  AREA= ,F10.2, 13H SQUARE UNITS)

ALGOL 68 had a more function-like API, but still used special syntax (the $ delimiters surround special formatting syntax):

printf(($"Color "g", number1 "6d,", number2 "4zd,", hex "16r2d,", float "-d.2d,", unsigned value"-3d"."l$, "red", 123456, 89, BIN 255, 3.14, 250));

Using ordinary function calls and data types instead simplifies the language and compiler, and allows the input/output implementation to be written in the same language. These advantages outweigh the disadvantages (such as a complete lack of type safety in many instances), and in most newer languages I/O is not part of the syntax.

C's printf has its origins in BCPL's writef function (1966). In comparison to C and printf, *N is a BCPL language escape sequence representing a newline character (for which C uses the escape sequence \n) and the order of the format specification's field width and type is reversed in writef: [1]

WRITEF("%I2-QUEENS PROBLEM HAS %I5 SOLUTIONS*N", NUMQUEENS, COUNT) 

Probably the first copying of the syntax outside the C language was the Unix printf shell command, which first appeared in Version 4, as part of the port to C. [2]

Format placeholder specification

Formatting takes place via placeholders within the format string. For example, if a program wanted to print a person's age, it could prefix the output with "Your age is " and use the signed decimal specifier character d to show the integer age immediately after that message:

printf("Your age is %d", age);
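As a minimal sketch of the placeholder in use, the following writes the same message into a caller-supplied buffer with snprintf so that the result can be inspected. The helper name format_age is hypothetical and chosen only for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the C library): formats the age
   message from the text into buf and returns buf for convenience. */
const char *format_age(char *buf, size_t size, int age)
{
    /* %d is replaced by the decimal text of the int argument;
       every other character is copied literally. */
    snprintf(buf, size, "Your age is %d", age);
    return buf;
}
```

With age equal to 30, the buffer would hold the text Your age is 30.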

Syntax

The syntax for a format placeholder is

%[parameter][flags][width][.precision][length]type

Parameter field

This is a POSIX extension and not in C99. The Parameter field can be omitted or can be:

Character    Description
n$           n is the number of the parameter to display using this format specifier, allowing the parameters provided to be output multiple times, using varying format specifiers or in different orders. If any single placeholder specifies a parameter, all the rest of the placeholders must also specify a parameter. For example, printf("%2$d %2$#x; %1$d %1$#x", 16, 17) produces 17 0x11; 16 0x10.
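The positional example can be sketched as follows, assuming a POSIX-conforming C library, since n$ is not part of ISO C. The helper name format_swapped is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints the second argument first, each value
   once in decimal and once in alternate-form hexadecimal.
   Assumes a POSIX printf; "%n$" is not available in plain ISO C. */
const char *format_swapped(char *buf, size_t size, int first, int second)
{
    /* Every placeholder names its parameter, as the rule above requires. */
    snprintf(buf, size, "%2$d %2$#x; %1$d %1$#x", first, second);
    return buf;
}
```

Called with 16 and 17, this yields 17 0x11; 16 0x10, matching the example in the table.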

This feature is mainly used in localization, where the order in which parameters occur varies with language-dependent convention.

On the non-POSIX Microsoft Windows, support for this feature is placed in a separate _printf_p function.

Flags field

The Flags field can be zero or more (in any order) of:

Character         Description
- (minus)         Left-align the output of this placeholder. (The default is to right-align the output.)
+ (plus)          Prepends a plus sign for positive signed-numeric types: positive = +, negative = -. (The default doesn't prepend anything in front of positive numbers.)
(space)           Prepends a space for positive signed-numeric types: positive = a space, negative = -. This flag is ignored if the + flag exists. (The default doesn't prepend anything in front of positive numbers.)
0 (zero)          When the width option is specified, prepends zeros for numeric types. (The default prepends spaces.) For example, printf("%4X", 3) produces "   3" (three spaces followed by 3), while printf("%04X", 3) produces "0003".
' (apostrophe)    The integer or exponent of a decimal has the thousands grouping separator applied.
# (hash)          Alternate form: for g and G types, trailing zeros are not removed; for f, F, e, E, g, G types, the output always contains a decimal point; for o, x, X types, the text 0, 0x, 0X, respectively, is prepended to non-zero numbers.
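A short sketch of several of these flags applied to one value may help. The helper name format_flags and the bracketed layout are assumptions made for this illustration; the apostrophe flag is left out because its output depends on the locale.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: shows the -, +, space, 0 and # flags side by side.
   Brackets make the padding visible. */
const char *format_flags(char *buf, size_t size, int value)
{
    snprintf(buf, size, "[%-6d] [%+d] [% d] [%06d] [%#x] [%#o]",
             value,            /* left-aligned in a 6-character field */
             value,            /* explicit sign                       */
             value,            /* space in place of a plus sign       */
             value,            /* zero-padded to 6 characters         */
             (unsigned)value,  /* hexadecimal with 0x prefix          */
             (unsigned)value); /* octal with leading 0                */
    return buf;
}
```

For the value 42 the result is [42    ] [+42] [ 42] [000042] [0x2a] [052].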

Width field

The Width field specifies a minimum number of characters to output and is typically used to pad fixed-width fields in tabulated output, where the fields would otherwise be smaller, although it does not cause truncation of oversized fields.

The width field may be omitted, given as a numeric integer value, or supplied dynamically as another argument when indicated by an asterisk *. [3] For example, printf("%*d", 5, 10) will result in "   10" being printed (three spaces followed by 10), for a total width of 5 characters.

Though not part of the width field, a leading zero is interpreted as the zero-padding flag mentioned above, and a negative value is treated as the positive value in conjunction with the left-alignment - flag also mentioned above.
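The width rules above can be sketched with a dynamic width passed through the asterisk. The helper name format_width is hypothetical, and the brackets are only there to make the padding visible.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats value in a field whose minimum width is
   supplied at run time through the '*' width. */
const char *format_width(char *buf, size_t size, int width, int value)
{
    /* A negative width is treated as the '-' flag plus the positive width.
       The width is only a minimum, so wide values are never truncated. */
    snprintf(buf, size, "[%*d]", width, value);
    return buf;
}
```

A width of 5 with the value 10 gives [   10], a width of -5 gives [10   ], and a width of 2 with the value 12345 gives [12345] unchanged.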

Precision field

The Precision field usually specifies a maximum limit on the output, depending on the particular formatting type. For floating-point numeric types, it specifies the number of digits to the right of the decimal point to which the output should be rounded. For the string type, it limits the number of characters that should be output, after which the string is truncated.

The precision field may be omitted, given as a numeric integer value, or supplied dynamically as another argument when indicated by an asterisk *. For example, printf("%.*s", 3, "abcdef") will result in abc being printed.
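Both uses of the precision can be sketched as below, with the precision passed dynamically through the asterisk. The helper names format_rounded and format_prefix are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rounds value to the given number of digits
   after the decimal point. */
const char *format_rounded(char *buf, size_t size, int digits, double value)
{
    snprintf(buf, size, "%.*f", digits, value);
    return buf;
}

/* Hypothetical helper: outputs at most max_chars characters of text. */
const char *format_prefix(char *buf, size_t size, int max_chars, const char *text)
{
    snprintf(buf, size, "%.*s", max_chars, text);
    return buf;
}
```

Here format_rounded with 2 digits turns 3.14159 into 3.14, and format_prefix with a limit of 3 turns abcdef into abc.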

Length field

The Length field can be omitted or be any of:

Character    Description
hh           For integer types, causes printf to expect an int-sized integer argument which was promoted from a char.
h            For integer types, causes printf to expect an int-sized integer argument which was promoted from a short.
l            For integer types, causes printf to expect a long-sized integer argument. For floating-point types, this is ignored. float arguments are always promoted to double when used in a varargs call. [4]
ll           For integer types, causes printf to expect a long long-sized integer argument.
L            For floating-point types, causes printf to expect a long double argument.
z            For integer types, causes printf to expect a size_t-sized integer argument.
j            For integer types, causes printf to expect an intmax_t-sized integer argument.
t            For integer types, causes printf to expect a ptrdiff_t-sized integer argument.
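A brief sketch of a few length modifiers follows. The helper name format_lengths and the sample values are assumptions for illustration; the hh result assumes the usual 8-bit char.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pairs each argument type with its length modifier. */
const char *format_lengths(char *buf, size_t size)
{
    size_t count = 42;
    long long big = 1099511627776LL; /* 2^40, too wide for a 32-bit int */
    ptrdiff_t diff = -3;
    int wide = 300;

    /* %hhu converts the int argument to unsigned char before printing,
       so 300 becomes 300 - 256 = 44 where char has 8 bits. */
    snprintf(buf, size, "%zu %lld %td %hhu", count, big, diff, wide);
    return buf;
}
```

Under those assumptions the buffer holds 42 1099511627776 -3 44.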

Additionally, several platform-specific length options came to exist prior to widespread use of the ISO C99 extensions:

Characters    Description
I             For signed integer types, causes printf to expect a ptrdiff_t-sized integer argument; for unsigned integer types, causes printf to expect a size_t-sized integer argument. Commonly found in Win32/Win64 platforms.
I32           For integer types, causes printf to expect a 32-bit (double word) integer argument. Commonly found in Win32/Win64 platforms.
I64           For integer types, causes printf to expect a 64-bit (quad word) integer argument. Commonly found in Win32/Win64 platforms.
q             For integer types, causes printf to expect a 64-bit (quad word) integer argument. Commonly found in BSD platforms.

ISO C99 includes the inttypes.h header file, which defines a number of macros for use in platform-independent printf coding. Because each macro expands to a string literal, it must be placed outside the double quotes of the format string, e.g. printf("%" PRId64 "\n", t);

Example macros include:

Macro     Description
PRId32    Typically equivalent to I32d (Win32/Win64) or d
PRId64    Typically equivalent to I64d (Win32/Win64), lld (32-bit platforms) or ld (64-bit platforms)
PRIi32    Typically equivalent to I32i (Win32/Win64) or i
PRIi64    Typically equivalent to I64i (Win32/Win64), lli (32-bit platforms) or li (64-bit platforms)
PRIu32    Typically equivalent to I32u (Win32/Win64) or u
PRIu64    Typically equivalent to I64u (Win32/Win64), llu (32-bit platforms) or lu (64-bit platforms)
PRIx32    Typically equivalent to I32x (Win32/Win64) or x
PRIx64    Typically equivalent to I64x (Win32/Win64), llx (32-bit platforms) or lx (64-bit platforms)
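As a sketch of how these macros are spliced into a format string, the following relies on adjacent string literals being concatenated by the compiler. The helper name format_fixed_width is hypothetical.

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a 64-bit signed value in decimal and a
   32-bit unsigned value in hexadecimal without hard-coding l or ll. */
const char *format_fixed_width(char *buf, size_t size, int64_t t, uint32_t mask)
{
    /* "%" PRId64 " %" PRIx32 becomes one literal such as "%ld %x"
       or "%lld %x", depending on the platform. */
    snprintf(buf, size, "%" PRId64 " %" PRIx32, t, mask);
    return buf;
}
```

Passing -9000000000 and 255 gives -9000000000 ff regardless of whether long is 32 or 64 bits wide.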

Type field

The Type field can be any of:

Character    Description
%            Prints a literal % character (this type doesn't accept any flags, width, precision, length fields).
d, i         int as a signed integer. %d and %i are synonymous for output, but are different when used with scanf for input (where using %i will interpret a number as hexadecimal if it's preceded by 0x, and octal if it's preceded by 0).
u            Print decimal unsigned int.
f, F         double in normal (fixed-point) notation. f and F differ only in how the strings for an infinite number or NaN are printed (inf, infinity and nan for f; INF, INFINITY and NAN for F).
e, E         double value in standard form (d.ddde±dd). An E conversion uses the letter E (rather than e) to introduce the exponent. The exponent always contains at least two digits; if the value is zero, the exponent is 00. In Windows, the exponent contains three digits by default, e.g. 1.5e002, but this can be altered by the Microsoft-specific _set_output_format function.
g, G         double in either normal or exponential notation, whichever is more appropriate for its magnitude. g uses lower-case letters, G uses upper-case letters. This type differs slightly from fixed-point notation in that insignificant zeroes to the right of the decimal point are not included. Also, the decimal point is not included on whole numbers.
x, X         unsigned int as a hexadecimal number. x uses lower-case letters and X uses upper-case.
o            unsigned int in octal.
s            null-terminated string.
c            char (character).
p            void* (pointer to void) in an implementation-defined format.
a, A         double in hexadecimal notation, starting with 0x or 0X. a uses lower-case letters, A uses upper-case letters. [5] [6] (C++11 iostreams have a hexfloat that works the same).
n            Prints nothing, but writes the number of characters written so far into an integer pointer parameter. In Java this prints a newline. [7]
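Several of the conversion types can be sketched together as follows. The helper name format_types and the sample values are assumptions for this illustration, and the stated output assumes a C99-conforming library that prints two-digit exponents; %p and %a are omitted because their output varies more between implementations.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: one placeholder per conversion type, separated by '|'. */
const char *format_types(char *buf, size_t size)
{
    snprintf(buf, size, "%d|%u|%x|%X|%o|%c|%s|%.2f|%e|%g|%%",
             -42,      /* d: signed decimal             */
             42u,      /* u: unsigned decimal           */
             255u,     /* x: lower-case hexadecimal     */
             255u,     /* X: upper-case hexadecimal     */
             8u,       /* o: octal                      */
             'A',      /* c: single character           */
             "text",   /* s: null-terminated string     */
             3.14159,  /* f: fixed-point, 2 digits      */
             1234.5,   /* e: exponential notation       */
             0.0001);  /* g: shortest suitable notation */
    return buf;
}
```

On such a library the buffer holds -42|42|ff|FF|10|A|text|3.14|1.234500e+03|0.0001|%.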

Custom format placeholders

There are a few implementations of printf-like functions that allow extensions to the escape-character-based mini-language, thus allowing the programmer to define a specific formatting function for non-builtin types. One of the best known is glibc's (now deprecated) register_printf_function(). However, it is rarely used because it conflicts with static format string checking. Another is Vstr custom formatters, which allow adding multi-character format names.

Some applications (like the Apache HTTP Server) include their own printf-like function, and embed extensions into it. However these all tend to have the same problems that register_printf_function() has.

The Linux kernel printk function supports a number of ways to display kernel structures using the generic %p specification, by appending additional format characters. [8] For example, %pI4 prints an IPv4 address in dotted-decimal form. This allows static format string checking (of the %p portion) at the expense of full compatibility with normal printf.

Most languages that have a printf-like function work around the lack of this feature by just using the %s format and converting the object to a string representation.
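The same workaround is available in C itself. The sketch below converts a value to text first and then prints it with %s; the struct point type and both helper names are hypothetical and exist only for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical user-defined type with no printf conversion of its own. */
struct point { int x; int y; };

/* Hypothetical helper: renders a point as text into a caller-supplied buffer. */
const char *point_to_string(const struct point *p, char *buf, size_t size)
{
    snprintf(buf, size, "(%d, %d)", p->x, p->y);
    return buf;
}

/* The object is converted to a string first, then inserted with plain %s,
   so static format checking keeps working. */
const char *describe_point(char *out, size_t size, const struct point *p)
{
    char text[32]; /* large enough for two int values plus punctuation */
    snprintf(out, size, "point = %s", point_to_string(p, text, sizeof text));
    return out;
}
```

For a point with x = 3 and y = -4 the result is point = (3, -4).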

Vulnerabilities

Invalid conversion specifications

If there are too few function arguments provided to supply values for all the conversion specifications in the template string, or if the arguments are not of the correct types, the results are undefined and the program may crash. Implementations are inconsistent about whether syntax errors in the string consume an argument and what type of argument they consume. Excess arguments are ignored. In a number of cases, the undefined behavior has led to "format string attack" security vulnerabilities. In most C or C++ calling conventions arguments may be passed on the stack, which means that with too few arguments printf will read past the end of the current stack frame, allowing an attacker to read the stack.
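A common way to avoid this class of problem is never to use untrusted text as the format string itself. The sketch below passes the text as a %s argument instead; the helper name format_untrusted is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies untrusted text into buf without letting it
   act as a format string. */
const char *format_untrusted(char *buf, size_t size, const char *user_text)
{
    /* Unsafe alternative (do not use): snprintf(buf, size, user_text);
       any % sequences in user_text would then be interpreted and would
       consume arguments that were never supplied. */
    snprintf(buf, size, "%s", user_text);
    return buf;
}
```

With this form, input such as %d %s %n is copied through literally rather than being treated as conversion specifications.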

Some compilers, like the GNU Compiler Collection, will statically check the format strings of printf-like functions and warn about problems (when using the flags -Wall or -Wformat). GCC will also warn about user-defined printf-style functions if the non-standard "format" __attribute__ is applied to the function.
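The format attribute can be sketched as follows for a user-defined wrapper. The function log_message, the PRINTF_LIKE macro and the "[log] " prefix are hypothetical; the attribute itself is a GCC extension (also accepted by Clang), so the macro expands to nothing on other compilers.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical macro: applies the non-standard format attribute where
   the compiler is known to support it. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_index, first_arg) \
    __attribute__((format(printf, fmt_index, first_arg)))
#else
#define PRINTF_LIKE(fmt_index, first_arg)
#endif

/* Parameter 3 is the format string; checked arguments start at parameter 4. */
int log_message(char *buf, size_t size, const char *fmt, ...) PRINTF_LIKE(3, 4);

/* Hypothetical printf-style function: writes "[log] " followed by the
   formatted text and returns the total length, or a negative value on error. */
int log_message(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    int written = snprintf(buf, size, "[log] ");
    if (written < 0 || (size_t)written >= size)
        return written;

    va_start(args, fmt);
    int n = vsnprintf(buf + written, size - (size_t)written, fmt, args);
    va_end(args);
    return n < 0 ? n : written + n;
}
```

With the attribute in place, a call such as log_message(buf, sizeof buf, "%s", 42) should draw a -Wformat warning from GCC, just as a mismatched printf call would.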

Field width versus explicit delimiters in tabular output

Using only field widths to provide for tabulation, as with a format like %8d%8d%8d for three integers in three 8-character columns, will not guarantee that field separation will be retained if large numbers occur in the data:

 1234567 1234567 1234567
     123     123     123
     12312345678     123

Loss of field separation can easily lead to corrupt output. In systems which encourage the use of programs as building blocks in scripts, such corrupt data can often be forwarded into and corrupt further processing, regardless of whether the original programmer expected the output would only be read by human eyes. Such problems can be eliminated by including explicit delimiters, even spaces, in all tabular output formats. Simply changing the dangerous example from before to %7d %7d %7d addresses this, formatting identically until numbers become larger, but then explicitly preventing them from becoming merged on output due to the explicitly included spaces:

1234567 1234567 1234567
    123     123     123
    123 12345678     123

Similar strategies apply to string data.
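The difference between the two formats can be sketched by formatting the same row both ways. The helper names row_width_only and row_delimited are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: three columns separated only by field width. */
const char *row_width_only(char *buf, size_t size, int a, int b, int c)
{
    snprintf(buf, size, "%8d%8d%8d", a, b, c);
    return buf;
}

/* Hypothetical helper: narrower fields plus an explicit space delimiter. */
const char *row_delimited(char *buf, size_t size, int a, int b, int c)
{
    snprintf(buf, size, "%7d %7d %7d", a, b, c);
    return buf;
}
```

For the row 123, 12345678, 123 the first helper produces "     12312345678     123", in which the first two fields run together, while the second produces "    123 12345678     123", in which every field remains separated.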

Memory write

Although an outputting function on the surface, printf allows writing to a memory location specified by an argument via %n. This functionality is occasionally used as a part of more elaborate format-string attacks. [9]
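A benign sketch of the memory write is shown below, assuming a C library that still honors %n for constant format strings; some libraries disable or restrict it for security reasons. The helper name count_prefix is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a fixed message and returns how many
   characters had been produced when %n was reached. */
int count_prefix(char *buf, size_t size)
{
    int count = 0;
    /* %n outputs nothing; it stores the number of characters written
       so far (the 5 characters of "hello") through the int pointer. */
    snprintf(buf, size, "hello%n, world", &count);
    return count;
}
```

Here the formatting call itself modifies the variable count, which is the property that format-string attacks build on when the attacker controls the format string.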

The %n functionality also makes printf accidentally Turing-complete even with a well-formed set of arguments. A game of tic-tac-toe written in the format string is a winner of the 27th IOCCC. [10]

Programming languages with printf

Not included in this list are languages that use format strings that deviate from the style in this article (such as AMPL and Elixir), languages that inherit their implementation from the JVM or other environment (such as Clojure and Scala), and languages that do not have a standard native printf implementation but have external libraries which emulate printf behavior (such as JavaScript).

References

  1. "BCPL". cl.cam.ac.uk. Retrieved 19 March 2018.
  2. McIlroy, M. D. (1987). A Research Unix reader: annotated excerpts from the Programmer's Manual, 1971–1986 (PDF) (Technical report). CSTR. Bell Labs. 139.
  3. "printf". cplusplus.com. Retrieved 10 June 2020.
  4. ISO/IEC (1999). ISO/IEC 9899:1999(E): Programming Languages – C §7.19.6.1 para 7.
  5. "The GNU C Library Reference Manual", "12.12.3 Table of Output Conversions". Gnu.org. Retrieved 17 March 2014.
  6. "printf" (%a added in C99)
  7. "Formatting Numeric Print Output". The Java Tutorials. Oracle Inc. Retrieved 19 March 2018.
  8. "Linux kernel Documentation/printk-formats.txt". Git.kernel.org. Retrieved 17 March 2014.
  9. https://www.exploit-db.com/docs/english/28476-linux-format-string-exploitation.pdf
  10. "Best of show – abuse of libc". Ioccc.org. Retrieved 5 May 2022.
  11. "The Open Group Base Specifications Issue 7, 2018 edition", "POSIX awk", "Output Statements". pubs.opengroup.org. Retrieved 29 May 2022.
  12. "Printf Standard Library". The Julia Language Manual. Retrieved 22 February 2021.
  13. "Built-in Types: printf-style String Formatting", The Python Standard Library, Python Software Foundation, retrieved 24 February 2021