Uninitialized variable

Last updated February 23, 2023

In computing, an uninitialized variable is a variable that is declared but is not set to a definite known value before it is used. It will have some value, but not a predictable one. As such, it is a programming error and a common source of bugs in software.

Example of the C language

A common assumption made by novice programmers is that all variables are set to a known value, such as zero, when they are declared. While this is true for many languages, it is not true for all of them, so the potential for error exists. Languages such as C use stack space for local variables, and the collection of variables allocated for a subroutine is known as a stack frame. While the computer will set aside the appropriate amount of space for the stack frame, it usually does so simply by adjusting the value of the stack pointer, and does not set the memory itself to any new state (typically for efficiency). Therefore, whatever that memory happens to contain at the time will appear as the initial values of the variables that occupy those addresses.

Here's a simple example in C:

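    #include <stdio.h>

    void count(void)
    {
        int k, i;   /* k is never given an initial value */

        for (i = 0; i < 10; i++) {
            k = k + 1;   /* reads k before it has been set */
        }
        printf("%d", k);
    }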

The final value of k is undefined, because k is read before it has ever been assigned. The expectation that the result must be 10 assumes that k started at zero, which may or may not be true. Note that in the example, the variable i is initialized to zero by the first clause of the for statement.
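For comparison, a corrected variant gives k a definite starting value. The following is a minimal sketch; the name count_fixed and the decision to return the value as well as print it are assumptions made for illustration, not part of the original example.

```c
#include <stdio.h>

/* Hypothetical corrected variant of count(): k starts at a known value. */
int count_fixed(void)
{
    int k = 0;   /* explicit initialization */
    int i;

    for (i = 0; i < 10; i++) {
        k = k + 1;
    }
    printf("%d\n", k);
    return k;
}
```

With the initializer in place, the loop always leaves k at 10, regardless of what the stack previously contained.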

Another example arises with structs. In the code snippet below, struct student holds several fields describing a student. The function register_student leaks memory contents because it fails to fully initialize the members of its local struct student new_student. The members age, semester and student_number are initialized, but the initialization of first_name and last_name is incomplete: if the strings passed in are shorter than 15 characters, strcpy^[1] writes only those characters and the terminating null byte, leaving the rest of the 16 bytes reserved for each member untouched. Hence, after the resulting struct is copied to output with memcpy(),^[2] some stack memory is leaked to the caller.

    #include <stdio.h>
    #include <string.h>

    struct student {
        unsigned int age;
        unsigned int semester;
        char first_name[16];
        char last_name[16];
        unsigned int student_number;
    };

    /* Assumed to be defined elsewhere; the return type is a guess. */
    unsigned int get_new_student_number(void);

    int register_student(struct student *output, int age, char *first_name, char *last_name)
    {
        // If any of these pointers are null, we fail.
        if (!output || !first_name || !last_name) {
            printf("Error!\n");
            return -1;
        }

        // We make sure the length of the strings is less than 16 bytes
        // (including the null byte) in order to avoid overflows.
        if (strlen(first_name) > 15 || strlen(last_name) > 15) {
            printf("first_name and last_name cannot be longer than 15 characters!\n");
            return -1;
        }

        // Initializing the members (the bytes after each string's
        // terminating null byte are never written).
        struct student new_student;
        new_student.age = age;
        new_student.semester = 1;
        new_student.student_number = get_new_student_number();
        strcpy(new_student.first_name, first_name);
        strcpy(new_student.last_name, last_name);

        // Copying the result to output
        memcpy(output, &new_student, sizeof(struct student));
        return 0;
    }
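One way to close this leak is to zero the whole local struct before filling in its members. The sketch below illustrates that approach under stated assumptions: the name register_student_zeroed and the stub body of get_new_student_number are hypothetical and exist only so that the fragment is self-contained.

```c
#include <stdio.h>
#include <string.h>

struct student {
    unsigned int age;
    unsigned int semester;
    char first_name[16];
    char last_name[16];
    unsigned int student_number;
};

/* Hypothetical stand-in for the helper called in the original example. */
static unsigned int get_new_student_number(void)
{
    static unsigned int next_number = 1000;
    return next_number++;
}

/* Hypothetical variant of register_student() that zeroes the struct first. */
int register_student_zeroed(struct student *output, int age,
                            const char *first_name, const char *last_name)
{
    if (!output || !first_name || !last_name) {
        return -1;
    }
    if (strlen(first_name) > 15 || strlen(last_name) > 15) {
        return -1;
    }

    struct student new_student;
    /* Zero every byte up front, so the unused tails of the name arrays
       hold zeros instead of leftover stack contents. */
    memset(&new_student, 0, sizeof new_student);

    new_student.age = (unsigned int)age;
    new_student.semester = 1;
    new_student.student_number = get_new_student_number();
    strcpy(new_student.first_name, first_name);
    strcpy(new_student.last_name, last_name);

    memcpy(output, &new_student, sizeof new_student);
    return 0;
}
```

After a successful call, the bytes of first_name and last_name that follow the terminating null byte are zero rather than whatever the stack held before the call.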

In any case, even when a variable is implicitly initialized to a default value such as 0, this is typically not the correct value. Being initialized does not mean being correct if the value is merely a default one. (However, default initialization to 0 is good practice for pointers and arrays of pointers, since it makes them invalid until they are actually set to their correct value.) In C, variables with static storage duration that are not initialized explicitly are initialized to zero (or null, for pointers).^[3]
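The rule for static storage duration can be illustrated with a short sketch. All names below (total_calls, last_label, record_call, and so on) are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>

static int total_calls;          /* file scope, static storage: starts at 0 */
static const char *last_label;   /* static storage: starts as a null pointer */

/* Counts calls using a function-local static, which also starts at 0
   and keeps its value between calls. */
int record_call(void)
{
    static int calls_here;
    calls_here++;
    total_calls++;
    return calls_here;
}

int total_calls_so_far(void)
{
    return total_calls;
}

int label_is_unset(void)
{
    return last_label == NULL;
}
```

None of these objects has an explicit initializer, yet each has a well-defined starting value. The same declarations without the static keyword inside a function body would have indeterminate initial values.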

Not only are uninitialized variables a frequent cause of bugs, but this kind of bug is particularly serious because it may not be reproducible: for instance, a variable may remain uninitialized only in some branch of the program. In some cases, programs with uninitialized variables may even pass software tests.
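The branch-dependent case can be sketched as follows. The function level_from_name and the sentinel value -1 are assumptions made for this illustration; the point is only that an initializer at the declaration covers the paths that no branch handles.

```c
#include <string.h>

/* Hypothetical helper: maps a textual level to a number. */
int level_from_name(const char *name)
{
    int level = -1;   /* definite default for names matched by no branch */

    if (strcmp(name, "low") == 0) {
        level = 1;
    } else if (strcmp(name, "high") == 0) {
        level = 3;
    }
    /* Without the initializer above, any other name would reach this
       point with level never assigned. */
    return level;
}
```

A test suite that only exercises "low" and "high" would pass even if the initializer were missing, which is how such a bug can go unnoticed.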

Impacts

Uninitialized variables are dangerous bugs, since depending on the case they can be exploited to leak arbitrary memory, to achieve an arbitrary memory overwrite, or to gain code execution. When exploiting software that uses address space layout randomization (ASLR), it is often necessary to know the base address of the software in memory. Exploiting an uninitialized variable so that the software leaks a pointer from its address space can be used to bypass ASLR.

Use in languages

Uninitialized variables are a particular problem in languages such as assembly language, C, and C++, which were designed for systems programming. The development of these languages involved a design philosophy in which conflicts between performance and safety were generally resolved in favor of performance. The programmer was given the burden of being aware of dangerous issues such as uninitialized variables.

In other languages, variables are often initialized to known values when created. Examples include:

VHDL initializes all standard variables to a special 'U' value. It is used in simulation, for debugging, to let the user know when don't-care initial values, propagated through the multi-valued logic, affect the output.
Java does not have uninitialized variables. Fields of classes and objects that do not have an explicit initializer and elements of arrays are automatically initialized with the default value for their type (false for boolean, 0 for all numerical types, null for all reference types).^[4] Local variables in Java must be definitely assigned to before they are accessed, or it is a compile error.
Python initializes local variables to NULL (distinct from None) and raises an UnboundLocalError when such a variable is accessed before being (re)initialized to a valid value.
D initializes all variables unless explicitly specified by the programmer not to.

Even in languages where uninitialized variables are allowed, many compilers will attempt to identify the use of uninitialized variables and report them as compile-time errors. Some languages assist this task by offering constructs to handle the initializedness of variables; for example, C# has a special flavour of call-by-reference parameters to subroutines (specified as out instead of the usual ref), asserting that the variable is allowed to be uninitialized on entry but will be initialized afterwards.
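C has no counterpart to out parameters that the compiler enforces, but the same contract can be followed by convention. The sketch below is an assumption-laden illustration, with the hypothetical helper try_parse_digit: the function writes through its pointer argument on every path, so the caller's variable is defined after the call even though it was uninitialized before it.

```c
#include <stdbool.h>

/* Hypothetical helper: always stores a value in *digit before returning. */
bool try_parse_digit(char c, int *digit)
{
    if (c >= '0' && c <= '9') {
        *digit = c - '0';
        return true;
    }
    *digit = 0;   /* assigned on failure as well */
    return false;
}
```

A caller can declare int d; without an initializer, pass &d, and then read d safely after the call. Unlike in C#, nothing in the language checks that the function keeps this promise.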

References

[1] strcpy
[2] memcpy()
[3] "ISO/IEC 9899:TC3 (Current C standard)" (PDF). 2007-09-07. p. 126. Retrieved 2008-09-26. Section 6.7.8, paragraph 10.
[4] "Java Language Specification: 4.12.5 Initial Values of Variables". Sun Microsystems. Retrieved 2008-10-18.
