Garbage (computer science)

Last updated

In computer science, garbage includes data, objects, or other regions of the memory of a computer system (or other system resources), which will not be used in any future computation by the system, or by a program running on it. Because every computer system has a finite amount of memory, and most software produces garbage, it is frequently necessary to deallocate memory that is occupied by garbage and return it to the heap, or memory pool, for reuse.

Contents

Classification

Garbage is generally classified into two types: syntactic garbage, any object or data which is within a program's memory space but unreachable from the program's root set; and semantic garbage, any object or data which is never accessed by a running program for any combination of program inputs. Objects and data which are not garbage are said to be live.

Casually stated, syntactic garbage is data that cannot be reached, and semantic garbage is data that will not be reached. More precisely, syntactic garbage is data that is unreachable due to the reference graph (there is no path to it), which can be determined by many algorithms, as discussed in tracing garbage collection, and only requires analyzing the data, not the code. Semantic garbage is data that will not be accessed, either because it is unreachable (hence also syntactic garbage), or is reachable but will not be accessed; this latter requires analysis of the code, and is in general an undecidable problem.

Syntactic garbage is a (usually strict) subset of semantic garbage, as it is entirely possible for an object to hold a reference to another object without ever using that object.

Example

In the following simple stack implementation in Java, each element popped from the stack becomes semantic garbage once there are no outside references to it: [lower-alpha 1]

publicclassStack{privateObject[]elements;privateintsize;publicStack(intcapacity){elements=newObject[capacity];}publicvoidpush(Objecte){elements[size++]=e;}publicObjectpop(){returnelements[--size];}}

This is because elements[] still contains a reference to the object, but the object will never be accessed again through this reference, because elements[] is private to the class and the pop method only returns references to elements it has not already popped. (After it decrements size, this class will never access that element again.) However, knowing this requires analysis of the code of the class, which is undecidable in general.

If a later push call re-grows the stack to the previous size, overwriting this last reference, then the object will become syntactic garbage, because it can never be accessed again, and will be eligible for garbage collection.

Automatic garbage collection

An example of the automatic collection of syntactic garbage, by reference counting garbage collection, can be produced using the Python command-line interpreter:

>>> classFoo:... """This is an empty testing class."""... pass...>>> bar=Foo()>>> bar<__main__.Foo object at 0x54f30>>>> delbar

In this session, an object is created, its location in the memory is displayed, and the only reference to the object is then destroyed—there is no way to ever use the object again from this point on, as there are no references to it. This becomes apparent when we try to access the original reference:

>>> barTraceback (most recent call last):   File "<stdin>", line 1, in ?NameError: name 'bar' is not defined

As it is now impossible to refer to the object, the object has become useless; it is garbage. Since Python uses garbage collection, it automatically deallocates the memory that was used for the object so that it may be used again:

>>> classBar:... """This is another testing class."""... pass...>>> baz=Bar()>>> baz<__main__.Bar object at 0x54f30>

The Bar instance now resides at the memory location 0x54f30; at the same place as where our previous object, the Foo instance, was located. Since the Foo instance was destroyed, freeing up the memory used to contain it, the interpreter creates the Bar object at the same memory location as before, making good use of the available resources.

Effects

Garbage consumes heap memory, and thus one wishes to collect it (to minimize memory use, allow faster memory allocation, and prevent out-of-memory errors by reducing heap fragmentation and memory use).

However, collecting garbage takes time and, if done manually, requires coding overhead. Further, collecting garbage destroys objects and thus can cause calls to finalizers, executing potentially arbitrary code at an arbitrary point in the program's execution. Incorrect garbage collection (deallocating memory that is not garbage), primarily due to errors in manual garbage collection (rather than errors in garbage collectors), results in memory safety violations (that often create security holes) due to use of dangling pointers.

Syntactic garbage can be collected automatically, and garbage collectors have been extensively studied and developed. Semantic garbage cannot be automatically collected in general, and thus causes memory leaks even in garbage-collected languages. Detecting and eliminating semantic garbage is typically done using a specialized debugging tool called a heap profiler, which allows one to see which objects are live and how they are reachable, enabling one to remove the unintended reference.

Eliminating garbage

The problem of managing the deallocation of garbage is well-known in computer science. Several approaches are taken:

Notes

  1. Simplified from Effective Java Item 6 by omitting resizing and explicit exceptions.

Related Research Articles

Garbage collection (computer science) Form of automatic memory management

In computer science, garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced—also called garbage. Garbage collection was invented by American computer scientist John McCarthy around 1959 to simplify manual memory management in Lisp.

In computer science, a memory leak is a type of resource leak that occurs when a computer program incorrectly manages memory allocations in a way that memory which is no longer needed is not released. A memory leak may also happen when an object is stored in memory but cannot be accessed by the running code. A memory leak has symptoms similar to a number of other problems and generally can only be diagnosed by a programmer with access to the program's source code.

Java and C++ are two prominent object-oriented programming languages. By many language popularity metrics, the two languages have dominated object-oriented and high-performance software development for much of the 21st century, and are often directly compared and contrasted. Java appeared about 10 years later and its syntax was based on C/C++.

In programming languages, a closure, also lexical closure or function closure, is a technique for implementing lexically scoped name binding in a language with first-class functions. Operationally, a closure is a record storing a function together with an environment. The environment is a mapping associating each free variable of the function with the value or reference to which the name was bound when the closure was created. Unlike a plain function, a closure allows the function to access those captured variables through the closure's copies of their values or references, even when the function is invoked outside their scope.

Memory management Computer memory management methodology

Memory management is a form of resource management applied to computer memory. The essential requirement of memory management is to provide ways to dynamically allocate portions of memory to programs at their request, and free it for reuse when no longer needed. This is critical to any advanced computer system where more than a single process might be underway at any time.

In object-oriented programming (OOP), the object lifetime of an object is the time between an object's creation and its destruction. Rules for object lifetime vary significantly between languages, in some cases between implementations of a given language, and lifetime of a particular object may vary from one run of the program to another.

C dynamic memory allocation refers to performing manual memory management for dynamic memory allocation in the C programming language via a group of functions in the C standard library, namely malloc, realloc, calloc and free.

In computer programming, tracing garbage collection is a form of automatic memory management that consists of determining which objects should be deallocated by tracing which objects are reachable by a chain of references from certain "root" objects, and considering the rest as "garbage" and collecting them. Tracing garbage collection is the most common type of garbage collection – so much so that "garbage collection" often refers to tracing garbage collection, rather than other methods such as reference counting – and there are a large number of algorithms used in implementation.

Pointer (computer programming) Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

Dangling pointer Pointer that does not point to a valid object

Dangling pointers and wild pointers in computer programming are pointers that do not point to a valid object of the appropriate type. These are special cases of memory safety violations. More generally, dangling references and wild references are references that do not resolve to a valid destination, and include such phenomena as link rot on the internet.

In computer science, a finalizer or finalize method is a special method that performs finalization, generally some form of cleanup. A finalizer is executed during object destruction, prior to the object being deallocated, and is complementary to an initializer, which is executed during object creation, following allocation. Finalizers are strongly discouraged by some, due to difficulty in proper use and the complexity they add, and alternatives are suggested instead, mainly the dispose pattern – see problems with finalizers.

In object-oriented programming, a destructor is a method which is invoked mechanically just before the memory of the object is released. It can happen when its lifetime is bound to scope and the execution leaves the scope, when it is embedded in another object whose lifetime ends, or when it was allocated dynamically and is released explicitly. Its main purpose is to free the resources which were acquired by the object during its life and/or deregister from other entities which may keep references to it. Use of destructors is needed for the process of Resource Acquisition Is Initialization (RAII).

In computer science, unreachable memory is a block of memory allocated dynamically where the program that allocated the memory no longer has any reachable pointer that refers to it. Similarly, an unreachable object is a dynamically allocated object that has no reachable reference to it. Informally, unreachable memory is dynamic memory that the program can not reach directly, nor get to by starting at an object it can reach directly, and then following a chain of pointer references.

In computer science, manual memory management refers to the usage of manual instructions by the programmer to identify and deallocate unused objects, or garbage. Up until the mid-1990s, the majority of programming languages used in industry supported manual memory management, though garbage collection has existed since 1959, when it was introduced with Lisp. Today, however, languages with garbage collection such as Java are increasingly popular and the languages Objective-C and Swift provide similar functionality through Automatic Reference Counting. The main manually managed languages still in widespread use today are C and C++ – see C dynamic memory allocation.

In compiler optimization, escape analysis is a method for determining the dynamic scope of pointers – where in the program a pointer can be accessed. It is related to pointer analysis and shape analysis.

Memory safety is the state of being protected from various software bugs and security vulnerabilities when dealing with memory access, such as buffer overflows and dangling pointers. For example, Java is said to be memory-safe because its runtime error detection checks array bounds and pointer dereferences. In contrast, C and C++ allow arbitrary pointer arithmetic with pointers implemented as direct memory addresses with no provision for bounds checking, and thus are potentially memory-unsafe.

In computer programming, a variable is an abstract storage location paired with an associated symbolic name, which contains some known or unknown quantity of information referred to as a value; or in simpler terms, a variable is a container for a particular set of bits or type of data. A variable can eventually be associated with or identified by a memory address. The variable name is the usual way to reference the stored value, in addition to referring to the variable itself, depending on the context. This separation of name and content allows the name to be used independently of the exact information it represents. The identifier in computer source code can be bound to a value during run time, and the value of the variable may thus change during the course of program execution.

In computer programming, a constant is a value that should not be altered by the program during normal execution, i.e., the value is constant. When associated with an identifier, a constant is said to be "named," although the terms "constant" and "named constant" are often used interchangeably. This is contrasted with a variable, which is an identifier with a value that can be changed during normal execution, i.e., the value is variable. Constants are useful for both programmers and compilers: For programmers they are a form of self-documenting code and allow reasoning about correctness, while for compilers they allow compile-time and run-time checks that verify that constancy assumptions are not violated, and allow or simplify some compiler optimizations.

In computer science, region-based memory management is a type of memory management in which each allocated object is assigned to a region. A region, also called a zone, arena, area, or memory context, is a collection of allocated objects that can be efficiently reallocated or deallocated all at once. Like stack allocation, regions facilitate allocation and deallocation of memory with low overhead; but they are more flexible, allowing objects to live longer than the stack frame in which they were allocated. In typical implementations, all objects in a region are allocated in a single contiguous range of memory addresses, similarly to how stack frames are typically allocated.

In object-oriented programming languages with garbage collection, object resurrection is when an object comes back to life during the process of object destruction, as a side effect of a finalizer being executed.