String interning

Last updated

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. [1] Interning strings makes some string processing tasks more time-efficient or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

Contents

The single copy of each string is called its intern and is typically looked up by a method of the string class, for example String.intern() [2] in Java. All compile-time constant strings in Java are automatically interned using this method. [3]

String interning is supported by some modern object-oriented programming languages, including Java, Python, PHP (since 5.4), Lua [4] and .NET languages. [5] Lisp, Scheme, Julia, Ruby and Smalltalk are among the languages with a symbol type that are basically interned strings. The library of the Standard ML of New Jersey contains an atom type that does the same thing. Objective-C's selectors, which are mainly used as method names, are interned strings.

Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any boolean, any byte, any char from 0 to 127, and any short or int between −128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object. [6]

History

Lisp introduced the notion of interned strings for its symbols. Historically, the data structure used as a string intern pool was called an oblist (when it was implemented as a linked list) or an obarray (when it was implemented as an array).

Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose name is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.

Motivation

String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on associative arrays with string keys to look up the attributes and methods of an object. Without interning, comparing two distinct strings may involve examining every character of both. [Note 1] This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.

String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory. Network transfer of objects over Java RMI serialization object streams can transfer strings that are interned more efficiently, as the String object's handle is used in place of duplicate objects upon serialization. [7]

Issues

Multithreading

If the interned strings are not immutable, one source of drawbacks is that string interning may be problematic when mixed with multithreading. In many systems, string interns are required to be global across all threads within an address space (or across any contexts which may share pointers), thus the intern pool(s) are global resources that should be synchronized for safe concurrent access. While this only affects string creation (where the intern pool must be checked and modified if necessary), and double-checked locking may be used on platforms where this is a safe optimization, the need for mutual exclusion when modifying the intern pool can be expensive. [8]

Contention can also be reduced by partitioning the string space into multiple pools, which can be synchronized independently of one another.

Reclaiming unused interned strings

Many implementations of interned strings do not attempt to reclaim (manually or otherwise) strings that are no longer used. For applications where the number of interned strings is small or fixed, or which are short-lived, the loss of system resources may be tolerable. But for long-running systems where large numbers of string interns are created at runtime, the need to reclaim unused interns may arise. This task can be handled by a garbage collector, though for this to work correctly weak references to string interns must be stored in the intern pool.

See also

Notes

  1. The string comparison can halt at the first character mismatch. For strict equality, the lengths of the strings can also be compared before traversing the string: but finding the length of null-terminated strings does itself require traversing the string.

Related Research Articles

<span class="mw-page-title-main">Java (programming language)</span> Object-oriented programming language

Java is a high-level, class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose programming language intended to let programmers write once, run anywhere (WORA), meaning that compiled Java code can run on all platforms that support Java without the need to recompile. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of the underlying computer architecture. The syntax of Java is similar to C and C++, but has fewer low-level facilities than either of them. The Java runtime provides dynamic capabilities that are typically not available in traditional compiled languages.

In computer science, reference counting is a programming technique of storing the number of references, pointers, or handles to a resource, such as an object, a block of memory, disk space, and others.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.

Java Platform, Standard Edition is a computing platform for development and deployment of portable code for desktop and server environments. Java SE was formerly known as Java 2 Platform, Standard Edition (J2SE).

Thread safety is a computer programming concept applicable to multi-threaded code. Thread-safe code only manipulates shared data structures in a manner that ensures that all threads behave properly and fulfill their design specifications without unintended interaction. There are various strategies for making thread-safe data structures.

In object-oriented (OO) and functional programming, an immutable object is an object whose state cannot be modified after it is created. This is in contrast to a mutable object, which can be modified after it is created. In some cases, an object is considered immutable even if some internally used attributes change, but the object's state appears unchanging from an external point of view. For example, an object that uses memoization to cache the results of expensive computations could still be considered an immutable object.

In computer programming, a reference is a value that enables a program to indirectly access a particular data, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

A method in object-oriented programming (OOP) is a procedure associated with an object, and generally also a message. An object consists of state data and behavior; these compose an interface, which specifies how the object may be used. A method is a behavior of an object parametrized by a user.

In computer science, a dynamic programming language is a class of high-level programming languages, which at runtime execute many common programming behaviours that static programming languages perform during compilation. These behaviors could include an extension of the program, by adding new code, by extending objects and definitions, or by modifying the type system. Although similar behaviors can be emulated in nearly any language, with varying degrees of difficulty, complexity and performance costs, dynamic languages provide direct tools to make use of them. Many of these features were first implemented as native features in the Lisp programming language.

In computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled programs must use. Most processors support a similar set of primitive data types, although the specific representations vary. More generally, "primitive data types" may refer to the standard data types built into a programming language. Data types which are not primitive are referred to as derived or composite.

In some programming languages, eval, short for the English evaluate, is a function which evaluates a string as though it were an expression in the language, and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval. The input to eval is not necessarily a string; it may be structured representation of code, such as an abstract syntax tree, or of special type such as code. The analog for a statement is exec, which executes a string as if it were a statement; in some languages, such as Python, both are present, while in other languages only one of either eval or exec is.

This article compares two programming languages: C# with Java. While the focus of this article is mainly the languages and their features, such a comparison will necessarily also consider some features of platforms and libraries. For a more detailed comparison of the platforms, see Comparison of the Java and .NET platforms.

A foreign function interface (FFI) is a mechanism by which a program written in one programming language can call routines or make use of services written or compiled in another one. An FFI is often used in contexts where calls are made into binary dynamic-link library.

In computer science, marshalling or marshaling is the process of transforming the memory representation of an object into a data format suitable for storage or transmission, especially between different runtimes. It is typically used when data must be moved between different parts of a computer program or from one program to another.

A symbol in computer programming is a primitive data type whose instances have a human-readable form. Symbols can be used as identifiers. In some programming languages, they are called atoms. Uniqueness is enforced by holding them in a symbol table. The most common use of symbols by programmers is to perform language reflection, and the most common indirectly is their use to create object linkages.

In computer programming, a constant is a value that is not altered by the program during normal execution. When associated with an identifier, a constant is said to be "named," although the terms "constant" and "named constant" are often used interchangeably. This is contrasted with a variable, which is an identifier with a value that can be changed during normal execution. To simplify, constants' values remains, while the values of variables varies, both hence their names.

In computer science, interning is re-using objects of equal value on-demand instead of creating new objects. This creational pattern is frequently used for numbers and strings in different programming languages. In many object-oriented languages such as Python, even primitive types such as integer numbers are objects. To avoid the overhead of constructing a large number of integer objects, these objects get reused through interning.

References

  1. "String.Intern Method (String)". Microsoft Developer Network. Retrieved 25 March 2017.
  2. String.intern()
  3. "Chapter 15. Expressions". docs.oracle.com. Retrieved 30 January 2019.
  4. "lua-users wiki: Immutable Objects". lua-users.org. Retrieved 30 January 2019.
  5. rpetrusha. "String Class (System)". docs.microsoft.com. Retrieved 30 January 2019.
  6. "Chapter 5. Conversions and Promotions". docs.oracle.com. Retrieved 30 January 2019.
  7. "Java Object Serialization Specification: 1 - System Architecture". docs.oracle.com. Retrieved 30 January 2019.
  8. "String.intern in Java 6, 7 and 8 - multithreaded access". java-performance.info. 3 September 2013. Retrieved 30 January 2019.