String interning

Last updated

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. [1] Interning strings makes some string processing tasks more time-efficient or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

Contents

The single copy of each string is called its intern and is typically looked up by a method of the string class, for example String.intern() [2] in Java. All compile-time constant strings in Java are automatically interned using this method. [3]

String interning is supported by some modern object-oriented programming languages, including Java, Python, PHP (since 5.4), Lua [4] and .NET languages. [5] Lisp, Scheme, Julia, Ruby and Smalltalk are among the languages with a symbol type that are basically interned strings. The library of the Standard ML of New Jersey contains an atom type that does the same thing. Objective-C's selectors, which are mainly used as method names, are interned strings.

Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any boolean, any byte, any char from 0 to 127, and any short or int between −128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object. [6]

History

Lisp introduced the notion of interned strings for its symbols. Historically, the data structure used as a string intern pool was called an oblist (when it was implemented as a linked list) or an obarray (when it was implemented as an array).

Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose name is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.

Motivation

String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on associative arrays with string keys to look up the attributes and methods of an object. Without interning, comparing two distinct strings may involve examining every character of both. [Note 1] This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.

String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory. Network transfer of objects over Java RMI serialization object streams can transfer strings that are interned more efficiently, as the String object's handle is used in place of duplicate objects upon serialization. [7]

Issues

Multithreading

If the interned strings are not immutable, one source of drawbacks is that string interning may be problematic when mixed with multithreading. In many systems, string interns are required to be global across all threads within an address space (or across any contexts which may share pointers), thus the intern pool(s) are global resources that should be synchronized for safe concurrent access. While this only affects string creation (where the intern pool must be checked and modified if necessary), and double-checked locking may be used on platforms where this is a safe optimization, the need for mutual exclusion when modifying the intern pool can be expensive. [8]

Contention can also be reduced by partitioning the string space into multiple pools, which can be synchronized independently of one another.

Reclaiming unused interned strings

Many implementations of interned strings do not attempt to reclaim (manually or otherwise) strings that are no longer used. For applications where the number of interned strings is small or fixed, or which are short-lived, the loss of system resources may be tolerable. But for long-running systems where large numbers of string interns are created at runtime, the need to reclaim unused interns may arise. This task can be handled by a garbage collector, though for this to work correctly weak references to string interns must be stored in the intern pool.

See also

Notes

  1. The string comparison can halt at the first character mismatch. For strict equality, the lengths of the strings can also be compared before traversing the string: but finding the length of null-terminated strings does itself require traversing the string.

Related Research Articles

In computer science, reference counting is a programming technique of storing the number of references, pointers, or handles to a resource, such as an object, a block of memory, disk space, and others.

<span class="mw-page-title-main">Serialization</span> Conversion process for computer data

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of objects does not include any of their associated methods with which they were previously linked.

Java Platform, Standard Edition is a computing platform for development and deployment of portable code for desktop and server environments. Java SE was formerly known as Java 2 Platform, Standard Edition (J2SE).

In multi-threaded computer programming, a function is thread-safe when it can be invoked or accessed concurrently by multiple threads without causing unexpected behavior, race conditions, or data corruption. As in the multi-threaded context where a program executes several threads simultaneously in a shared address space and each of those threads has access to every other thread's memory, thread-safe functions need to ensure that all those threads behave properly and fulfill their design specifications without unintended interaction.

In object-oriented (OO) and functional programming, an immutable object is an object whose state cannot be modified after it is created. This is in contrast to a mutable object, which can be modified after it is created. In some cases, an object is considered immutable even if some internally used attributes change, but the object's state appears unchanging from an external point of view. For example, an object that uses memoization to cache the results of expensive computations could still be considered an immutable object.

In computer programming, a reference is a value that enables a program to indirectly access a particular datum, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

A dynamic programming language is a type of programming language that allows various operations to be determined and executed at runtime. This is different from the compilation phase. Key decisions about variables, method calls, or data types are made when the program is running, unlike in static languages, where the structure and types are fixed during compilation. Dynamic languages provide flexibility. This allows developers to write more adaptable and concise code.

In computer science, reflective programming or reflection is the ability of a process to examine, introspect, and modify its own structure and behavior.

In computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled programs must use. Most processors support a similar set of primitive data types, although the specific representations vary. More generally, primitive data types may refer to the standard data types built into a programming language. Data types which are not primitive are referred to as derived or composite.

In object-oriented programming such as is often used in C++ and Object Pascal, a virtual function or virtual method is an inheritable and overridable function or method that is dispatched dynamically. Virtual functions are an important part of (runtime) polymorphism in object-oriented programming (OOP). They allow for the execution of target functions that were not precisely identified at compile time.

In some programming languages, eval, short for evaluate, is a function which evaluates a string as though it were an expression in the language, and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval. The input to eval is not necessarily a string; it may be structured representation of code, such as an abstract syntax tree, or of special type such as code. The analog for a statement is exec, which executes a string as if it were a statement; in some languages, such as Python, both are present, while in other languages only one of either eval or exec is.

This article compares two programming languages: C# with Java. While the focus of this article is mainly the languages and their features, such a comparison will necessarily also consider some features of platforms and libraries.

<span class="mw-page-title-main">Java syntax</span> Set of rules defining correctly structured program

The syntax of Java is the set of rules defining how a Java program is written and interpreted.

In the Java programming language, the final keyword is used in several contexts to define an entity that can only be assigned once.

A foreign function interface (FFI) is a mechanism by which a program written in one programming language can call routines or make use of services written or compiled in another one. An FFI is often used in contexts where calls are made into a binary dynamic-link library.

In computer science, marshalling or marshaling is the process of transforming the memory representation of an object into a data format suitable for storage or transmission, especially between different runtimes. It is typically used when data must be moved between different parts of a computer program or from one program to another.

A symbol in computer programming is a primitive data type whose instances have a human-readable form. Symbols can be used as identifiers. In some programming languages, they are called atoms. Uniqueness is enforced by holding them in a symbol table. The most common use of symbols by programmers is to perform language reflection, and the most common indirectly is their use to create object linkages.

In computer programming, a constant is a value that is not altered by the program during normal execution. When associated with an identifier, a constant is said to be "named," although the terms "constant" and "named constant" are often used interchangeably. This is contrasted with a variable, which is an identifier with a value that can be changed during normal execution. To simplify, constants' values remains, while the values of variables varies, hence both their names.

In computer science, interning is re-using objects of equal value on-demand instead of creating new objects. This creational pattern is frequently used for numbers and strings in different programming languages. In many object-oriented languages such as Python, even primitive types such as integer numbers are objects. To avoid the overhead of constructing a large number of integer objects, these objects get reused through interning.

References

  1. "String.Intern Method (String)". Microsoft Developer Network. Retrieved 25 March 2017.
  2. String.intern()
  3. "Chapter 15. Expressions". docs.oracle.com. Retrieved 30 January 2019.
  4. "lua-users wiki: Immutable Objects". lua-users.org. Retrieved 30 January 2019.
  5. rpetrusha. "String Class (System)". docs.microsoft.com. Retrieved 30 January 2019.
  6. "Chapter 5. Conversions and Promotions". docs.oracle.com. Retrieved 30 January 2019.
  7. "Java Object Serialization Specification: 1 - System Architecture". docs.oracle.com. Retrieved 30 January 2019.
  8. "String.intern in Java 6, 7 and 8 - multithreaded access". java-performance.info. 3 September 2013. Retrieved 30 January 2019.