In computer science, string interning is a method of storing only one copy of each distinct
string
String or strings may refer to:
*String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects
Arts, entertainment, and media Films
* ''Strings'' (1991 film), a Canadian anim ...
value, which must be
immutable
In object-oriented and functional programming, an immutable object (unchangeable object) is an object whose state cannot be modified after it is created.Goetz et al. ''Java Concurrency in Practice''. Addison Wesley Professional, 2006, Section 3.4 ...
. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
The single copy of each string is called its ''intern'' and is typically looked up by a method of the string class, for example String.intern() in
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
. All compile-time constant strings in Java are automatically interned using this method.
String interning is supported by some modern
object-oriented
Object-oriented programming (OOP) is a programming paradigm based on the concept of " objects", which can contain data and code. The data is in the form of fields (often known as attributes or ''properties''), and the code is in the form of ...
programming language
A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language.
The description of a programming l ...
s, including Java,
Python,
PHP
PHP is a General-purpose programming language, general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementati ...
(since 5.4),
Lua
Lua or LUA may refer to:
Science and technology
* Lua (programming language)
* Latvia University of Agriculture
* Last universal ancestor, in evolution
Ethnicity and language
* Lua people, of Laos
* Lawa people, of Thailand sometimes referred t ...
and
.NET languages
CLI languages are computer programming languages that are used to produce libraries and programs that conform to the Common Language Infrastructure (CLI) specifications. With some notable exceptions, most CLI languages compile entirely to the Com ...
.
Lisp
A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech.
Types
* A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lispi ...
,
Scheme,
Julia
Julia is usually a feminine given name. It is a Latinate feminine form of the name Julio and Julius. (For further details on etymology, see the Wiktionary entry "Julius".) The given name ''Julia'' had been in use throughout Late Antiquity (e ...
,
Ruby
A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum (aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sapp ...
and
Smalltalk
Smalltalk is an object-oriented, dynamically typed reflective programming language. It was designed and created in part for educational use, specifically for constructionist learning, at the Learning Research Group (LRG) of Xerox PARC by ...
are among the languages with a
symbol
A symbol is a mark, sign, or word that indicates, signifies, or is understood as representing an idea, object, or relationship. Symbols allow people to go beyond what is known or seen by creating linkages between otherwise very different conc ...
type that are basically interned strings. The library of the
Standard ML of New Jersey contains an
atom
type that does the same thing.
Objective-C
Objective-C is a general-purpose, object-oriented programming language that adds Smalltalk-style messaging to the C programming language. Originally developed by Brad Cox and Tom Love in the early 1980s, it was selected by NeXT for its N ...
's selectors, which are mainly used as method names, are interned strings.
Objects other than strings can be interned. For example, in Java, when primitive values are
boxed Boxed may refer to:
* Boxed.com, a wholesale on-line shopping site.
* ''Boxed'' (Eurythmics), an eight album box set
* ''Boxed'' (Mike Oldfield album)
* Boxed warning
In the United States, a boxed warning (sometimes "black box warning", colloq ...
into a
wrapper object, certain values (any
boolean
, any
byte
, any
char
from 0 to 127, and any
short
or
int
between −128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object.
History
Lisp
A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech.
Types
* A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lispi ...
introduced the notion of interned strings for its
symbols
A symbol is a mark, sign, or word that indicates, signifies, or is understood as representing an idea, object, or relationship. Symbols allow people to go beyond what is known or seen by creating linkages between otherwise very different co ...
. Historically, the data structure used as a string intern pool was called an ''oblist'' (when it was implemented as a linked list) or an ''obarray'' (when it was implemented as an array).
Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose ''name'' is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.
Motivation
String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as
compiler
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
s and
dynamic programming language
In computer science, a dynamic programming language is a class of high-level programming languages, which at runtime execute many common programming behaviours that static programming languages perform during compilation. These behaviors co ...
runtimes) that rely heavily on
associative array
In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection. In mathematical terms an ...
s with string keys to look up the attributes and methods of an object. Without interning, comparing two distinct strings may involve examining every character of both.
[The string comparison can halt at the first character mismatch. For strict equality, the lengths of the strings can also be compared before traversing the string: but finding the length of null-terminated strings does itself require traversing the string.] This is slow for several reasons: it is inherently
O(n) in the length of the strings; it typically requires reads from several regions of
memory
Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered ...
, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple
object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a
network or from
storage
Storage may refer to:
Goods Containers
* Dry cask storage, for storing high-level radioactive waste
* Food storage
* Intermodal container, cargo shipping
* Storage tank
Facilities
* Garage (residential), a storage space normally used to store car ...
. Such strings may include
magic numbers or
network protocol
A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any kind of variation of a physical quantity. The protocol defines the rules, syntax, semantics and synchroni ...
information. For example, XML parsers may intern names of tags and attributes to save memory. Network transfer of objects over Java RMI serialization object streams can transfer strings that are interned more efficiently, as the String object's handle is used in place of duplicate objects upon serialization.
Issues
Multithreading
If the interned strings are not immutable, one source of drawbacks is that string interning may be problematic when mixed with
multithreading. In many systems, string interns are required to be global across all threads within an address space (or across any contexts which may share pointers), thus the intern pool(s) are global resources that should be synchronized for safe concurrent access. While this only affects string creation (where the intern pool must be checked and modified if necessary), and
double-checked locking may be used on platforms where this is a safe optimization, the need for mutual exclusion when modifying the intern pool can be expensive.
Contention can also be reduced by partitioning the string space into multiple pools, which can be synchronized independently of one another.
Reclaiming unused interned strings
Many implementations of interned strings do not attempt to reclaim (manually or otherwise) strings that are no longer used. For applications where the number of interned strings is small or fixed, or which are short-lived, the loss of system resources may be tolerable. But for long-running systems where large numbers of string interns are created at runtime, the need to reclaim unused interns may arise. This task can be handled by a
garbage collector, though for this to work correctly
weak reference
In computer programming, a weak reference is a reference that does not protect the referenced object from collection by a garbage collector, unlike a strong reference. An object referenced ''only'' by weak references – meaning "every chain of r ...
s to string interns must be stored in the intern pool.
See also
*
Flyweight pattern
Notes
References
{{Reflist
External links
Visual J# String class.NET String ClassGuava Java Library - Interner - Non-permgen String.intern and supports other immutable types with weak and strong referenced implementations
Software optimization
Interning