HOME

TheInfoList



OR:

In
computer programming Computer programming or coding is the composition of sequences of instructions, called computer program, programs, that computers can follow to perform tasks. It involves designing and implementing algorithms, step-by-step specifications of proc ...
, a null-terminated string is a
character string In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...
stored as an array containing the characters and terminated with a ''
null character The null character is a control character with the value zero. Many character sets include a code point for a null character including Unicode (Universal Coded Character Set), ASCII (ISO/IEC 646), Baudot, ITA2 codes, the C0 control code, and EB ...
'' (a character with an internal value of zero, called "NUL" in this article, not same as the
glyph A glyph ( ) is any kind of purposeful mark. In typography, a glyph is "the specific shape, design, or representation of a character". It is a particular graphical representation, in a particular typeface, of an element of written language. A ...
zero). Alternative names are '' C string'', which refers to the
C programming language C (''pronounced'' '' – like the letter c'') is a general-purpose programming language. It was created in the 1970s by Dennis Ritchie and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of ...
and ASCIIZ (although C can use encodings other than
ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
). The length of a string is found by searching for the (first) NUL. This can be slow as it takes O(''n'') (
linear time In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations ...
) with respect to the string length. It also means that a string cannot contain a NUL (there is a NUL in memory, but it is after the last character, not the string).


History

Null-terminated strings were produced by the .ASCIZ directive of the
PDP-11 The PDP–11 is a series of 16-bit minicomputers originally sold by Digital Equipment Corporation (DEC) from 1970 into the late 1990s, one of a set of products in the Programmed Data Processor (PDP) series. In total, around 600,000 PDP-11s of a ...
assembly language In computing, assembly language (alternatively assembler language or symbolic machine code), often referred to simply as assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence bet ...
s and the ASCIZ directive of the
MACRO-10 MACRO-10 is an assembly language with extensive macro facilities for DEC's PDP-10-based Mainframe computer systems, the DECsystem-10 and the DECSYSTEM-20. MACRO-10 is implemented as a two-pass assembler. Programming examples A simple " Hello, ...
macro assembly language for the
PDP-10 Digital Equipment Corporation (DEC)'s PDP-10, later marketed as the DECsystem-10, is a mainframe computer family manufactured beginning in 1966 and discontinued in 1983. 1970s models and beyond were marketed under the DECsystem-10 name, especi ...
. These predate the development of the C programming language, but other forms of strings were often used. At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is " length-prefixed"), used a leading ''byte'' to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time), but limited string length to 255 characters. C designer
Dennis Ritchie Dennis MacAlistair Ritchie (September 9, 1941 – October 12, 2011) was an American computer scientist. He created the C programming language and the Unix operating system and B language with long-time colleague Ken Thompson. Ritchie and Thomp ...
chose to follow the convention of null-termination to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator. This had some influence on CPU
instruction set In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, s ...
design. Some CPUs in the 1970s and 1980s, such as the
Zilog Z80 The Zilog Z80 is an 8-bit computing, 8-bit microprocessor designed by Zilog that played an important role in the evolution of early personal computing. Launched in 1976, it was designed to be Backward compatibility, software-compatible with the ...
and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992 and the vector string instructions to the IBM z13 in 2015.IBM z/Architecture Principles of Operation
/ref>
FreeBSD FreeBSD is a free-software Unix-like operating system descended from the Berkeley Software Distribution (BSD). The first version was released in 1993 developed from 386BSD, one of the first fully functional and free Unix clones on affordable ...
developer Poul-Henning Kamp, writing in ''
ACM Queue ACM ''Queue'' (stylized ''acmqueue'') is a bimonthly computer magazine, targeted to software engineer Software engineering is a branch of both computer science and engineering focused on designing, developing, testing, and maintaining softwar ...
'', referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.


Limitations

While simple to implement, this representation has been prone to errors and performance problems. Null-termination has historically created security problems. A NUL inserted into the middle of a string will truncate it unexpectedly. A common bug was to not allocate the additional space for the NUL, so it was written over adjacent memory. Another was to not write the NUL at all, which was often not detected during testing because the block of memory already contained zeros. Due to the expense of finding the length, many programs did not bother before copying a string to a fixed-size buffer, causing a buffer overflow if it was too long. The inability to store a zero requires that text and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used. The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(''n'') anyway, such as in
strlcpy The C programming language has a set of functions implementing operations on strings (character strings and byte strings) in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. F ...
. However, this does not always result in an intuitive
API An application programming interface (API) is a connection between computers or between computer programs. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how to build ...
.


Character encodings

Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible
ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
or
UTF-8 UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode Transformation Format 8-bit''. Almost every webpage is transmitted as UTF-8. UTF-8 supports all 1,112,0 ...
string. However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use " modified UTF-8" which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allow all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an overlong encoding, and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.
UTF-16 UTF-16 (16-bit Unicode Transformation Format) is a character encoding that supports all 1,112,064 valid code points of Unicode. The encoding is variable-length as code points are encoded with one or two ''code units''. UTF-16 arose from an earli ...
uses 2-byte integers and as either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit
UTF-16 UTF-16 (16-bit Unicode Transformation Format) is a character encoding that supports all 1,112,064 valid code points of Unicode. The encoding is variable-length as code points are encoded with one or two ''code units''. UTF-16 arose from an earli ...
characters, terminated by a 16-bit NUL (0x0000).


Improvements

Many attempts to make C string handling less error prone have been made. One strategy is to add safer functions such as strdup and
strlcpy The C programming language has a set of functions implementing operations on strings (character strings and byte strings) in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. F ...
, whilst deprecating the use of unsafe functions such as gets. Another is to add an object-oriented wrapper around C strings so that only safe calls can be done. However, it is possible to call the unsafe functions anyway. Most modern libraries replace C strings with a structure containing a 32-bit or larger length value (far more than were ever considered for length-prefixed strings), and often add another pointer, a reference count, and even a NUL to speed up conversion back to a C string. Memory is far larger now, such that if the addition of 3 (or 16, or more) bytes to each string is a real problem the software will have to be dealing with so many small strings that some other storage method will save even more memory (for instance there may be so many duplicates that a
hash table In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps Unique key, keys to Value (computer science), values. ...
will use less memory). Examples include the C++
Standard Template Library The Standard Template Library (STL) is a software library originally designed by Alexander Stepanov for the C++ programming language that influenced many parts of the C++ Standard Library. It provides four components called ''algorithms'', '' ...
std::string, the Qt QString, the MFC CString, and the C-based implementation CFString from
Core Foundation Core Foundation (also called CF) is a C application programming interface (API) written by Apple Inc. for its operating systems, and is a mix of low-level routines and wrapper functions. Most Core Foundation routines follow a certain naming c ...
as well as its
Objective-C Objective-C is a high-level general-purpose, object-oriented programming language that adds Smalltalk-style message passing (messaging) to the C programming language. Originally developed by Brad Cox and Tom Love in the early 1980s, it was ...
sibling NSString from Foundation, both by Apple. More complex structures may also be used to store strings such as the
rope A rope is a group of yarns, Plying, plies, fibres, or strands that are plying, twisted or braided together into a larger and stronger form. Ropes have high tensile strength and can be used for dragging and lifting. Rope is thicker and stronger ...
.


See also

*
Empty string In formal language theory, the empty string, or empty word, is the unique String (computer science), string of length zero. Formal theory Formally, a string is a finite, ordered sequence of character (symbol), characters such as letters, digits ...
*
Sentinel value In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value in the context of an algorithm which uses its presence as a condition of termination, typically ...


References

{{Data types String data structures es:C string