In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme (usually a letter, digit, punctuation mark, or whitespace) but sometimes represent symbols, control characters, or formatting. The set of all possible code points within a given encoding or character set makes up that encoding's ''codespace''.
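The mapping between characters and code points can be sketched in Python, whose built-in `ord` and `chr` convert between a one-character string and its Unicode code point. The helper name `code_point_label` is an assumption made for this illustration and does not come from any standard.

```python
# Minimal sketch: map characters to Unicode code points and back.
# code_point_label is a hypothetical helper, not part of any library.

def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

# Escapes are used so the example does not depend on the file's encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    cp = ord(ch)            # character -> code point (an integer)
    assert chr(cp) == ch    # code point -> character round-trips
    print(code_point_label(ch), cp)
```

The integer returned by `ord` is the code point itself; how that integer is stored as bytes is a separate question decided by the encoding.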
For example, the character encoding scheme ASCII comprises 128 code points in the range 0 to 7F (hexadecimal), Extended ASCII comprises 256 code points in the range 0 to FF (hexadecimal), and Unicode comprises 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
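The arithmetic of the Unicode code space can be checked with a short Python sketch. The constant names and the function `plane_of` are assumptions chosen for this illustration.

```python
# Minimal sketch of the Unicode code space arithmetic.
# All names here are illustrative, not taken from a standard API.

PLANE_SIZE = 2 ** 16                        # code points per plane
PLANE_COUNT = 17                            # plane 0 (BMP) plus 16 supplementary planes
CODE_SPACE_SIZE = PLANE_COUNT * PLANE_SIZE  # total number of code points
MAX_CODE_POINT = CODE_SPACE_SIZE - 1        # highest valid code point


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 2**16 code points, so the plane is the value
    # of the bits above the low 16.
    return code_point >> 16


print(CODE_SPACE_SIZE)        # 1114112
print(hex(MAX_CODE_POINT))    # 0x10ffff
print(plane_of(0x1F600))      # 1
```

Because every plane has the same size, the plane number is simply the code point divided by 65,536 and rounded down.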
Definition
The notion of a code point is used for abstraction, to distinguish both:
* the number from an encoding as a sequence of bits, and
* the abstract character from a particular graphical representation (glyph).
This is because one may wish to make these distinctions to:
* encode a particular code space in different ways, or
* display a character via different glyphs.
For Unicode, the particular sequence of bits consists of one or more ''code units'': in the UCS-4 encoding, any code point is encoded as a single 4-byte (octet) binary number, while in the UTF-8 encoding, different code points are encoded as sequences of one to four bytes, forming a self-synchronizing code. See comparison of Unicode encodings for details.
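The difference between the fixed-width and variable-width encodings can be observed in Python. This sketch uses the standard `utf-32-be` codec as a stand-in for UCS-4 (an assumption; the two coincide for valid Unicode code points). The function names are hypothetical.

```python
# Minimal sketch: one code point, two encodings.
# "utf-32-be" stands in for UCS-4 here; function names are illustrative.

def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-32, bytes in UTF-8) for one character."""
    return len(ch.encode("utf-32-be")), len(ch.encode("utf-8"))


def is_continuation_byte(b: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# U+20AC encodes to the three bytes E2 82 AC in UTF-8: one lead byte
# followed by two continuation bytes.
euro = "\u20ac".encode("utf-8")
print([is_continuation_byte(b) for b in euro])   # [False, True, True]
```

Since continuation bytes never look like lead bytes, a decoder that starts in the middle of a byte stream can skip forward to the next lead byte, which is what makes UTF-8 self-synchronizing.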
Code points are normally assigned to abstract characters. An ''abstract'' character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.
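As a rough illustration, Python's standard `unicodedata` module reports a general category for every code point, which makes it possible to tell assigned characters apart from unassigned or specially designated code points. The function `status` and its labels are assumptions for this sketch, and the results depend on the Unicode version bundled with the Python interpreter; U+0378 is assumed to be unassigned, as it is in current versions.

```python
import unicodedata

# Minimal sketch: classify a code point by its Unicode general category.
# The status() helper and its labels are hypothetical simplifications.

SPECIAL = {
    "Cn": "unassigned or noncharacter",
    "Co": "private use",
    "Cs": "surrogate",
}


def status(code_point: int) -> str:
    """Return a rough description of how a code point is used."""
    category = unicodedata.category(chr(code_point))
    return SPECIAL.get(category, "assigned")


for cp in (0x0041, 0x0378, 0xE000, 0xD800):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), status(cp))
```

Private-use and surrogate code points are examples of code points with designated functions other than representing an abstract character.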
The distinction between a code point and the corresponding abstract character is not pronounced in Unicode, but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
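The effect of multiple code pages sharing one code space can be sketched by decoding the same single-byte value under several of Python's standard codecs. The choice of the byte E9 (hexadecimal) and of these three code pages is only an illustration; the function name is hypothetical.

```python
# Minimal sketch: one code point value, several code pages.
# decode_under is an illustrative helper, not a library function.

def decode_under(value: int, code_page: str) -> str:
    """Interpret a single byte value according to the named code page."""
    return bytes([value]).decode(code_page)


for code_page in ("cp1252", "cp437", "cp1251"):
    ch = decode_under(0xE9, code_page)
    # Report the result as a Unicode code point so the three
    # interpretations can be compared unambiguously.
    print(code_page, f"U+{ord(ch):04X}")
```

The same value in the 0 to FF code space yields a Latin letter, a Greek letter, or a Cyrillic letter depending on the code page, whereas in Unicode each of those characters has its own code point.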
History
The concept of a code point is part of Unicode's solution to a difficult conundrum faced by character encoding developers in the 1980s. If they added more bits per character to accommodate larger character sets, that design decision would also constitute an unacceptable waste of then-scarce computing resources for Latin script users (who constituted the vast majority of computer users at the time), since those extra bits would always be zeroed out for such users.
The code point avoids this problem by breaking the old idea of a direct one-to-one correspondence between characters and particular sequences of bits.
See also
* Combining character
* Text-based (computing)
* Replacement character
* Unicode collation algorithm
External links
Codepoints.net, a site dedicated to all things characters, letters and Unicode