In character encoding terminology, a code point or code position is any of the numerical values that make up the codespace. Many code points represent single characters, but they can also have other meanings, such as formatting. For example, the character encoding scheme ASCII comprises 128 code points in the range 0 to 7F (hexadecimal), Extended ASCII comprises 256 code points in the range 0 to FF (hexadecimal), and Unicode comprises 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). The Unicode code space is divided into seventeen planes (the basic multilingual plane and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
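
As a rough illustration of the plane arithmetic above, the following Python sketch computes the size of the Unicode code space and the plane that contains a given code point. The names `PLANE_SIZE`, `PLANE_COUNT`, and `plane_of` are invented for this example and are not part of any standard API.

```python
PLANE_SIZE = 2 ** 16                         # code points per plane
PLANE_COUNT = 17                             # plane 0 (BMP) plus 16 supplementary planes
CODE_SPACE_SIZE = PLANE_COUNT * PLANE_SIZE   # total number of code points
MAX_CODE_POINT = CODE_SPACE_SIZE - 1         # highest code point, 10FFFF hexadecimal


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the given code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point // PLANE_SIZE


if __name__ == "__main__":
    print(CODE_SPACE_SIZE)        # 1114112
    print(hex(MAX_CODE_POINT))    # 0x10ffff
    print(plane_of(ord("A")))     # 0, the basic multilingual plane
    print(plane_of(0x1F600))      # 1, the first supplementary plane
```

Integer division by the plane size is enough here because every plane holds exactly 2^16 consecutive code points.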


Definition

The notion of a code point is used for abstraction, to distinguish both:
* the number from an encoding as a sequence of bits, and
* the abstract character from a particular graphical representation (glyph).

This is because one may wish to make these distinctions to:
* encode a particular code space in different ways, or
* display a character via different glyphs.

For Unicode, the particular sequence of bits is called a code unit: in the UCS-4 (UTF-32) encoding, every code point is encoded as a 4-byte (octet) binary number, while in the UTF-8 encoding, different code points are encoded as sequences from one to four bytes long, forming a self-synchronizing code. See comparison of Unicode encodings for details.

Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.

The distinction between a code point and the corresponding abstract character is not pronounced in Unicode, but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
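
To make the difference between a code point and its encoded bytes more concrete, here is a small Python sketch that encodes a few code points with the built-in UTF-8, UTF-16, and UTF-32 codecs and counts the resulting bytes. The helper name `encoded_lengths` and the sample code points are choices made for this illustration only.

```python
def encoded_lengths(code_point: int) -> dict[str, int]:
    """Return the byte length of one code point in three Unicode encodings."""
    character = chr(code_point)
    # The big-endian codec variants are used so that no byte order mark
    # is prepended and the counts reflect the character alone.
    return {
        "utf-8": len(character.encode("utf-8")),
        "utf-16": len(character.encode("utf-16-be")),
        "utf-32": len(character.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # One sample each from the 1-, 2-, 3-, and 4-byte UTF-8 ranges.
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:04X}", encoded_lengths(cp))
```

With these samples, the same code point U+0041 occupies one byte in UTF-8 and four in UTF-32, while U+1F600 occupies four bytes in all three encodings. The number stays fixed; only its byte-level representation changes.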


History

The concept of a code point is part of Unicode's solution to a difficult conundrum faced by character encoding developers in the 1980s. If they added more bits per character to accommodate larger character sets, that design decision would constitute an unacceptable waste of then-scarce computing resources for Latin script users (who constituted the vast majority of computer users at the time), since those extra bits would always be zeroed out for such users. The code point avoids this problem by breaking the old idea of a direct one-to-one correspondence between characters and particular sequences of bits.


See also

* Combining character
* Text-based (computing)
* Replacement character
* Unicode collation algorithm


External links


Codepoints.net, a site dedicated to all things characters, letters and Unicode