UTF-1 is a method of transforming
ISO/IEC 10646/
Unicode
Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
into a stream of
byte
The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit ...
s. Its design does not provide
self-synchronization, which makes searching for
substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by
UTF-8
UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''.
UTF-8 is capable of ...
.
Design
Similar to
UTF-8
UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''.
UTF-8 is capable of ...
, UTF-1 is a
variable-width encoding
A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation, usually in a computer. Most common variable-width encodings a ...
that is backwards-compatible with
ASCII
ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
. Every
Unicode
Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
code point
In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—bu ...
is represented by either a single byte, or a sequence of two, three, or ''five'' bytes. All ASCII code points are a single byte (the code points through are also single bytes).
UTF-1 does not use the
C0 and C1 control codes
The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a curso ...
or the space character in multi-byte encodings: a byte in the range 0–
0x20 or 0x7F–0x9F always stands for the corresponding code point. This design with 66 ''protected'' characters tried to be
ISO/IEC 2022
ISO/IEC 2022 ''Information technology—Character code structure and extension techniques'', is an ISO/ IEC standard (equivalent to the ECMA standard ECMA-35, the ANSI standard ANSI X3.41 and the Japanese Industrial Standard JIS X 0202) in the ...
compatible.
UTF-1 uses "
modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (; ).
BOCU-1 protects only the minimal set required for
MIME
Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Messa ...
-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original
Universal Character Set
The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/ IEC 10646, ''Information technology — Universal Coded Character Set (UCS)'' (plus amendments to that standard), ...
(
UCS-4
UTF-32 (32- bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicod ...
), and the last entry in this table shows this original final code point.
See also
*
Comparison of Unicode encodings
*
Universal Character Set
The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/ IEC 10646, ''Information technology — Universal Coded Character Set (UCS)'' (plus amendments to that standard), ...
References
*
*
*
*
{{DEFAULTSORT:Utf-01
Unicode Transformation Formats