Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of the Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.

For comparison, SCSU was adopted as a standard Unicode compression scheme with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME “text” media types; for example, it cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compress more efficiently. Both SCSU and BOCU-1 are IANA-registered charsets.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The specification defines both the normalization mapping for the previous code point and the byte ranges used to encode the difference between the current code point and the normalized previous code point.

Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.

Any ASCII input U+0000 to U+007F excluding space U+0020 resets the encoder to U+0040. Because these values include the line-end code points U+000D and U+000A, which are encoded as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, whereas for SCSU it can affect the entire document.

BOCU-1 offers similar robustness for input texts that lack the above-mentioned values by means of the special reset code 0xFF. When a decoder finds this octet, it resets its state to U+0040, as for a line end. The BOCU-1 specification does not recommend the use of 0xFF reset bytes, because it conflicts with other BOCU-1 design goals, notably the binary order.

The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded text, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state U+0040 to U+FEC0. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.

In theory UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF.

Excluding the thirteen "protected" code points encoded as single octets, BOCU-1 can use 256 - 13 = 243 octets in multi-byte encodings. BOCU-1 needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte 0xFF is not "protected" and can occur as a trail byte.
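The difference encoding described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not a validated implementation. The function names (`pack_diff`, `normalize`, `encode`) are hypothetical. The split of lead bytes (64 single-byte differences on each side of a middle byte 0x90, then 43 and 3 lead bytes for the two- and three-byte forms) and the normalization values for Hiragana, CJK ideographs, and Hangul are not given in this article; they are recalled from the BOCU-1 specification and should be checked against it. The excluded byte values, the initial state, and the example byte sequences come from the text above.

```python
# Sketch of a BOCU-1 encoder; constants marked "assumed" are not from the article.
PROTECTED = bytes.fromhex("00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20")
TRAIL = [b for b in range(256) if b not in PROTECTED]  # trail bytes, ascending
BASE = len(TRAIL)                                      # 256 - 13 = 243

INITIAL = 0x40          # initial and reset state (from the article)
MIDDLE = 0x90           # assumed: lead byte that encodes a difference of 0
SINGLE = 64             # assumed: single-byte differences on each side of MIDDLE
LEAD_2, LEAD_3 = 43, 3  # assumed: lead bytes per direction, 2- and 3-byte forms

# Largest/smallest difference reachable with 1, 2 and 3 bytes.
POS_1 = SINGLE - 1
POS_2 = POS_1 + LEAD_2 * BASE
POS_3 = POS_2 + LEAD_3 * BASE ** 2
NEG_1 = -SINGLE
NEG_2 = NEG_1 - LEAD_2 * BASE
NEG_3 = NEG_2 - LEAD_3 * BASE ** 2


def pack_diff(diff):
    """Encode one signed difference as a lead byte plus 0 to 3 trail bytes."""
    if NEG_1 <= diff <= POS_1:
        return bytes([MIDDLE + diff])
    if diff > 0:
        if diff <= POS_2:
            rest, lead, count = diff - (POS_1 + 1), MIDDLE + SINGLE, 1
        elif diff <= POS_3:
            rest, lead, count = diff - (POS_2 + 1), MIDDLE + SINGLE + LEAD_2, 2
        else:
            rest, lead, count = (diff - (POS_3 + 1),
                                 MIDDLE + SINGLE + LEAD_2 + LEAD_3, 3)
    else:
        if diff >= NEG_2:
            rest, lead, count = diff - NEG_1, MIDDLE - SINGLE, 1
        elif diff >= NEG_3:
            rest, lead, count = diff - NEG_2, MIDDLE - SINGLE - LEAD_2, 2
        else:
            rest, lead, count = (diff - NEG_3,
                                 MIDDLE - SINGLE - LEAD_2 - LEAD_3, 3)
    trail = []
    for _ in range(count):
        # Base-243 digits, least significant first; divmod floors, so
        # negative differences borrow from the lead byte.
        rest, digit = divmod(rest, BASE)
        trail.append(TRAIL[digit])
    return bytes([lead + rest, *reversed(trail)])


def normalize(c):
    """Assumed normalization of the previous code point (recalled from the spec)."""
    if 0x3040 <= c <= 0x309F:       # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:       # CJK ideographs
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:       # Hangul syllables
        return 0xC1D1
    return (c & ~0x7F) + 0x40       # middle of the 128-code-point block


def encode(text):
    """Encode a str; code point validity (e.g. surrogates) is not checked."""
    prev, out = INITIAL, bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            out.append(c)           # U+0000..U+0020 are written as is
            if c != 0x20:
                prev = INITIAL      # controls reset the state; space does not
        else:
            out += pack_diff(c - prev)
            prev = normalize(c)     # ASCII input normalizes to U+0040
    return bytes(out)
```

Under these assumptions the sketch reproduces the examples in this section: `pack_diff(0x1156B).hex()` gives `fc06ff`, `pack_diff(0x1156C).hex()` gives `fc1001`, and `encode("\ufeff").hex()` gives `fbee28`, after which the state is `normalize(0xFEFF) == 0xFEC0`.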

Patent

The general BOCU algorithm is covered by United States Patent #6,737,994, which also mentions the specific BOCU-1 implementation. IBM, which employed both of the inventors of BOCU-1 at the time it was created, states in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" must contact IBM to request a royalty-free license. BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions.

By contrast, IBM also filed for a patent on UTF-EBCDIC, but it chose in that case to make the documentation and encoding scheme “freely available to anyone concerned towards making the transformation format as part of the UCS standards,” instead of requiring implementers to request a license.
