Binary Ordered Compression For Unicode

	Binary Ordered Compression For Unicode Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of the Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note. For comparison, SCSU was adopted as a standard Unicode compression scheme with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME “text” media types. For example, SCSU cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compress more efficiently. Both SCSU and BOCU-1 are IANA-registered charsets. Details All numbers in this section are hexadecimal, and all ranges a ...
	MIME Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message bodies may consist of multiple parts, and header information may be specified in non-ASCII character sets. Email messages with MIME formatting are typically transmitted with standard protocols, such as the Simple Mail Transfer Protocol (SMTP), the Post Office Protocol (POP), and the Internet Message Access Protocol (IMAP). The MIME standard is specified in a series of Requests for Comments (RFCs), as is its integration with SMTP email. Although the MIME formalism was designed mainly for SMTP, its content types are also important in other communication protocols. In the HyperText Transfer Protocol (HTTP) for the World Wide Web, servers insert a MIME header field at the beginning of any Web transmission. Clients ...
	Lexicographical Order In mathematics, the lexicographic or lexicographical order (also known as lexical order, or dictionary order) is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a totally ordered set. There are several variants and generalizations of the lexicographical ordering. One variant applies to sequences of different lengths by comparing the lengths of the sequences before considering their elements. Another variant, widely used in combinatorics, orders subsets of a given finite set by assigning a total order to the finite set, and converting subsets into increasing sequences, to which the lexicographical order is applied. A generalization defines an order on a Cartesian product of partially ordered sets; this order is a total order if and only if all factors of the Cartesian product are totally ordered. Motivation and definition The words in a lexicon (the set of words used in some language) have a ...
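The two orderings described in the entry above can be illustrated with a minimal Python sketch. It relies on the fact that Python compares strings and tuples lexicographically; the sample word list and the variable names are made up for illustration.

```python
# Hypothetical sample data, chosen only to show the difference between orderings.
words = ["b", "ab", "a", "abc", "ba"]

# Plain lexicographic (dictionary) order: compare symbol by symbol,
# and a proper prefix sorts before any longer sequence that extends it.
lexicographic = sorted(words)

# Length-first variant: compare lengths before considering the elements.
# The key is a tuple, and tuples themselves compare lexicographically.
length_first = sorted(words, key=lambda w: (len(w), w))

print(lexicographic)  # ['a', 'ab', 'abc', 'b', 'ba']
print(length_first)   # ['a', 'b', 'ab', 'ba', 'abc']
```

The same comparison applies to any sequences of totally ordered elements, for example `(1, 2) < (1, 3)`.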
	International Components For Unicode International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and sponsored, supported, and used by IBM and many other companies. ICU provides the following services: Unicode text handling, full character properties, and character set conversions; Unicode regular expressions; full Unicode sets; character, word, and line boundaries; language-sensitive collation and searching; normalization, upper and lowercase conversion, and script transliterations; comprehensive locale data and resource bundle architecture via the Common Locale Data Repository (CLDR); multiple calendars and time zones; and rule-based formatting and parsing of dates ...
	Character Encoding Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map". Early character codes associated with the optical or electrical telegraph could only represent a subset of the characters used in written languages, sometimes restricted to upper case letters, numerals and some punctuation only. The low cost of digital representation of data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages. Character enc ...
	UTF-EBCDIC UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using one to five one-byte (8-bit) code units (in contrast to a maximum of four for UTF-8). It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16. To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first (creating what the specification calls an I8 sequence). The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to corresponding EBCDIC control codes. In order to achieve ...
	Intellectual Property Intellectual property (IP) is a category of property that includes intangible creations of the human intellect. There are many types of intellectual property, and some countries recognize more than others. The best-known types are patents, copyrights, trademarks, and trade secrets. The modern concept of intellectual property developed in England in the 17th and 18th centuries. The term "intellectual property" began to be used in the 19th century, though it was not until the late 20th century that intellectual property became commonplace in the majority of the world's legal systems ("property as a common descriptor of the field probably traces to the foundation of the World Intellectual Property Organization (WIPO) by the United Nations", in Mark A. Lemley, "Property, Intellectual Property, and Free Riding", Texas Law Review, 2005, Vol. 83:1031, page 1033, footnote 4). The main purpose of intellectual property law is to encourage the creation of a wide variety of intellectual goo ...
	United States Patent Law Under United States law, a patent is a right granted to the inventor of (1) a process, machine, article of manufacture, or composition of matter (2) that is new, useful, and non-obvious. A patent is the right to exclude others, for a limited time (usually 20 years), from profiting from a patented technology without the consent of the patent holder. Specifically, it is the right to exclude others from making, using, selling, offering for sale, importing, inducing others to infringe, applying for an FDA approval, and/or offering a product specially adapted for practice of the patent. United States patent law is codified in Title 35 of the United States Code and authorized by the U.S. Constitution, in Article One, section 8, clause 8. Patent law is designed to encourage inventors to disclose their new technology to the world by offering the incentive of a limited-time monopoly on the technology. For U.S. utility patents, this limited-time term of patent i ...
	Modulo Operation In computing, the modulo operation returns the remainder or signed remainder of a division, after one number is divided by another (called the modulus of the operation). Given two positive numbers a and n, a modulo n (often abbreviated as a mod n) is the remainder of the Euclidean division of a by n, where a is the dividend and n is the divisor. For example, the expression "5 mod 2" would evaluate to 1, because 5 divided by 2 has a quotient of 2 and a remainder of 1, while "9 mod 3" would evaluate to 0, because 9 divided by 3 has a quotient of 3 and a remainder of 0; there is nothing to subtract from 9 after multiplying 3 times 3. Although typically performed with a and n both being integers, many computing systems now allow other types of numeric operands. The range of values for an integer modulo operation of n is 0 to n - 1 inclusive (a mod 1 is always 0; a mod 0 is undefined, possibly resulting in a division by zero error in some programming languages). See Modular arithmetic for an older and related c ...
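The worked examples in the modulo entry can be checked with a short Python sketch. It uses Python's built-in `%` operator and `divmod`; the helper name `remainders` is hypothetical and exists only for this illustration. Note that Python's `%` takes the sign of the divisor, which is only one of the conventions for signed remainders.

```python
import math

# "5 mod 2": quotient 2, remainder 1.
quotient, remainder = divmod(5, 2)
print(quotient, remainder)  # 2 1

# "9 mod 3": 9 is an exact multiple of 3, so the remainder is 0.
print(9 % 3)  # 0


def remainders(n):
    """Collect every value that a % n takes for non-negative a (hypothetical helper)."""
    return {a % n for a in range(10 * n)}


# For a positive integer n, the results range over 0 .. n - 1.
print(sorted(remainders(4)))  # [0, 1, 2, 3]

# Signed operands: conventions differ between operators and languages.
print(-7 % 3)              # 2    (Python's %: result has the divisor's sign)
print(math.fmod(-7, 3))    # -1.0 (C-style: result has the dividend's sign)

# A zero modulus is undefined; Python raises an exception.
try:
    5 % 0
except ZeroDivisionError:
    print("modulo by zero is an error in Python")
```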
	UTF-16 UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding, now known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed. UTF-16 is used by systems such as the Microsoft Windows API, the Java programming language and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files on Microsoft Windows. It is rarely used for files on Unix-like systems. UTF-16 is often claimed to be more space-efficient than UTF-8 for East Asian languages, since it uses two bytes for characters that take 3 bytes in UTF-8. Since real text contains many s ...
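The "one or two 16-bit code units" rule can be sketched in Python as follows. The entry above is truncated before it describes the mechanism, so the surrogate-pair arithmetic here is supplied from general knowledge of UTF-16 rather than from the text, and the function name `utf16_code_units` is hypothetical.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units for one code point (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit code unit.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves
    # and place them in the high and low surrogate ranges.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


print([hex(u) for u in utf16_code_units(0x20AC)])   # ['0x20ac']
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']

# The count of valid code points quoted in the text follows from this design:
# the full code space minus the 2,048 reserved surrogate values.
print(0x110000 - 0x800)  # 1112064
```

The result can be cross-checked against Python's own codec, for example `chr(0x1F600).encode("utf-16-be")`.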
	Universal Character Set The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard), which is the basis of many character encodings, improving as characters from previously unrepresented writing systems are added. The UCS has over 1.1 million possible code points available for use/allocation, but only the first 65,536, which make up the Basic Multilingual Plane (BMP), had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2006 that all software sold in its jurisdiction would have to support GB 18030. This required software intended for sale in the PRC to move beyond the BMP. The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimise conflicts with other encoding forms. The ...
	UTF-1 UTF-1 is a method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance, Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8. Design Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. All ASCII code points are a single byte (some code points beyond ASCII are also single bytes). UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: a by ...
	Byte-order Mark The byte order mark (BOM) is a particular usage of a special Unicode character whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: the byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings; the fact that the text stream's encoding is Unicode, to a high level of confidence; and which Unicode character encoding is used. BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream. Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a noncharacter Unicode code point if its bytes are swapped. Hence, the process accessi ...
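How a reader might use the BOM as a magic number can be sketched in Python with the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the returned encoding labels are assumptions made for this illustration, not part of any standard API, and a real detector would need a fallback for streams without a BOM.

```python
import codecs

# Longer signatures are checked first: the UTF-32 little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
encoding, skip = sniff_bom(sample)
print(encoding, skip)                  # utf-16-be 2
print(sample[skip:].decode(encoding))  # hi
print(sniff_bom(b"plain ASCII"))       # (None, 0)
```

One known ambiguity of this approach: UTF-16 little-endian text that begins with a BOM followed by a NUL character has the same leading bytes as the UTF-32 little-endian BOM, so the sketch would report UTF-32 for it.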