Valid Characters In XML
   HOME





Valid Characters In XML
This article describes and classifies the Unicode characters that may validly appear in XML. XML 1.0 Unicode code points in the following ranges are valid in XML 1.0 documents: * U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0; * U+0020–U+D7FF, U+E000–U+FFFD: this excludes ''some'' (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden); * U+10000–U+10FFFF: this includes ''all'' code points in supplementary planes, including non-characters. The preceding code points ranges contain the following controls which are only valid in certain contexts in XML 1.0 documents, and whose usage is restricted and highly discouraged: * U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all but one C1 control. XML 1.1 Unicode code points in the following code point ranges are always valid in XML 1.1 documents: * U+0001–U+D7FF, U+E000–U+FFFD: this includes most C0 and C1 control characters, but exclud ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


picture info

Unicode
Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Character (computing), characters and 168 script (Unicode), scripts used in various ordinary, literary, academic, and technical contexts. Unicode has largely supplanted the previous environment of a myriad of incompatible character sets used within different locales and on different computer architectures. The entire repertoire of these sets, plus many additional characters, were merged into the single Unicode set. Unicode is used to encode the vast majority of text on the Internet, including most web pages, and relevant Unicode support has become a common consideration in contemporary software development. Unicode is ultimately capable of encoding more than 1.1 million characters. The Unicode character repertoire is synchronized with Univers ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


C0 And C1 Control Codes
The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received. C0 codes are the range 00 HEX–1FHEX and the default C0 set was originally defined in ISO 646 (ASCII). C1 codes are the range 80HEX–9FHEX and the default C1 set was originally defined in ECMA-48 (harmonized later with ISO 6429). The ISO/IEC 2022 system of specifying control and graphic characters allows other C0 and C1 sets to be available for specialized applications, but they are rarely used. C0 controls ASCII defines 32 control characters, plus the DEL character. This large number of codes was desirable at the time, as multi-byte controls would require implementation of a state machine in the terminal, which was very difficult with contemporary electr ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


picture info

Universal Character Set Characters
The Unicode Consortium and the ISO/IEC JTC 1/SC 2/ WG 2 jointly collaborate on the list of the characters in the Universal Coded Character Set. The Universal Coded Character Set, most commonly called the Universal Character Set ( UCS, official designation: ISO/ IEC 10646), is an international standard to map characters, discrete symbols used in natural language, mathematics, music, and other domains, to unique machine-readable data values. By creating this mapping, the UCS enables computer software vendors to interoperate, and transmit— interchange— UCS-encoded text strings from one to another. Because it is a ''universal'' map, it can be used to represent multiple languages at the same time. This avoids the confusion of using multiple legacy character encodings, which can result in the same sequence of codes having multiple interpretations depending on the character encoding in use, resulting in ''mojibake'' if the wrong one is chosen. UCS has a potential capacity o ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


picture info

Plane (Unicode)
In the Unicode standard, a plane is a contiguous group of 65,536 (216) code points. There are 17 planes, identified by the numbers 0 to 16, which corresponds with the possible values 00–1016 of the first two positions in six position hexadecimal format (U+''hhhhhh''). Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly used characters. The higher planes 1 through 16 are called "supplementary planes". The last code point in Unicode is the last code point in plane 16, U+10FFFF. As of Unicode version , five of the planes have assigned code points (characters), and seven are named. The limit of 17 planes is due to UTF-16, which can encode 220 code points (16 planes) as pairs of words, plus the BMP as a single word. UTF-8 was designed with a much larger limit of 231 (2,147,483,648) code points (32,768 planes), and would still be able to encode 221 (2,097,152) code points (32 planes) even under the current limit of 4 bytes. The 17 planes can accommodate 1,114 ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


picture info

UTF-8
UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode Transformation Format 8-bit''. Almost every webpage is transmitted as UTF-8. UTF-8 supports all 1,112,064 valid Unicode code points using a variable-width encoding of one to four one- byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file. Most software designed for any extended ASCII can read and write UTF-8, and this results in fewer internationalization issues than any alternative text encoding. UTF-8 is dominant for all countries/languages on the internet, with 99% global ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


picture info

U+0085
U, or u, is the twenty-first letter and the fifth vowel letter of the Latin alphabet, used in the modern English alphabet and the alphabets of other western European languages and others worldwide. Its name in English is ''u'' (pronounced ), plural ''ues''. Name In English, the name of the letter is the "long U" sound, pronounced . In most other languages, its name matches the letter's pronunciation in open syllables. History U derives from the Semitic waw, as does F, and later, Y, W, and V. Its oldest ancestor goes back to Egyptian hieroglyphs, and is probably from a hieroglyph of a mace or fowl, representing the sound or the sound . This was borrowed to Phoenician, where it represented the sound , and seldom the vowel . In Greek, two letters were adapted from the Phoenician waw. The letter was adapted, but split in two, with Digamma or wau being adapted to represent , and the second one being Upsilon , which was originally adapted to represent , later fronte ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


List Of XML And HTML Character Entity References
In SGML, HTML and XML documents, the logical constructs known as ''character data'' and ''attribute values'' consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a ''character reference'', of which there are two types: a ''numeric character reference'' and a ''character entity reference''. This article lists the character entity references that are valid in HTML and XML documents. A character entity reference refers to the content of a named entity. An entity declaration is created in XML, SGML and HTML documents (before HTML5) by using the syntax in a document type definition (DTD). Character reference overview In HTML and XML, a ''numeric character reference'' refers to a character by its Universal Coded Character Set/Unicode ''code point'', and uses the format: &#x''hhhh''; or &#''nnnn''; where the x must be lowercase in XML documents, ''hhhh'' is the code po ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]


Numeric Character Reference
A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSgml, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used in order to represent characters that are not directly encodable in a particular document (for example, because they are international characters that do not fit in the 8-bit character set being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents. Examples In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE In SG ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   [Amazon]