Valid Characters In XML
This article describes and classifies the Unicode characters that may validly appear in XML.

XML 1.0

Unicode code points in the following ranges are valid in XML 1.0 documents:
* U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
* U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
* U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

The preceding code point ranges contain the following controls, which are only valid in certain contexts in XML 1.0 documents and whose usage is restricted and highly discouraged:
* U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all but one C1 control.

XML 1.1

Unicode code points in the following code point ranges are always valid in XML 1.1 documents:
* U+0001–U+D7FF, U+E000–U+FFFD: this includes most C0 and C1 control characters, but exclud ...
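The XML 1.0 ranges listed above translate fairly directly into a lookup. The following is a minimal Python sketch, not a reference implementation: the function and constant names are hypothetical, and it covers only the XML 1.0 ranges stated in the text (the XML 1.1 list is truncated in this excerpt, so it is not modelled).

```python
# Sketch only: names below are illustrative assumptions, not part of any XML library.

# Code point ranges that are valid in XML 1.0 documents (inclusive bounds).
XML10_VALID_RANGES = [
    (0x0009, 0x000A),      # TAB and LF
    (0x000D, 0x000D),      # CR
    (0x0020, 0xD7FF),      # BMP up to, but excluding, the surrogates
    (0xE000, 0xFFFD),      # rest of the BMP, excluding U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),   # all supplementary planes
]

# Valid but restricted and highly discouraged controls.
XML10_DISCOURAGED_RANGES = [
    (0x007F, 0x0084),
    (0x0086, 0x009F),
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_valid_xml10_char(ch):
    """True if the single character ch may appear in an XML 1.0 document."""
    return _in_ranges(ord(ch), XML10_VALID_RANGES)


def is_discouraged_xml10_char(ch):
    """True if ch is one of the discouraged control characters."""
    return _in_ranges(ord(ch), XML10_DISCOURAGED_RANGES)


def first_invalid_xml10_char(text):
    """Return the index of the first invalid character, or None if all are valid."""
    for index, ch in enumerate(text):
        if not is_valid_xml10_char(ch):
            return index
    return None
```

For instance, `first_invalid_xml10_char("ab\x01c")` returns `2`, because U+0001 is not in any of the XML 1.0 ranges, while a string containing only TAB, LF, CR and printable text returns `None`.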


Unicode
Unicode, formally The Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, which is maintained by the Unicode Consortium, defines, as of version 15.0, 149,186 characters covering 161 modern and historic scripts, as well as symbols, emoji (including in colors), and non-visual control and formatting codes. Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, and most modern programming languages. The Unicode character repertoire is synchronized with ISO/IEC 10646, each being code-for-code id ...


C0 And C1 Control Codes
The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received. C0 codes are the range 00–1F (hexadecimal), and the default C0 set was originally defined in ISO 646 (ASCII). C1 codes are the range 80–9F (hexadecimal), and the default C1 set was originally defined in ECMA-48 (harmonized later with ISO 6429). The ISO/IEC 2022 system of specifying control and graphic characters allows other C0 and C1 sets to be available for specialized applications, but they are rarely used.

C0 controls

ASCII defined 32 control characters, plus the necessary extra DEL character, 7F (hexadecimal) or 01111111 (binary), needed to punch out all the holes on a paper tape and erase it. This large number of codes was desirable at the time, as multi ...
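The ranges given above (00–1F for C0, 7F for DEL, 80–9F for C1) can be checked with simple comparisons. This is a small illustrative Python sketch; the function name `control_class` and its return labels are assumptions made for the example.

```python
def control_class(ch):
    """Classify a single character as a C0 control, DEL, a C1 control, or neither.

    The labels returned here are illustrative, not standardized names.
    """
    code_point = ord(ch)
    if 0x00 <= code_point <= 0x1F:
        return "C0"
    if code_point == 0x7F:
        return "DEL"
    if 0x80 <= code_point <= 0x9F:
        return "C1"
    return "not a control"


# Example: list the control characters found in a string.
sample = "line one\nline two\x85end\x7f"
controls = [(f"U+{ord(ch):04X}", control_class(ch))
            for ch in sample if control_class(ch) != "not a control"]
print(controls)
```

Treating DEL as its own category is a design choice of this sketch: it sits outside the 00–1F range but is usually discussed together with the C0 set.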


Universal Character Set Characters
The Unicode Consortium and ISO/IEC JTC 1/SC 2/WG 2 collaborate on the list of the characters in the Universal Coded Character Set. The Universal Coded Character Set, most commonly called the Universal Character Set (UCS, official designation: ISO/IEC 10646), is an international standard that maps characters, the discrete symbols used in natural language, mathematics, music, and other domains, to unique machine-readable data values. By creating this mapping, the UCS enables computer software vendors to interoperate and to transmit (interchange) UCS-encoded text strings from one to another. Because it is a universal map, it can be used to represent multiple languages at the same time. This avoids the confusion of using multiple legacy character encodings, in which the same sequence of codes can have multiple interpretations depending on the character encoding in use, resulting in mojibake if the wrong one is chosen. UCS has a potential capacity of over ...
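The mojibake effect described above is easy to reproduce. The sketch below, in Python, encodes a string as UTF-8 and then deliberately decodes the same bytes with a legacy encoding (Latin-1 is chosen here purely as an example); the variable names are illustrative.

```python
# The same byte sequence, interpreted under two different character encodings.
original = "café"

encoded = original.encode("utf-8")    # "é" becomes the two bytes C3 A9
misread = encoded.decode("latin-1")   # each byte is read as its own character
restored = encoded.decode("utf-8")    # decoding with the right encoding round-trips

print(encoded)    # b'caf\xc3\xa9'
print(misread)    # cafÃ©
print(restored)   # café
```

The bytes never change; only the interpretation does, which is why a single universal mapping avoids the problem.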


Plane (Unicode)
In the Unicode standard, a plane is a continuous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly used characters. The higher planes 1 through 16 are called "supplementary planes". The last code point in Unicode is the last code point in plane 16, U+10FFFF. As of Unicode version , five of the planes have assigned code points (characters), and seven are named. The limit of 17 planes is due to UTF-16, which can encode 2^20 code points (16 planes) as pairs of words, plus the BMP as a single word. UTF-8 was designed with a much larger limit of 2^31 (2,147,483,648) code points (32,768 planes), and would still be able to encode 2^21 (2,097,152) code points (32 planes) even under the current limit of 4 bytes. The 17 planes can accommodate 1,114,1 ...
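The plane arithmetic above can be sketched in a few lines of Python. Because each plane holds 2^16 code points, the plane number is the code point shifted right by 16 bits; the UTF-16 limit follows from a surrogate pair carrying 20 bits (10 per word). The function names are hypothetical, chosen for this illustration.

```python
def plane_of(code_point):
    """Return the plane number (0 to 16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return code_point >> 16  # each plane spans 2**16 code points


def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points use surrogate pairs")
    offset = code_point - 0x10000        # a 20-bit value: 16 planes * 2**16
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


print(plane_of(0x0041))                                  # 0 (BMP)
print(plane_of(0x1F600))                                 # 1
print([hex(unit) for unit in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

The 16 planes reachable through surrogate pairs (2^20 code points) plus the BMP (2^16) account for exactly the 17 planes ending at U+10FFFF.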


UTF-8
UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format 8-bit. UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-length encoding with partial ASCII compatibility which lacked some features including self-synchronization and fully ASCII-compatible handling ...
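The one-to-four-byte behaviour described above can be observed directly with Python's built-in codec. The sketch below also includes a small helper, `utf8_length` (a hypothetical name), that predicts the encoded length from the code point value alone.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a code point (surrogates are not encodable)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        return 1   # ASCII range: same single byte as ASCII
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4       # supplementary planes


for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

The count of 1,112,064 encodable code points is the full range of 1,114,112 minus the 2,048 surrogates, which is why the helper rejects U+D800 through U+DFFF.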


U+0085
Newline (frequently called line ending, end of line (EOL), next line (NEL) or line break) is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

History

In the mid-1800s, long before the advent of teleprinters and teletype machines, Morse code operators or telegraphists invented and used Morse code prosigns to encode white space text formatting in formal written text messages. In particular, the Morse prosign BT (mnemonic "break text"), represented by the concatenation of the literal textual Morse codes for "B" and "T" sent without the normal inter-character spacing, is used in Morse code to encode and indicate a new line or new section in a formal text message. Later, in the age of modern teleprinters, standardized character set control codes were developed to aid in whit ...


List Of XML And HTML Character Entity References
In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents. A character entity reference refers to the content of a named entity. An entity declaration is created by using the entity declaration syntax in a Document Type Definition (DTD).

Character reference overview

A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format:
  &#nnnn;
or
  &#xhhhh;
where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn ...
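The two numeric formats described above can be recognized with a single regular expression. The following Python sketch is an illustration under stated assumptions: `decode_ncrs` is a hypothetical helper, it accepts only a lowercase `x` as XML requires, and it does not check whether the resulting character is itself allowed in XML.

```python
import re

# Matches &#nnnn; (decimal) or &#xhhhh; (hexadecimal, lowercase x only).
_NCR_PATTERN = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_ncrs(text):
    """Replace numeric character references in text with the characters they name."""
    def _replace(match):
        hex_digits, decimal_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            code_point = int(hex_digits, 16)
        else:
            code_point = int(decimal_digits)
        if code_point > 0x10FFFF:
            return match.group(0)  # out of range: leave the reference untouched
        return chr(code_point)

    return _NCR_PATTERN.sub(_replace, text)


print(decode_ncrs("a &#60; b"))         # a < b
print(decode_ncrs("&#931; = &#x3A3;"))  # Σ = Σ
```

A reference written with an uppercase `X`, such as `&#X3A3;`, is deliberately left unchanged by this sketch.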


Character Entity Reference
Character or Characters may refer to:

Arts, entertainment, and media

Literature
* Character (novel), a 1936 Dutch novel by Ferdinand Bordewijk
* Characters (Theophrastus), a classical Greek set of character sketches attributed to Theophrastus

Music
* Characters (John Abercrombie album), 1977
* Character (Dark Tranquillity album), 2005
* Character (Julia Kent album), 2013
* Character (Rachael Sage album), 2020
* Characters (Stevie Wonder album), 1987

Types of entity
* Character (arts), an agent within a work of art, including literature, drama, cinema, opera, etc.
* Character sketch or character, a literary description of a character type
* Game character (other), various types of characters in a video game or role playing game
** Player character, as above but who is controlled or whose actions are directly chosen by a player
** Non-player character, as above but not player-controlled, frequently abbreviated as NPC

Other uses in ...




Numeric Character Reference
A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSGML, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used in order to represent characters that are not directly encodable in a particular document (for example, because they are international characters that do not fit in the 8-bit character set being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.

Examples

In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma
In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE
In SGML ...
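As an illustration of how such references are formed, the Python sketch below builds the decimal and hexadecimal NCRs for the two letters mentioned above, Greek capital letter Sigma (U+03A3) and Latin capital letter AE (U+00C6). The helper names are hypothetical; the only library feature relied on is Python's standard `xmlcharrefreplace` error handler, which produces decimal references.

```python
def to_decimal_ncr(ch):
    """Decimal numeric character reference for a single character."""
    return f"&#{ord(ch)};"


def to_hex_ncr(ch):
    """Hexadecimal numeric character reference (lowercase x, as XML requires)."""
    return f"&#x{ord(ch):X};"


def escape_non_ascii(text):
    """Replace every non-ASCII character in text with a hexadecimal NCR."""
    return "".join(ch if ord(ch) < 0x80 else to_hex_ncr(ch) for ch in text)


for ch in ["Σ", "Æ"]:
    print(f"U+{ord(ch):04X}: {to_decimal_ncr(ch)} or {to_hex_ncr(ch)}")

# The standard library can do the decimal form when encoding to a narrower charset.
print("Σ".encode("ascii", "xmlcharrefreplace"))  # b'&#931;'
```

This matches the typical use described above: a character that does not fit the document's character set is written as a reference and restored by the markup-aware reader.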