This article describes and classifies the Unicode characters that may validly appear in XML.

XML 1.0

Unicode code points in the following ranges are valid in XML 1.0 documents:

* U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
* U+0020–U+D7FF, U+E000–U+FFFD: this excludes ''some'' (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
* U+10000–U+10FFFF: this includes ''all'' code points in supplementary planes, including non-characters.

The preceding code point ranges contain the following controls, which are only valid in certain contexts in XML 1.0 documents and whose usage is restricted and highly discouraged:

* U+007F–U+0084, U+0086–U+009F: this includes the DEL control character (U+007F) and all but one C1 control.
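The XML 1.0 ranges listed above translate directly into a range lookup. The following is a minimal sketch, not a reference implementation; the names `XML10_VALID_RANGES`, `XML10_DISCOURAGED_CONTROLS`, `is_valid_xml10` and `is_discouraged_control_xml10` are hypothetical and chosen only for illustration.

```python
# Inclusive (low, high) code point ranges, copied from the XML 1.0 list above.
XML10_VALID_RANGES = [
    (0x0009, 0x000A),      # TAB, LF
    (0x000D, 0x000D),      # CR
    (0x0020, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]

# Valid but discouraged controls in XML 1.0 (note that U+0085 is skipped).
XML10_DISCOURAGED_CONTROLS = [
    (0x007F, 0x0084),
    (0x0086, 0x009F),
]


def _in_ranges(cp, ranges):
    """Return True if code point cp falls inside any inclusive range."""
    return any(low <= cp <= high for low, high in ranges)


def is_valid_xml10(cp):
    """True if the code point may appear in an XML 1.0 document."""
    return _in_ranges(cp, XML10_VALID_RANGES)


def is_discouraged_control_xml10(cp):
    """True if the code point is valid in XML 1.0 but a discouraged control."""
    return _in_ranges(cp, XML10_DISCOURAGED_CONTROLS)
```

For example, `is_valid_xml10(ord("\t"))` is true, while `is_valid_xml10(0x000B)` and `is_valid_xml10(0xD800)` are both false.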

XML 1.1

Unicode code points in the following ranges are always valid in XML 1.1 documents:

* U+0001–U+D7FF, U+E000–U+FFFD: this includes most C0 and C1 control characters, but excludes ''some'' (not all) non-characters in the BMP (surrogates, U+FFFE and U+FFFF are forbidden);
* U+10000–U+10FFFF: this includes ''all'' code points in supplementary planes, including non-characters.

The preceding code point ranges contain the following controls, which are only valid in certain contexts in XML 1.1 documents and whose usage is restricted and highly discouraged:

* U+0001–U+0008, U+000B–U+000C, U+000E–U+001F: this includes ''most'' (not all) C0 control characters;
* U+007F–U+0084, U+0086–U+009F: this includes the DEL control character (U+007F) and all but one C1 control.
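The two XML 1.1 lists above can be combined into a three-way classification. This is a sketch under the assumption that only the ranges given in this section matter; `classify_xml11` and the two constant names are hypothetical, and the label "unrestricted" reflects these lists only, not the separate guidance on non-characters.

```python
# Inclusive (low, high) ranges, copied from the XML 1.1 lists above.
XML11_VALID_RANGES = [
    (0x0001, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]

XML11_RESTRICTED_RANGES = [
    (0x0001, 0x0008),
    (0x000B, 0x000C),
    (0x000E, 0x001F),
    (0x007F, 0x0084),
    (0x0086, 0x009F),
]


def classify_xml11(cp):
    """Classify a code point as 'invalid', 'restricted' or 'unrestricted'."""
    def within(ranges):
        return any(low <= cp <= high for low, high in ranges)

    if not within(XML11_VALID_RANGES):
        return "invalid"
    if within(XML11_RESTRICTED_RANGES):
        return "restricted"
    return "unrestricted"
```

Checking validity first keeps the restricted list simple: it never has to repeat the exclusion of U+0000, the surrogates, U+FFFE or U+FFFF.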

Characters allowed but discouraged

In addition, some code points, although valid in all XML 1.0 and XML 1.1 documents, are restricted and discouraged in both versions of XML because they are permanently assigned to non-characters in Unicode and ISO/IEC 10646. Some XML parsers may even signal them as invalid in their character set decoder, and XML documents containing them may not pass through some restricted interfaces or may not be interchangeable. These non-characters can still be encoded in the standard UTFs (such as UTF-8), because these UTFs only exclude the code points assigned to surrogates. The code points concerned are:

* U+FDD0–U+FDEF
* U+1FFFE–U+1FFFF, U+2FFFE–U+2FFFF, U+3FFFE–U+3FFFF, U+4FFFE–U+4FFFF, U+5FFFE–U+5FFFF, U+6FFFE–U+6FFFF, U+7FFFE–U+7FFFF, U+8FFFE–U+8FFFF, U+9FFFE–U+9FFFF, U+AFFFE–U+AFFFF, U+BFFFE–U+BFFFF, U+CFFFE–U+CFFFF, U+DFFFE–U+DFFFF, U+EFFFE–U+EFFFF, U+FFFFE–U+FFFFF, U+10FFFE–U+10FFFF.

Note that the code point U+0000, assigned to the null control character, is the only character encoded in Unicode and ISO/IEC 10646 that is always invalid in any XML 1.0 or XML 1.1 document.

Conversely, the code point U+0085 is a valid control character in Unicode and ISO/IEC 10646, as well as in XML 1.0 and XML 1.1 documents (in all contexts), and its usage is not discouraged: it is treated as whitespace in many XML contexts, or as a line-break control similar to U+000D and U+000A in preformatted text in some XML applications.
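The non-character list above follows a regular pattern: one block of 32 code points in the BMP, plus the last two code points of every supplementary plane. The sketch below tests for that pattern and illustrates the point about UTFs using Python's built-in UTF-8 codec; the function name `is_discouraged_noncharacter` is hypothetical.

```python
def is_discouraged_noncharacter(cp):
    """True for the non-characters that are valid but discouraged in XML.

    U+FFFE and U+FFFF are deliberately not covered: they are forbidden
    in XML outright rather than merely discouraged.
    """
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of each supplementary plane (planes 1 to 16).
    return 0x10000 <= cp <= 0x10FFFF and (cp & 0xFFFF) >= 0xFFFE


# A non-character such as U+1FFFE round-trips through UTF-8 ...
encoded = chr(0x1FFFE).encode("utf-8")
roundtrip_ok = encoded.decode("utf-8") == chr(0x1FFFE)

# ... whereas a lone surrogate is rejected by the strict UTF-8 encoder.
try:
    chr(0xD800).encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True
```

The bit mask `cp & 0xFFFF` extracts the position of the code point within its plane, so one comparison covers all sixteen supplementary planes.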

Non-restricted characters

For these reasons, and for better interoperability, the non-restricted repertoire that can be used in all versions of XML and in all contexts (as permitted by the XML syntax) contains only code points that are permanently assigned to characters (excluding non-characters) or reserved for possible future encoding in Unicode and ISO/IEC 10646, and it excludes the restricted repertoire. These code points are:

* U+0009, U+000A, U+000D: these are the only C0 control characters accepted in both XML 1.0 and XML 1.1 (they are treated as whitespace or line breaks in many contexts);
* U+0020–U+007E: these are all the non-control characters in the Basic Latin block (the "graphic" subset of US-ASCII); this excludes the DEL control character (U+007F);
* U+0085: this is the only C1 control character accepted in both XML 1.0 and XML 1.1 (it is treated as whitespace or a line break in many contexts);
* U+00A0–U+D7FF, U+E000–U+FDCF, U+FDF0–U+FFFD: this includes all the other characters in the BMP, excluding surrogates and ''all'' non-characters;
* U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, U+40000–U+4FFFD, U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, U+80000–U+8FFFD, U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, U+C0000–U+CFFFD, U+D0000–U+DFFFD, U+E0000–U+EFFFD, U+F0000–U+FFFFD, U+100000–U+10FFFD: this excludes ''all'' non-characters in supplementary planes.
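As an illustration of how the non-restricted repertoire above might be applied, the following sketch scans a string and reports every character that falls outside it. The supplementary-plane ranges are generated rather than typed out; `NON_RESTRICTED_RANGES` and `find_restricted` are hypothetical names, not part of any XML library.

```python
# Inclusive (low, high) ranges, copied from the non-restricted list above.
NON_RESTRICTED_RANGES = [
    (0x0009, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x0085, 0x0085),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
]
# Planes 1 to 16: everything except the last two code points of each plane.
NON_RESTRICTED_RANGES += [
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)
]


def find_restricted(text):
    """Return (index, 'U+XXXX') pairs for characters outside the repertoire."""
    problems = []
    for index, char in enumerate(text):
        cp = ord(char)
        if not any(low <= cp <= high for low, high in NON_RESTRICTED_RANGES):
            problems.append((index, f"U+{cp:04X}"))
    return problems
```

A producer that wants maximally interchangeable output could run such a check before serializing: `find_restricted("a\x7fb")` reports the DEL character at index 1, while ordinary text containing only tabs, line breaks and graphic characters yields an empty list.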
