UTF-7
   HOME

TheInfoList



OR:

UTF-7 (7-
bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
Unicode Transformation Format Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
) is an obsolete variable-length character encoding for representing
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
text using a stream of
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
characters. It was originally intended to provide a means of encoding
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
text for use in
Internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, pub ...
E-mail Electronic mail (email or e-mail) is a method of exchanging messages ("mail") between people using electronic devices. Email was thus conceived as the electronic ( digital) version of, or counterpart to, mail, at a time when "mail" meant ...
messages that was more efficient than the combination of
UTF-8 UTF-8 is a variable-width encoding, variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit'' ...
with
quoted-printable Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters (alphanumeric and the equals sign =) to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Hist ...
. UTF-7 (according to its RFC) isn't a "
Unicode Transformation Format Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
", as the definition can only encode code points in the BMP (the first 65536 Unicode code points, which does not include
emojis An emoji ( ; plural emoji or emojis) is a pictogram, logogram, ideogram or smiley embedded in text and used in electronic messages and web pages. The primary function of emoji is to fill in emotional cues otherwise missing from typed conversa ...
and many other characters). However if a UTF-7 translator is to/from
UTF-16 UTF-16 (16-bit computing, 16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variab ...
then it can (and probably does) encode each surrogate half as though it was a 16-bit code point, and thus can encode all code points. It is unclear if other UTF-7 software (such as translators to UTF-32 or UTF-8) support this. UTF-7 has never has been an official standard of the
Unicode Consortium The Unicode Consortium (legally Unicode, Inc.) is a 501(c)(3) non-profit organization incorporated and based in Mountain View, California. Its primary purpose is to maintain and publish the Unicode Standard which was developed with the intenti ...
. It is known to have security issues, which is why software has been changed to disable its use. It is prohibited in
HTML 5 HTML5 is a markup language used for structuring and presenting content on the World Wide Web. It is the fifth and final major HTML version that is a World Wide Web Consortium (W3C) recommendation. The current specification is known as the HTML L ...
.


Motivation

MIME Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message ...
, the modern standard of E-mail format, forbids encoding of headers using byte values above the ASCII range. Although MIME allows encoding the message body in various
character set Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
s (broader than ASCII), the underlying transmission infrastructure (
SMTP The Simple Mail Transfer Protocol (SMTP) is an Internet standard communication protocol for electronic mail transmission. Mail servers and other message transfer agents use SMTP to send and receive mail messages. User-level email clients typical ...
, the main E-mail transfer standard) is still not guaranteed to be
8-bit clean ''8-bit clean'' is an attribute of computer systems, communication channels, and other devices and software, that handle 8-bit character encodings correctly. Such encoding include the ISO 8859 series and the UTF-8 encoding of Unicode. History ...
. Therefore, a non-trivial content transfer encoding has to be applied in case of doubt. Unfortunately
base64 In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits. Common to all bina ...
has a disadvantage of making even
US-ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
characters unreadable in non-MIME clients. On the other hand, UTF-8 combined with
quoted-printable Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters (alphanumeric and the equals sign =) to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Hist ...
produces a very size-inefficient format requiring 6–9 bytes for non-ASCII characters from the BMP and 12 bytes for characters outside the BMP. Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without using an underlying MIME transfer encoding, but still must be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words identifying the character set. Since encoded words force use of either
quoted-printable Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters (alphanumeric and the equals sign =) to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Hist ...
or
base64 In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits. Common to all bina ...
, UTF-7 was designed to avoid using the = sign as an escape character to avoid double escaping when it is combined with quoted-printable (or its variant, the RFC 2047/1522 ?Q?-encoding of headers). UTF-7 is generally not used as a native representation within applications as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or base64, the now defunct
Internet Mail Consortium The Internet Mail Consortium (IMC) was an organization between 1996 and 2002 that claimed to be the only international organization focused on cooperatively managing and promoting the rapidly expanding world of electronic mail on the Internet. The ...
recommended against its use.
8BITMIME The Simple Mail Transfer Protocol (SMTP) is an Internet standard communication protocol for electronic mail transmission. Mail servers and other message transfer agents use SMTP to send and receive mail messages. User-level email clients typical ...
has also been introduced, which reduces the need to encode message bodies in a 7-bit format. A modified form of UTF-7 (sometimes dubbed 'mUTF-7') is currently used in the
IMAP In computing, the Internet Message Access Protocol (IMAP) is an Internet standard protocol used by email clients to retrieve email messages from a mail server over a TCP/IP connection. IMAP is defined by . IMAP was designed with the goal of per ...
e-mail retrieval protocol for mailbox names.


Description

UTF-7 was first proposed as an experimental protocol in RFC 1642, ''A Mail-Safe Transformation Format of Unicode''. This
RFC RFC may refer to: Computing * Request for Comments, a memorandum on Internet standards * Request for change, change management * Remote Function Call, in SAP computer systems * Rhye's and Fall of Civilization, a modification for Sid Meier's Civ ...
has been made obsolete by RFC 2152, an informational RFC which never became a standard. As RFC 2152 clearly states, the RFC "does not specify an Internet standard of any kind". Despite this, RFC 2152 is quoted as the definition of UTF-7 in the IANA's list of charsets. Neither is UTF-7 a Unicode Standard. ''The Unicode Standard 5.0'' only lists UTF-8, UTF-16 and UTF-32. There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7. Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range –U+007E except ~ \ + and space (the characters and being excluded due to being redefined in "variants of ASCII" such as
JIS-Roman Code page 895 (CCSID 895) is a 7-bit character set and is Japan's national ISO 646 variant. It is the Roman set (first or left half) of the JIS X 0201 (formerly JIS C 6220) Japanese Standard and is variously called Japan 7-Bit Latin, JISCII, JIS ...
). Using the optional direct characters reduces size and enhances human readability but also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields. Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail. The plus sign (+) ''may'' be encoded as +-. Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into two surrogates), and then in
modified Base64 In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits. Common to all bina ...
. The start of these blocks of modified Base64-encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII
hyphen-minus The hyphen-minus is the most commonly used type of hyphen, widely used in digital documents. It is the only character that looks like a minus sign or a dash in many character sets such as ASCII or on most keyboards, so it is also used as such. ...
) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.


Examples

* "Hello, World!" is encoded as "Hello, World+ACE-" * "1 + 1 = 2" is encoded as "1 +- 1 +AD0- 2" * "£1" is encoded as "+AKM-1". The Unicode code point for the
pound sign The pound sign is the symbol for the pound unit of sterling – the currency of the United Kingdom and previously of Great Britain and of the Kingdom of England. The same symbol is used for other currencies called pound, such as the Gibralta ...
is U+00A3 which converts into
modified Base64 In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits. Common to all bina ...
as in the table below. There are two bits left over, which are padded to 0.


Algorithm for encoding and decoding


Encoding

First, an encoder must decide which characters to represent directly in ASCII form, which + has to be escaped as +-, and which to place in blocks of Unicode characters. The expansion cost of UTF-7 can be high: for example, the character sequence U+10FFFF U+0077 U+10FFFF is 9 bytes in UTF-8, but 17 bytes in UTF-7. (At worst, treating every codepoint as a sequence in its own right produces the maximum expansion of 5x, e.g. when encoding @@ as +AEA-+AEA-.) Each Unicode sequence must be encoded using the following procedure, then surrounded by the appropriate delimiters. Using the £† (U+00A3 U+2020) character sequence as an example:


Decoding

First an encoded data must be separated into plain ASCII text chunks (including +es followed by a dash) and nonempty Unicode blocks as mentioned in the description section. Once this is done, each Unicode block must be decoded with the following procedure (using the result of the encoding example above as our example) #Express each Base64 code as the bit sequence it represents:
AKMgIA → 000000 001010 001100 100000 001000 000000 #Regroup the binary into groups of sixteen bits, starting from the left:
000000 001010 001100 100000 001000 000000 → 0000000010100011 0010000000100000 0000 #If there is an incomplete group at the end containing only zeros, discard it (if the incomplete group contains any ones, the code is invalid):
0000000010100011 0010000000100000 #Each group of 16 bits is a character's Unicode (UTF-16) number and can be expressed in other forms:
0000 0000 1010 0011 ≡ 0x00A3 ≡ 16310


Byte order mark

A byte order mark (BOM) is an optional special byte sequence at the very start of a stream or file that, without being data itself, indicates the encoding used for the data that follows; it can be used in the absence of metadata that denotes the encoding. For a given encoding scheme, it's that scheme's representation of Unicode code point U+FEFF. While it's typically a single, fixed byte sequence, in UTF-7 four variations may appear, because the last 2 bits of the 4th byte of the UTF-7 encoding of U+FEFF belong to the ''following'' character, resulting in 4 possible bit patterns and therefore 4 different possible bytes in the 4th position. See the UTF-7 entry in the table of Unicode byte order marks.


Security

UTF-7 allows multiple representations of the same source string. In particular, ASCII characters can be represented as part of Unicode blocks. As such, if standard ASCII-based escaping or validation processes are used on strings that may be later interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7. Older versions of
Internet Explorer Internet Explorer (formerly Microsoft Internet Explorer and Windows Internet Explorer, commonly abbreviated IE or MSIE) is a series of graphical user interface, graphical web browsers developed by Microsoft which was used in the Microsoft Wind ...
can be tricked into interpreting the page as UTF-7. This can be used for a
cross-site scripting Cross-site scripting (XSS) is a type of security vulnerability that can be found in some web applications. XSS attacks enable attackers to inject client-side scripts into web pages viewed by other users. A cross-site scripting vulnerability may ...
attack as the < and > marks can be encoded as +ADw- and +AD4- in UTF-7, which most validators let through as simple text. UTF-7 is considered obsolete, at least for Microsoft software (.NET), with code paths previously supporting it intentionally broken (to prevent security issues) in .NET 5, in 2020.


References


See also

* Comparison of Unicode encodings {{DEFAULTSORT:Utf-07 Character encoding Unicode Transformation Formats