A binary-to-text encoding is an encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmitting data when the channel does not allow binary data (such as email or NNTP) or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.


Overview

The basic need for a binary-to-text encoding comes from the need to communicate arbitrary binary data over preexisting communications protocols that were designed to carry only English-language human-readable text. Those protocols may be only 7-bit safe (and, within that range, avoid certain ASCII control codes), may require line breaks at certain maximum intervals, and may not preserve whitespace. Thus, only the 94 printable ASCII characters are "safe" to use to convey data.


Description

The ASCII text-encoding standard uses 7 bits to encode characters. With this it is possible to encode 128 (i.e. 2^7) unique values (0–127) to represent the alphabetic, numeric, and punctuation characters commonly used in English, plus a selection of control codes which do not represent printable characters. For example, the capital letter "A" is ASCII character 65, the numeral "2" is ASCII 50, the character "}" is ASCII 125, and the control code "carriage return" is ASCII 13. Systems based on ASCII use seven bits to represent these values digitally.

In contrast, most computers store data in memory organized in eight-bit bytes. Files that contain machine-executable code and non-textual data typically contain all 256 possible eight-bit byte values. Many computer programs came to rely on this distinction between seven-bit "text" and eight-bit "binary" data, and would not function properly if non-ASCII characters appeared in data that was expected to include only ASCII text. For example, if the value of the eighth bit is not preserved, the program might interpret a byte value above 127 as a flag telling it to perform some function.

It is often desirable, however, to be able to send non-textual data through text-based systems, such as when attaching an image file to an e-mail message. To accomplish this, the eight-bit data is encoded into seven-bit ASCII characters (generally using only alphanumeric and punctuation characters, that is, the ASCII printable characters). Upon safe arrival at its destination, it is decoded back to its eight-bit form. This process is referred to as binary-to-text encoding. Many programs perform this conversion to allow for data transport, such as PGP and GNU Privacy Guard (GPG).
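
The round trip described above can be sketched in a few lines of Python. This is only an illustration, assuming Base64 as the encoding and the standard-library `base64` module as the implementation; the sample bytes are made up for the example.

```python
import base64

# Hypothetical payload: arbitrary bytes, including values above 127
# that a 7-bit channel could not carry unchanged.
payload = bytes([0x00, 0x41, 0x7D, 0x0D, 0x80, 0xFF])

# Encode the eight-bit data into printable ASCII characters.
encoded = base64.b64encode(payload)

# Every byte of the encoded form is a printable, non-space ASCII code.
assert all(0x20 < b < 0x7F for b in encoded)

# Decoding at the destination restores the original eight-bit data.
decoded = base64.b64decode(encoded)
assert decoded == payload

print(encoded.decode("ascii"))
```

The encoded form is longer than the input (eight output characters for six input bytes here), which is the price paid for surviving a text-only channel.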


Encoding plain text

Binary-to-text encoding methods are also used as a mechanism for encoding plain text. For example:
* Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some cannot even handle every printable ASCII character.
* Other systems have limits on the number of characters that may appear between line breaks, such as the "1000 characters per line" limit of some SMTP software, which the SMTP standard allows.
* Still others add headers or trailers to the text.
* A few poorly regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including the trailing space) at the beginning of a line, used to separate mail messages in the mbox file format.

By using a binary-to-text encoding on messages that are already plain text, then decoding on the other end, one can make such systems appear to be completely transparent. This is sometimes referred to as "ASCII armoring". For example, the ViewState component of ASP.NET uses base64 encoding to safely transmit text via HTTP POST, in order to avoid delimiter collision.
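
A minimal Python sketch of armoring text that is already plain ASCII follows. It assumes Base64 via the standard-library `base64` module; the message text is hypothetical, chosen so that it contains both a line beginning with "From " and a line far longer than 1000 characters.

```python
import base64

# Hypothetical mail body: the second line starts with "From ", which
# mbox-reading software could mistake for a message separator, and the
# third line is much longer than a 1000-character line limit.
text = "Hello,\nFrom now on the report is due on Fridays.\n" + "x" * 2000 + "\n"

# encodebytes() emits Base64 with a newline after at most 76 output
# characters, so the long line is broken up as well.
armored = base64.encodebytes(text.encode("ascii")).decode("ascii")

lines = armored.splitlines()
# The Base64 alphabet contains no space, so no line can begin with "From ".
assert not any(line.startswith("From ") for line in lines)
assert max(len(line) for line in lines) <= 76

# Decoding on the other end restores the text exactly.
restored = base64.decodebytes(armored.encode("ascii")).decode("ascii")
assert restored == text
```

The intermediate system only ever sees short lines drawn from a small alphabet, so none of the problematic patterns can occur in transit.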


Encoding standards

The table below compares the most used forms of binary-to-text encodings. The efficiency listed is the ratio between the number of bits in the input and the number of bits in the encoded output.

The 95 isprint codes 32 to 126 are known as the ASCII printable characters. Some older and today uncommon formats include BOO, BTOA, and USR encoding. Most of these encodings generate text containing only a subset of all ASCII printable characters: for example, the base64 encoding generates text that contains only uppercase and lowercase letters (A–Z, a–z), numerals (0–9), and the "+", "/", and "=" symbols.

Some of these encodings (quoted-printable and percent encoding) are based on a set of allowed characters and a single escape character. The allowed characters are left unchanged, while all other characters are converted into a string starting with the escape character. This kind of conversion allows the resulting text to be almost readable, in that letters and digits are part of the allowed characters and are therefore left as they are in the encoded text. These encodings produce the shortest plain ASCII output for input that is mostly printable ASCII.

Some other encodings (base64, uuencoding) are based on mapping all possible sequences of six bits into different printable characters. Since there are more than 2^6 = 64 printable characters, this is possible. A given sequence of bytes is translated by viewing it as a stream of bits, breaking this stream into chunks of six bits, and generating the sequence of corresponding characters. The different encodings differ in the mapping between sequences of bits and characters and in how the resulting text is formatted.

Some encodings (the original version of BinHex and the recommended encoding for CipherSaber) use four bits instead of six, mapping all possible sequences of 4 bits onto the 16 standard hexadecimal digits. Using 4 bits per encoded character leads to a 50% longer output than base64, but simplifies encoding and decoding: expanding each byte in the source independently to two encoded bytes is simpler than base64's expansion of 3 source bytes to 4 encoded bytes.

Out of PETSCII's first 192 codes, 164 have visible representations when quoted: 5 (white), 17–20 and 28–31 (colors and cursor controls), 32–90 (ASCII equivalent), 91–127 (graphics), 129 (orange), 133–140 (function keys), 144–159 (colors and cursor controls), and 160–192 (graphics) (http://sta.c64.org/cbm64pet.html et al.). This theoretically permits encodings, such as base128, between PETSCII-speaking machines.
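
The six-bit and four-bit schemes can be compared with a short Python sketch. The helper names `encode_six_bit` and `encode_four_bit` are hypothetical and written only for illustration; the six-bit version assumes the standard Base64 alphabet and "=" padding, and is checked against the standard-library `base64` module rather than presented as how any particular tool implements it.

```python
import base64

# The Base64 alphabet: one printable character for each six-bit value 0..63.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def encode_six_bit(data: bytes) -> str:
    """Illustrative Base64-style encoder (hypothetical helper, not a library API)."""
    # View the input as one long string of bits.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Pad with zero bits so the length is a multiple of six.
    bits += "0" * (-len(bits) % 6)
    # Map each six-bit chunk to its character.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad with "=" so the output length is a multiple of four, as Base64 does.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)

def encode_four_bit(data: bytes) -> str:
    """Hexadecimal encoding: each byte independently becomes two digits."""
    return "".join(f"{byte:02x}" for byte in data)

sample = bytes(range(30))  # hypothetical 30-byte input
six = encode_six_bit(sample)
four = encode_four_bit(sample)

# Both helpers agree with the standard library on this input.
assert six == base64.b64encode(sample).decode("ascii")
assert four == sample.hex()

# Prints the input length followed by the two encoded lengths.
print(len(sample), len(six), len(four))
```

For the 30-byte sample the six-bit form takes 40 characters and the four-bit form takes 60, which is the 50% difference noted above. The four-bit encoder needs no bit-stream bookkeeping or padding, which is the simplicity it trades the extra length for.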


See also

* Alphanumeric shellcode
* Character encoding
* Compiling
* Computer number format
* Geocode
* Numeral systems, listed by notation type
* Punycode

