uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at UC Berkeley in 1980, for encoding binary data for transmission in email systems.
The name "uuencoding" is derived from Unix-to-Unix Copy (UUCP), i.e. "Unix-to-Unix encoding": a safe encoding for the transfer of arbitrary files from one Unix system to another, without any guarantee that the intervening links would all be Unix systems. Since an email message might be forwarded through or to computers with different character sets, sent over transports that are not 8-bit clean, or handled by programs that are not 8-bit clean, forwarding a binary file via email might cause it to be corrupted. By encoding such data into a character subset common to most character sets, the encoded form of such data files was unlikely to be "translated" or corrupted, and would thus arrive intact and unchanged at the destination. The program uudecode reverses the effect of uuencode, recreating the original binary file exactly. uuencode/uudecode became popular for sending binary (and especially compressed) files by email and for posting to Usenet newsgroups.
It has now been largely replaced by MIME and yEnc. With MIME, files that might have been uuencoded are instead transferred with base64 encoding.
Encoded format
A uuencoded file starts with a header line of the form:
begin <mode> <file><newline>
<mode> is the file's Unix file permissions as three octal digits (e.g. 644, 744). This is typically only significant to Unix-like operating systems.
<file> is the file name to be used when recreating the binary data.
<newline> signifies a newline character, used to terminate each line.
Each data line uses the format:
<length character><formatted characters><newline>
<length character> is a character indicating the number of data bytes which have been encoded on that line. This is an ASCII character determined by adding 32 to the actual byte count, with the sole exception of a grave accent "`" (ASCII code 96) signifying zero bytes. All data lines except possibly the last (if the data length is not divisible by 45) carry 45 bytes of source data (60 characters after encoding). Therefore, the vast majority of length values are "M" (32 + 45 = ASCII code 77).
<formatted characters> are the encoded characters. See Formatting mechanism for more details on the actual implementation.
The file ends with two lines:
`
end
The second-to-last line is also a length character, with the grave accent signifying zero bytes.
As a complete file, the uuencoded output for a plain text file named cat.txt containing only the characters ''Cat'' would be:
begin 644 cat.txt
#0V%T
`
end
The begin line is a standard uuencode header; the '#' indicates that its line encodes three characters; the last two lines appear at the end of all uuencoded files.
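The file layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference uuencode implementation: the function name uuencode_bytes is hypothetical, the per-line encoding is delegated to the standard binascii.b2a_uu function, and its backtick option (Python 3.7 or later) is assumed so that zero values are written as "`".

```python
import binascii

def uuencode_bytes(data: bytes, name: str, mode: int = 0o644) -> str:
    """Return the complete uuencoded text: header, data lines, trailer."""
    lines = [f"begin {mode:o} {name}"]
    # Each full data line carries 45 source bytes; the last may be shorter.
    for offset in range(0, len(data), 45):
        chunk = data[offset:offset + 45]
        # backtick=True writes the value 0 as "`" instead of a space.
        encoded = binascii.b2a_uu(chunk, backtick=True)
        lines.append(encoded.decode("ascii").rstrip("\n"))
    lines.append("`")    # zero-length line
    lines.append("end")
    return "\n".join(lines) + "\n"

print(uuencode_bytes(b"Cat", "cat.txt"), end="")
```

Run on the three bytes of "Cat", this prints the same four lines as the cat.txt example.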
Formatting mechanism
The mechanism of uuencoding repeats the following for every 3 bytes, encoding them into 4 printable characters, each character representing a radix-64 numerical digit:
# Start with 3 bytes from the source, 24 bits in total.
# Split into four 6-bit groupings, each representing a value in the range 0 to 63: bits (00-05), (06-11), (12-17) and (18-23).
# Add 32 to each of the values. The possible results therefore range from 32 (" " space) to 95 ("_" underline). 96 ("`" grave accent) as the "special character" is a logical extension of this range. Although the space character is documented as the encoding for the value 0, implementations such as GNU sharutils actually use the grave accent character to encode zeros in the body of the file as well, never using space.
# Output the ASCII equivalent of these numbers.
If the source length is not divisible by 3, the last 3-byte group is filled out with padding bytes before it is encoded into its 4-character section. These padding bytes are not counted in the line's length character, so the decoder does not append unwanted characters to the file.
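The steps above, including the padding rule, can be written out by hand as follows. This is a sketch under stated assumptions: uu_encode_line is a hypothetical helper name, zero bytes are used as padding, and the value 0 is written as a grave accent in the way the text attributes to GNU sharutils.

```python
def uu_encode_line(chunk: bytes) -> str:
    """Encode up to 45 bytes as one uuencoded data line (no newline)."""
    if len(chunk) > 45:
        raise ValueError("a data line holds at most 45 source bytes")

    def to_char(value: int) -> str:
        # Add 32 to the 6-bit value; 0 becomes "`" (code 96), not space.
        return "`" if value == 0 else chr(value + 32)

    out = [to_char(len(chunk))]                  # counts real bytes only
    padded = chunk + b"\0" * (-len(chunk) % 3)   # pad to a multiple of 3
    for i in range(0, len(padded), 3):
        b0, b1, b2 = padded[i:i + 3]
        word = (b0 << 16) | (b1 << 8) | b2       # 24 bits in total
        for shift in (18, 12, 6, 0):             # four 6-bit groups
            out.append(to_char((word >> shift) & 0x3F))
    return "".join(out)

print(uu_encode_line(b"Cat"))   # prints #0V%T
```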
uudecoding is the reverse of the above: subtract 32 from each character's ASCII code (modulo 64, to account for the grave accent usage) to get a 6-bit value, concatenate four 6-bit groups to get 24 bits, then output 3 bytes.
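A decoder for a single data line might look like the sketch below. The name uu_decode_line is hypothetical, and the code assumes well-formed input: it does not reject characters outside the uuencode range, which a production decoder would likely do.

```python
def uu_decode_line(line: str) -> bytes:
    """Decode one uuencoded data line back to its original bytes."""
    def to_value(ch: str) -> int:
        # Subtract 32 modulo 64, so both " " (32) and "`" (96) give 0.
        return (ord(ch) - 32) % 64

    line = line.rstrip("\r\n")
    if not line:
        return b""
    count = to_value(line[0])            # number of real bytes on the line
    body = line[1:]
    out = bytearray()
    for i in range(0, len(body), 4):
        group = body[i:i + 4].ljust(4)   # tolerate stripped trailing spaces
        word = 0
        for ch in group:
            word = (word << 6) | to_value(ch)   # rebuild the 24 bits
        out += word.to_bytes(3, "big")
    return bytes(out[:count])            # drop the padding bytes

print(uu_decode_line("#0V%T"))   # prints b'Cat'
```

Truncating to the count taken from the length character is what removes the padding added during encoding.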
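The encoding process is demonstrated by this table, which shows the derivation of the above encoding for "Cat".
Original characters: C, a, t
Original ASCII, decimal: 67, 97, 116
ASCII, binary: 01000011, 01100001, 01110100
6-bit groups, binary: 010000, 110110, 000101, 110100
New decimal values: 16, 54, 5, 52
+32: 48, 86, 37, 84
Uuencoded characters: 0, V, %, T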
uuencode table
The following table shows the conversion of the decimal value of the 6-bit fields obtained during the conversion process and their corresponding ASCII character output code and character.
Note that some encoders might produce space (code 32) instead of grave accent ("`", code 96), while some decoders might refuse to decode data containing space.
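The mapping from 6-bit value to output character follows directly from the add-32 rule, so it can be generated rather than looked up. The sketch below uses a hypothetical generator uu_table; its backtick parameter is an illustrative switch between the two conventions for the value 0 mentioned in the note above.

```python
def uu_table(backtick: bool = True):
    """Yield (6-bit value, ASCII code, character) for all 64 values."""
    for value in range(64):
        # Value 0 is either "`" (code 96) or space (code 32).
        code = 96 if (backtick and value == 0) else value + 32
        yield value, code, chr(code)

for value, code, char in uu_table():
    print(f"{value:2d}  {code:3d}  {char!r}")
```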
Example
The following is an example of uuencoding a one-line text file. In this example, %0D is the byte representation for carriage return (CR), and %0A is the byte representation for line feed (LF).
;file
File Name = wikipedia-url.txt
File Contents = http://www.wikipedia.org%0D%0A
;uuencoding
begin 644 wikipedia-url.txt
::'1T<#HO+W=W=RYW:6MI<&5D:6$N;W)G#0H`
`
end
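The data line of this example can be checked with Python's standard binascii module. This is only a verification sketch; the backtick option (Python 3.7 or later) is assumed so that the trailing zero value is written as "`", matching the output shown above.

```python
import binascii

data = b"http://www.wikipedia.org\r\n"   # %0D%0A is CR followed by LF
expected = b"::'1T<#HO+W=W=RYW:6MI<&5D:6$N;W)G#0H`\n"

line = binascii.b2a_uu(data, backtick=True)
assert line == expected
assert chr(32 + len(data)) == ":"         # 26 bytes give the length character ":"
assert binascii.a2b_uu(line) == data      # decoding restores the input
print(line.decode("ascii"), end="")
```

The 26 input bytes are padded with one zero byte to fill the ninth 3-byte group, which is why the line ends in a grave accent.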
Forks (file, resource)
Unix traditionally has a single fork where file data is stored. However, some file systems support multiple forks associated with a single file. For example, the classic Mac OS Hierarchical File System (HFS) supported a data fork and a ''resource fork''. Mac OS HFS+ supports multiple forks, as does Microsoft Windows NTFS with its ''alternate data streams''. Most uuencoding tools will only handle data from the primary data fork, which can result in a loss of information when encoding/decoding (for example, Windows NTFS file comments are kept in a different fork). Some tools (like the classic Mac OS application UUTool) solved the problem by concatenating the different forks into one file and differentiating them by file name.
Relation to xxencode, Base64, and Ascii85
Despite its limited range of characters, uuencoded data is sometimes corrupted on passage through certain computers using non-ASCII character sets such as EBCDIC. One attempt to solve the problem was the xxencode format, which used only alphanumeric characters and the plus and minus symbols. More common today is the Base64 format, which likewise relies on alphanumeric characters (plus two symbols) rather than ASCII codes 32–95. All three formats use 6 bits (64 different characters) to represent their input data.
Base64 can also be generated by the uuencode program and is similar in format, with the exception of the actual character translation:
The header is changed to
begin-base64 <mode> <file>
the trailer becomes
====
and lines between are encoded with characters chosen from
ABCDEFGHIJKLMNOP
QRSTUVWXYZabcdef
ghijklmnopqrstuv
wxyz0123456789+/
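As a rough illustration of this variant, the sketch below wraps standard Base64 output in the begin-base64 header and the "====" trailer line that POSIX specifies for it. The function name uuencode_base64 is hypothetical, and the line width is whatever base64.encodebytes produces (76 characters), which may differ from the width a particular uuencode implementation uses.

```python
import base64

def uuencode_base64(data: bytes, name: str, mode: int = 0o644) -> str:
    """Return data in the begin-base64 layout (line width is an assumption)."""
    # encodebytes emits newline-terminated lines of at most 76 characters.
    body = base64.encodebytes(data).decode("ascii")
    return f"begin-base64 {mode:o} {name}\n{body}====\n"

print(uuencode_base64(b"Cat", "cat.txt"), end="")
```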
Another alternative is Ascii85, which encodes four bytes of binary data in five ASCII characters. Ascii85 is used in PostScript and PDF formats.
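The four-to-five ratio can be observed with the a85encode and a85decode functions in Python's standard base64 module. The input value is arbitrary, and the adobe option is shown only to illustrate the "<~" and "~>" framing used by Adobe's variant.

```python
import base64

data = b"Cat!"                                 # four bytes of input
encoded = base64.a85encode(data)               # five ASCII characters
framed = base64.a85encode(data, adobe=True)    # adds the <~ ... ~> framing

print(len(data), len(encoded))                 # prints 4 5
assert base64.a85decode(encoded) == data
assert framed.startswith(b"<~") and framed.endswith(b"~>")
```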
Disadvantages
uuencoding takes 3 pre-formatted bytes and turns them into 4, and also adds begin/end tags, a filename, and delimiters. This adds at least 33% data overhead compared to the source alone, though this can be at least partly compensated for by compressing the file before uuencoding it.
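The overhead can be estimated with a little arithmetic: each full line spends 62 bytes (length character, 60 encoded characters, newline) on 45 source bytes. The sketch below uses a hypothetical helper uu_body_size, counts only the data lines (not the begin line or the trailer), assumes a one-byte newline, and cross-checks the formula against binascii.b2a_uu.

```python
import binascii

def uu_body_size(n: int) -> int:
    """Bytes occupied by the uuencoded data lines for n source bytes."""
    full, rest = divmod(n, 45)
    size = full * 62                          # 1 length char + 60 chars + newline
    if rest:
        size += 1 + 4 * ((rest + 2) // 3) + 1 # shorter final line
    return size

sample = bytes(range(256)) * 4                # 1024 bytes of test data
actual = sum(len(binascii.b2a_uu(sample[i:i + 45]))
             for i in range(0, len(sample), 45))
assert actual == uu_body_size(len(sample))
print(uu_body_size(len(sample)) / len(sample))   # roughly 1.38
```

With the length character and newline on every line included, the expansion is closer to 38% than to the 33% lower bound.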
Support in languages
Python
The Python language supports uuencoding using the codecs module with the codec "uu":
For Python 2 ''(deprecated/sunset as of January 1st 2020)'':
$ python -c 'print "Cat".encode("uu")'
begin 666 <data>
#0V%T
 
end
$
For Python 3 ''where the codecs module needs to be imported and used directly'':
$ python3 -c "from codecs import encode;print(encode(b'Cat', 'uu'))"
b'begin 666 <data>\n#0V%T\n \nend\n'
$
Perl
The Perl language supports uuencoding natively using the pack() and unpack() operators with the format string "u":
$ perl -e 'print pack("u","Cat")'
#0V%T
Decoding base64 with unpack can likewise be accomplished by translating the characters:
$ perl -e '$a="Q2F0"; $a=~tr#A-Za-z0-9+/\.\_##cd; # remove non-base64 chars
> $a=~tr#A-Za-z0-9+/# -_#; # translate sets
> print unpack("u",pack("C",32+int(length($1)*6 / 8)) . $1) while($a=~s/(.{1,60})//);'
Cat
See also
* Binary-to-text encoding for a comparison of various encoding algorithms
References
External links
* ''uuencode'' entry in POSIX.1-2008, http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html
* GNU-sharutils – open source suite of shar/unshar/uuencode/uudecode utilities
* UUDeview – open-source program to encode/decode Base64, BinHex, uuencode, xxencode, etc. for Unix/Windows/DOS
* UUENCODE-UUDECODE – open-source program to encode/decode created by Clem "Grandad" Dye
* Open-source fast UUDecoder for Macintosh by Stuart Cheshire
* UUENCODE-UUDECODE – free online UUEncoder and UUDecoder
* Java UUDecoder – open-source Java library for decoding uuencoded (mail) attachments
* AN11229 – NXP application note: UUencoding for UART ISP