Percent-encoding, also known as URL encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI) using only the limited US-ASCII characters legal within a URI. Although it is known as ''URL encoding'', it is also used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). It is also used in the preparation of data of the application/x-www-form-urlencoded media type, which is often used when submitting HTML form data in HTTP requests.
Percent-encoding in a URI
Types of URI characters
The characters allowed in a URI are either ''reserved'' or ''unreserved'' (or a percent character as part of a percent-encoding). ''Reserved'' characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or more generally, a URI). ''Unreserved'' characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.
Other characters in a URI must be percent-encoded.
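The classification can be sketched in Python as follows. The sets below are the ones given in RFC 3986 (sections 2.2 and 2.3); since the sets have changed between revisions, this applies to that revision only. The names `RESERVED`, `UNRESERVED`, and `classify` are illustrative and do not come from any library.

```python
import string

# Character sets as listed in RFC 3986, sections 2.2 and 2.3.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the category of a single character (illustrative helper)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    if ch == "%":
        return "percent"
    return "other"  # must be percent-encoded to appear in a URI


for ch in "A~/? é":
    print(repr(ch), classify(ch))
```

Here the space and the non-ASCII letter fall into the "other" category, so they can only appear in a URI in percent-encoded form.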
Reserved characters
When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some ''other'' purpose, then the character must be ''percent-encoded''. Percent-encoding a reserved character involves converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits (if there is a single hex digit, a leading zero is added). The digits, preceded by a percent sign (%) as an escape character, are then used in the URI in place of the reserved character.
(For a non-ASCII character, it is typically converted to its byte sequence in UTF-8, and then each byte value is represented as above.)
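The procedure can be written out by hand in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `percent_encode` is hypothetical, and for simplicity it encodes every character outside the RFC 3986 unreserved set, whether or not the character has a reserved purpose in context.

```python
import string

# Unreserved set of RFC 3986; everything else is encoded in this sketch.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode(text: str) -> str:
    """Percent-encode every character outside the unreserved set."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Convert to UTF-8 bytes (a single byte for ASCII characters),
            # then write each byte as '%' plus two uppercase hex digits.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode("a/b"))  # a%2Fb
print(percent_encode("\n"))   # %0A (leading zero added)
print(percent_encode("é"))    # %C3%A9 (two UTF-8 bytes)
```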
The reserved character /, for example, if used in the "path" component of a URI, has the special meaning of being a delimiter ''between'' path segments. If, according to a given URI scheme, / needs to be ''in'' a path segment, then the three characters %2F or %2f must be used in the segment instead of a raw /.
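As an illustration, Python's standard `urllib.parse.quote` leaves / unencoded by default, so a slash that is data inside a segment has to be encoded by passing `safe=""`. The helper name `build_path` and the sample segments are assumptions made for this sketch.

```python
from urllib.parse import quote


def build_path(segments: list[str]) -> str:
    """Join path segments, encoding any '/' that is data inside a segment."""
    # safe="" overrides quote()'s default of leaving '/' unencoded.
    return "/" + "/".join(quote(seg, safe="") for seg in segments)


print(build_path(["files", "2024/05", "report.txt"]))
# /files/2024%2F05/report.txt
```

The raw slashes in the result are delimiters, while the %2F belongs to the second segment.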
Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not.
In the "query" component of a URI (the part after a ? character), for example, / is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose.
URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally considered not equivalent (that is, not denoting the same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination depends on the rules established for reserved characters by individual URI schemes.
Unreserved characters
Characters from the unreserved set never need to be percent-encoded.
URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers ''should not'' treat %41 differently from A or %7E differently from ~, but some do. For maximal interoperability, URI producers are discouraged from percent-encoding unreserved characters.
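A consumer that wants to recognize this equivalence can decode percent-encoded unreserved characters before comparing URIs. The following is a sketch under stated assumptions: `normalize_unreserved` is a hypothetical helper, and uppercasing the hex digits of the escapes it leaves in place is an extra normalization choice made here, not something the text above requires.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode percent-encoded unreserved characters; leave the rest encoded."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Decode only unreserved characters; keep other escapes, uppercased.
        return ch if ch in UNRESERVED else match.group(0).upper()
    return _PCT.sub(repl, uri)


print(normalize_unreserved("/%7Euser/%41%2fb"))  # /~user/A%2Fb
```

The encoded slash stays encoded, because decoding a reserved character could change the meaning of the URI.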
Percent character
Because the percent character (%) serves as the indicator for percent-encoded octets, it must be percent-encoded as %25 for that octet to be used as data within a URI.
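A short illustration using Python's standard `urllib.parse` module (the sample string is arbitrary). It also shows a common consequence of this rule: encoding text that is already percent-encoded escapes each % a second time.

```python
from urllib.parse import quote, unquote

encoded = quote("100% cotton")
print(encoded)           # 100%25%20cotton
print(unquote(encoded))  # 100% cotton

# Encoding already-encoded text escapes the '%' again (double encoding).
print(quote(encoded))    # 100%2525%2520cotton
```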
Arbitrary data
Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often don't, provide an explicit mapping between URI characters and all possible data values being represented by those characters.
Binary data
Since the publication of RFC 1738 in 1994 it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above. Byte value 0x0F, for example, should be represented by %0F, but byte value 0x41 can be represented by A, or %41. The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred, as it results in shorter URLs.
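The byte values mentioned above can be checked with the standard-library functions `quote_from_bytes` and `unquote_to_bytes`; the third byte, 0xFF, is added here only as an example of a non-ASCII value.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

data = bytes([0x0F, 0x41, 0xFF])
encoded = quote_from_bytes(data, safe="")
print(encoded)  # %0FA%FF

# Both spellings of 0x41 decode to the same bytes; the first is shorter.
assert unquote_to_bytes(encoded) == data
assert unquote_to_bytes("%0F%41%FF") == data
```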
Character data
The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.
For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.
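The ambiguity is easy to reproduce. In this sketch the same character is percent-encoded under two different character encodings using Python's `urllib.parse`; the choice of "é", Latin-1, and UTF-8 is only an example.

```python
from urllib.parse import quote, unquote

latin1 = quote("é", encoding="latin-1")  # %E9
utf8 = quote("é", encoding="utf-8")      # %C3%A9
print(latin1, utf8)

# Decoding with the wrong assumption silently yields the wrong text.
print(unquote(utf8, encoding="latin-1"))  # Ã©
print(unquote(latin1, encoding="utf-8"))  # U+FFFD replacement character
```

Nothing in either encoded form indicates which encoding produced it, so a receiver has to know or guess.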
Arbitrary character data is sometimes percent-encoded and used in non-URI situations, such as for password-obfuscation programs or other system-specific translation protocols.
Current standard
The generic URI syntax recommends that new URI schemes that provide for the representation of character data in a URI should, in effect, represent characters from the unreserved set without translation and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This suggestion was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
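In recent Python versions, `urllib.parse.quote` called with `safe=""` behaves in this way: it passes letters, digits, and - . _ ~ through unchanged and percent-encodes the UTF-8 bytes of everything else. A round trip, with an arbitrary sample string:

```python
from urllib.parse import quote, unquote

text = "Zürich & Genève_1.0~"
encoded = quote(text, safe="")
print(encoded)  # Z%C3%BCrich%20%26%20Gen%C3%A8ve_1.0~

# Decoding reverses the process, again assuming UTF-8.
assert unquote(encoded) == text
```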
Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary or character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.
Non-standard implementations
There exists a non-standard encoding for Unicode characters: %u''xxxx'', where ''xxxx'' is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The 13th edition of ECMA-262 still includes an escape function that uses this syntax, along with encodeURI and encodeURIComponent functions, which apply UTF-8 encoding to a string and then percent-escape the resulting bytes.
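The non-standard form can be illustrated in Python. The helpers `escape_u` and `unescape_u` are hypothetical and only demonstrate the %u''xxxx'' notation itself; they write every character in that form and are not a reimplementation of any ECMA-262 function.

```python
import re


def escape_u(text: str) -> str:
    """Write each UTF-16 code unit of text as %uXXXX (non-standard form)."""
    raw = text.encode("utf-16-be")
    units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(f"%u{unit:04X}" for unit in units)


def unescape_u(encoded: str) -> str:
    """Reverse escape_u for strings consisting only of %uXXXX sequences."""
    units = [int(h, 16) for h in re.findall(r"%u([0-9A-Fa-f]{4})", encoded)]
    raw = b"".join(unit.to_bytes(2, "big") for unit in units)
    return raw.decode("utf-16-be")


print(escape_u("€"))           # %u20AC
print(escape_u("\U0001F600"))  # %uD83D%uDE00 (a surrogate pair)
```

A character outside the Basic Multilingual Plane takes two code units, so it produces two %u sequences.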
The application/x-www-form-urlencoded type
When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email. The encoding used by default is based on an early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with + instead of %20. The media type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined in the HTML and XForms specifications. In addition, the CGI specification contains rules for how web servers decode data of this type and make it available to applications.
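Python's `urllib.parse.urlencode` produces this format, which makes the difference from plain percent-encoding visible. The field names and values are made up for the example, and the sketch does not perform the newline normalization mentioned above.

```python
from urllib.parse import parse_qsl, quote, urlencode

fields = [("name", "Ada Lovelace"), ("note", "1+1=2 & more")]
body = urlencode(fields)
print(body)                   # name=Ada+Lovelace&note=1%2B1%3D2+%26+more
print(quote("Ada Lovelace"))  # Ada%20Lovelace (plain percent-encoding)

# Decoding restores the original names and values.
assert parse_qsl(body) == fields
```

Because + stands for a space in this format, a literal plus sign in the data has to be sent as %2B.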
When HTML form data is sent in an HTTP GET request, it is included in the query component of the request URI using the same syntax described above. When sent in an HTTP POST request or via email, the data is placed in the body of the message, and application/x-www-form-urlencoded is included in the message's Content-Type header.
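The two placements can be sketched without sending anything over the network. The function `build_request`, the dictionary layout it returns, and the example.com URL are all assumptions made for illustration.

```python
from urllib.parse import urlencode

FORM_TYPE = "application/x-www-form-urlencoded"


def build_request(method: str, url: str, fields: dict[str, str]) -> dict:
    """Describe where encoded form data goes for GET versus POST."""
    encoded = urlencode(fields)
    if method == "GET":
        # GET: the data becomes the query component of the request URI.
        return {"method": "GET", "url": f"{url}?{encoded}",
                "headers": {}, "body": b""}
    # POST: the data travels in the body, labelled by Content-Type.
    return {"method": "POST", "url": url,
            "headers": {"Content-Type": FORM_TYPE},
            "body": encoded.encode("ascii")}


get = build_request("GET", "https://example.com/search", {"q": "a b"})
post = build_request("POST", "https://example.com/search", {"q": "a b"})
print(get["url"])    # https://example.com/search?q=a+b
print(post["body"])  # b'q=a+b'
```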
See also
* Internationalized Resource Identifier
* Punycode
* Binary-to-text encoding for a comparison of various encoding algorithms
* Shellcode
External links
The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other:
* RFC 3986 / STD 66 (plus errata), the current generic URI syntax specification.
* RFC 2396 (obsolete, plus errata) and RFC 2732 (plus errata), which together comprised the previous version of the generic URI syntax specification.
* RFC 1738 (mostly obsolete) and RFC 1808 (obsolete), which define URLs.
* RFC 1630 (obsolete), the first generic URI syntax specification.
* W3C Guidelines on Naming and Addressing: URIs, URLs, ...
Various implementations:
* DevPal URL encoder: online developer tools that support URL encoding.
* Online URL encoder and decoder: encodes or decodes URLs within the browser.
* URL Encoder online: a website with various options to convert files or texts into URL-encoded format.
* URL Encode and Decode - Online: a website with various options to convert files or texts into URL-encoded format.