HOME

TheInfoList



OR:

Content sniffing, also known as media type sniffing or MIME sniffing, is the practice of inspecting the content of a
byte stream A bitstream (or bit stream), also known as binary sequence, is a sequence of bits. A bytestream is a sequence of bytes. Typically, each byte is an Octet (computing), 8-bit quantity, and so the term octet stream is sometimes used interchangeab ...
to attempt to deduce the
file format A file format is a Computer standard, standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary format, pr ...
of the data within it. Content sniffing is generally used to compensate for a lack of accurate
metadata Metadata (or metainformation) is "data that provides information about other data", but not the content of the data itself, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive ...
that would otherwise be required to enable the file to be interpreted correctly. Content sniffing techniques tend to use a mixture of techniques that rely on the redundancy found in most file formats: looking for file signatures and magic numbers, and
heuristic A heuristic or heuristic technique (''problem solving'', '' mental shortcut'', ''rule of thumb'') is any approach to problem solving that employs a pragmatic method that is not fully optimized, perfected, or rationalized, but is nevertheless ...
s including searching for well-known representative substrings, the use of byte frequency and ''n''-gram tables, and
Bayesian inference Bayesian inference ( or ) is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian infer ...
. Multipurpose Internet Mail Extensions (MIME) sniffing was, and still is, used by some
web browser A web browser, often shortened to browser, is an application for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's scr ...
s, including notably
Microsoft Microsoft Corporation is an American multinational corporation and technology company, technology conglomerate headquartered in Redmond, Washington. Founded in 1975, the company became influential in the History of personal computers#The ear ...
's
Internet Explorer Internet Explorer (formerly Microsoft Internet Explorer and Windows Internet Explorer, commonly abbreviated as IE or MSIE) is a deprecation, retired series of graphical user interface, graphical web browsers developed by Microsoft that were u ...
, in an attempt to help web sites which do not correctly signal the MIME type of web content display. However, doing this opens up a serious security vulnerability, in which, by confusing the MIME sniffing algorithm, the browser can be manipulated into interpreting data in a way that allows an attacker to carry out operations that are not expected by either the site operator or user, such as
cross-site scripting Cross-site scripting (XSS) is a type of security vulnerability that can be found in some web applications. XSS attacks enable attackers to inject client-side scripts into web pages viewed by other users. A cross-site scripting vulnerability may be ...
. Moreover, by making sites which do not correctly assign MIME types to content appear to work correctly in those browsers, it fails to encourage the correct labeling of material, which in turn makes content sniffing necessary for these sites to work, creating a vicious circle of incompatibility with web standards and security best practices. A specification exists for media type sniffing in
HTML5 HTML5 (Hypertext Markup Language 5) is a markup language used for structuring and presenting hypertext documents on the World Wide Web. It was the fifth and final major HTML version that is now a retired World Wide Web Consortium (W3C) recommend ...
, which attempts to balance the requirements of security with the need for reverse compatibility with web content with missing or incorrect MIME-type data. It attempts to provide a precise specification that can be used across implementations to implement a single well-defined and deterministic set of behaviors. The UNIX command can be viewed as a content sniffing application.


Charset sniffing

Numerous web browsers use a more limited form of content sniffing to attempt to determine the
character encoding Character encoding is the process of assigning numbers to graphical character (computing), characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical v ...
of text files for which the MIME type is already known. This technique is known as charset sniffing or codepage sniffing and, for certain encodings, may be used to bypass security restrictions too. For instance,
Internet Explorer 7 Windows Internet Explorer 7 (IE7) (codenamed Rincon) is a version of Internet Explorer, a web browser for Windows. It was released by Microsoft on October 18, 2006. It was the first major update to the browser since 2001. It does not support ve ...
may be tricked to run
JScript JScript is Microsoft's legacy dialect of the ECMAScript standard that is used in Microsoft's Internet Explorer web browser and HTML Applications, and as a standalone Windows scripting language. JScript is implemented as an Active Scripting eng ...
in circumvention of its policy by allowing the browser to guess that an
HTML Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets ( ...
-file was encoded in UTF-7. This bug is worsened by the feature of the UTF-7 encoding which permits multiple encodings of the same text and, specifically, alternative representations of
ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
characters. Most encodings do not allow evasive presentations of ASCII characters, so charset sniffing is less dangerous in general because, due to the historical accident of the ASCII-centric nature of scripting and markup languages, characters outside the ASCII repertoire are more difficult to use to circumvent security boundaries, and misinterpretations of character sets tend to produce results no worse than the display of
mojibake Mojibake (; , 'character transformation') is the garbled or gibberish text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often ...
.


See also

*
Browser sniffing Browser sniffing (also known as browser detection) is a set of techniques used in websites and web applications in order to determine the web browser a visitor is using, and to serve browser-appropriate content to the visitor. It is also used to d ...


References


External links


X-Content-Type-Options header

MIME Sniffing Standard
* * * {{cite web, url=http://deletethis.net/dave/?q=mime-sniffing, title=Mime-sniffing, author=David Risney, access-date=2012-07-14 Heuristics * Web technology