HTML Sanitization
   HOME

TheInfoList



OR:

In data sanitization, HTML sanitization is the process of examining an
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
document and producing a new HTML document that preserves only whatever tags are designated "safe" and desired. HTML sanitization can be used to protect against attacks such as
cross-site scripting Cross-site scripting (XSS) is a type of security vulnerability that can be found in some web applications. XSS attacks enable attackers to inject client-side scripts into web pages viewed by other users. A cross-site scripting vulnerability may ...
(XSS) by sanitizing any HTML code submitted by a user.


Details

Basic tags for changing fonts are often allowed, such as <b>, <i>, <u>, <em>, and <strong> while more advanced tags such as <script>, <object>, <embed>, and <link> are removed by the sanitization process. Also potentially dangerous
attributes Attribute may refer to: * Attribute (philosophy), an extrinsic property of an object * Attribute (research), a characteristic of an object * Grammatical modifier, in natural languages * Attribute (computing), a specification that defines a prope ...
such as the onclick attribute are removed in order to prevent malicious code from being injected. Sanitization is typically performed by using either a
whitelist A whitelist, allowlist, or passlist is a mechanism which explicitly allows some identified entities to access a particular privilege, service, mobility, or recognition i.e. it is a list of things allowed when everything is denied by default. It is ...
or a
blacklist Blacklisting is the action of a group or authority compiling a blacklist (or black list) of people, countries or other entities to be avoided or distrusted as being deemed unacceptable to those making the list. If someone is on a blacklist, t ...
approach. Leaving a safe HTML element off a whitelist is not so serious; it simply means that that feature will not be included post-sanitation. On the other hand, if an unsafe element is left off a blacklist, then the vulnerability will not be sanitized out of the HTML output. An out-of-date blacklist can therefore be dangerous if new, unsafe features have been introduced to the HTML Standard. Further sanitization can be performed based on rules which specify what operation is to be performed on the subject tags. Typical operations include removal of the tag itself while preserving the content, preserving only the textual content of a tag or forcing certain values on attributes.


Implementations

In
PHP PHP is a general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by The PHP Group. ...
, HTML sanitization can be performed using the strip_tags() function at the risk of removing all textual content following an unclosed less-than symbol or angle bracket. The HTML Purifier library is another popular option for PHP applications. In
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
(and .NET), sanitization can be achieved by using the
OWASP The Open Web Application Security Project (OWASP) is an online community that produces freely-available articles, methodologies, documentation, tools, and technologies in the field of web application security. The OWASP provides free and open ...
Java HTML Sanitizer Project. In .NET, a number of sanitizers use the Html Agility Pack, an HTML parser. In
JavaScript JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of Website, websites use JavaScript on the Client (computing), client side ...
there are "JS-only" sanitizers for the back end, and browser-based implementations that use browser's own
Document Object Model The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an XML or HTML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document wi ...
(DOM) parser to parse the HTML (for better performance).


References

{{Reflist HTML