HTML sanitization

Last updated

In data sanitization, HTML sanitization is the process of examining an HTML document and producing a new HTML document that preserves only whatever tags are designated "safe" and desired. HTML sanitization can be used to protect against attacks such as cross-site scripting (XSS) by sanitizing any HTML code submitted by a user.

Contents

Details

Basic tags for changing fonts are often allowed, such as <b>, <i>, <u>, <em>, and <strong> while more advanced tags such as <script>, <object>, <embed>, and <link> are removed by the sanitization process. Also potentially dangerous attributes such as the onclick attribute are removed in order to prevent malicious code from being injected.

Sanitization is typically performed by using either a whitelist or a blacklist approach. Leaving a safe HTML element off a whitelist is not so serious; it simply means that that feature will not be included post-sanitation. On the other hand, if an unsafe element is left off a blacklist, then the vulnerability will not be sanitized out of the HTML output. An out-of-date blacklist can therefore be dangerous if new, unsafe features have been introduced to the HTML Standard.

Further sanitization can be performed based on rules which specify what operation is to be performed on the subject tags. Typical operations include removal of the tag itself while preserving the content, preserving only the textual content of a tag or forcing certain values on attributes. [1]

Implementations

In PHP, HTML sanitization can be performed using the strip_tags() function at the risk of removing all textual content following an unclosed less-than symbol or angle bracket. [2] The HTML Purifier library is another popular option for PHP applications. [3]

In Java (and .NET), sanitization can be achieved by using the OWASP Java HTML Sanitizer Project. [4]

In .NET, a number of sanitizers use the Html Agility Pack, an HTML parser. [5] [6] [1]

In JavaScript there are "JS-only" sanitizers for the back end, and browser-based [7] implementations that use browser's own Document Object Model (DOM) parser to parse the HTML (for better performance).

Related Research Articles

Document Object Model Convention for representing and interacting with objects in HTML, XHTML and XML documents

The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an XML or HTML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document with a logical tree. Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree; with them one can change the structure, style or content of a document. Nodes can have event handlers attached to them. Once an event is triggered, the event handlers get executed.

Server-side scripting is a technique used in web development which involves employing scripts on a web server which produces a response customized for each user's (client's) request to the website. The alternative is for the web server itself to deliver a static web page. Scripts can be written in any of a number of server-side scripting languages that are available. Server-side scripting is distinguished from client-side scripting where embedded scripts, such as JavaScript, are run client-side in a web browser, but both techniques are often used together.

Bookmarklet Bookmark stored in web browsers containing JavaScript commands to add new features

A bookmarklet is a bookmark stored in a web browser that contains JavaScript commands that add new features to the browser. They are stored as the URL of a bookmark in a web browser or as a hyperlink on a web page. Bookmarklets are usually small snippets of JavaScript executed when user clicks on them. When clicked, bookmarklets can perform a wide variety of operations, such as running a search query from selected text or extracting data from a table.

Cross-site scripting Computer security vulnerability

Cross-site scripting (XSS) is a type of security vulnerability that can be found in some web applications. XSS attacks enable attackers to inject client-side scripts into web pages viewed by other users. A cross-site scripting vulnerability may be used by attackers to bypass access controls such as the same-origin policy. Cross-site scripting carried out on websites accounted for roughly 84% of all security vulnerabilities documented by Symantec up until 2007. XSS effects vary in range from petty nuisance to significant security risk, depending on the sensitivity of the data handled by the vulnerable site and the nature of any security mitigation implemented by the site's owner network.

A lightweight markup language (LML), also termed a simple or humane markup language, is a markup language with simple, unobtrusive syntax. It is designed to be easy to write using any generic text editor and easy to read in its raw form. Lightweight markup languages are used in applications where it may be necessary to read the raw document as well as the final rendered output.

Code injection Computer bug exploit caused by invalid data

Code injection is the exploitation of a computer bug that is caused by processing invalid data. The injection is used by an attacker to introduce code into a vulnerable computer program and change the course of execution. The result of successful code injection can be disastrous, for example, by allowing computer viruses or computer worms to propagate.

JSON Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a common data format with diverse uses in electronic data interchange, including that of web applications with servers.

Dynamic web page Type of web page

A server-side dynamic web page is a web page whose construction is controlled by an application server processing server-side scripts. In server-side scripting, parameters determine how the assembly of every new web page proceeds, including the setting up of more client-side processing.

The term CDATA, meaning character data, is used for distinct, but related, purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

A webform, web form or HTML form on a web page allows a user to enter data that is sent to a server for processing. Forms can resemble paper or database forms because web users fill out the forms using checkboxes, radio buttons, or text fields. For example, forms can be used to enter shipping or credit card data to order a product, or can be used to retrieve search results from a search engine.

jQuery is a JavaScript library designed to simplify HTML DOM tree traversal and manipulation, as well as event handling, CSS animation, and Ajax. It is free, open-source software using the permissive MIT License. As of May 2019, jQuery is used by 73% of the 10 million most popular websites. Web analysis indicates that it is the most widely deployed JavaScript library by a large margin, having at least 3 to 4 times more usage than any other JavaScript library.

CodeLite Integrated development environment

CodeLite is a free and open-source IDE for the C, C++, PHP, and JavaScript (Node.js) programming languages.

Cross-site request forgery, also known as one-click attack or session riding and abbreviated as CSRF or XSRF, is a type of malicious exploit of a website where unauthorized commands are submitted from a user that the web application trusts. There are many ways in which a malicious website can transmit such commands; specially-crafted image tags, hidden forms, and JavaScript XMLHttpRequests, for example, can all work without the user's interaction or even knowledge. Unlike cross-site scripting (XSS), which exploits the trust a user has for a particular site, CSRF exploits the trust that a site has in a user's browser

Opa (programming language)

Opa is an open-source programming language for developing scalable web applications.

Mustache is a web template system with implementations available for ActionScript, C++, Clojure, CoffeeScript, ColdFusion, Common Lisp, Crystal, D, Dart, Delphi, Elixir, Erlang, Fantom, Go, Haskell, Io, Java, JavaScript, Julia, Lua, .NET, Objective-C, OCaml, Perl, PHP, Pharo, Python, R, Racket, Raku, Ruby, Rust, Scala, Smalltalk, Swift, Tcl, CFEngine, and XQuery.

Content Security Policy (CSP) is a computer security standard introduced to prevent cross-site scripting (XSS), clickjacking and other code injection attacks resulting from execution of malicious content in the trusted web page context. It is a Candidate Recommendation of the W3C working group on Web Application Security, widely supported by modern web browsers. CSP provides a standard method for website owners to declare approved origins of content that browsers should be allowed to load on that website—covered types are JavaScript, CSS, HTML frames, web workers, fonts, images, embeddable objects such as Java applets, ActiveX, audio and video files, and other HTML5 features.

XHP is an augmentation of PHP and Hack developed at Meta to allow XML syntax for the purpose of creating custom and reusable HTML elements. It is available as an open-source software GitHub project and as a Homebrew module for PHP 5.3, 5.4, and 5.5. Facebook have also developed a similar augmentation for JavaScript, namely JSX.

A headless browser is a web browser without a graphical user interface.

React (JavaScript library) JavaScript library for building user interfaces

React is a free and open-source front-end JavaScript library for building user interfaces based on UI components. It is maintained by Meta and a community of individual developers and companies. React can be used as a base in the development of single-page, mobile, or server-rendered applications with frameworks like Next.js. However, React is only concerned with state management and rendering that state to the DOM, so creating React applications usually requires the use of additional libraries for routing, as well as certain client-side functionality.

Vue.js Open-source JavaScript framework for building user interfaces

Vue.js is an open-source model–view–viewmodel front end JavaScript framework for building user interfaces and single-page applications. It was created by Evan You, and is maintained by him and the rest of the active core team members.

References

  1. 1 2 "HtmlRuleSanitizer". GitHub . 13 August 2021.
  2. "strip_tags". PHP.NET.
  3. "HTML Purifier - Filter your HTML the standards-compliant way!". htmlpurifier.org.
  4. "OWASP Java HTML Sanitizer".
  5. "Archived copy". Archived from the original on 2013-01-01. Retrieved 2013-01-04.{{cite web}}: CS1 maint: archived copy as title (link)
  6. "Whitelist santize with HtmlAgilityPack". 14 June 2011.
  7. "JS HTML Sanitizer". GitHub . 14 October 2021.