Data URI scheme

The data URI scheme is a uniform resource identifier (URI) scheme that provides a way to include data in-line in Web pages as if they were external resources. It is a form of file literal or here document. This technique allows normally separate elements such as images and style sheets to be fetched in a single Hypertext Transfer Protocol (HTTP) request, which may be more efficient than multiple HTTP requests, [1] and it is used by several browser extensions to package images and other multimedia content in a single HTML file for page saving. [2] [3] As of 2024, data URIs are fully supported by all major browsers. [4]

Syntax

The syntax of data URIs is defined in Request for Comments (RFC) 2397, published in August 1998, [5] and follows the URI scheme syntax. A data URI consists of:

data:[<media type>][;base64],<data>

Examples of data URIs showing most of the features are:

data:text/vnd-example+xyz;foo=bar;base64,R0lGODdh
data:text/plain;charset=UTF-8;page=21,the%20data:1234,5678
(outputs: "the data:1234,5678")

data:image/svg+xml;utf8,<svg width='10'... </svg>

The minimal data URI is data:,, consisting of the scheme, no media-type, and zero-length data.

Thus, within the overall URI syntax, a data URI consists of a scheme and a path, with no authority part, query string, or fragment. The optional media type, the optional base64 indicator, and the data are all parts of the URI path.
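The structure described above can be illustrated with a small parser. This is a minimal sketch in Python, not a complete RFC 2397 implementation: the function name `parse_data_uri` is hypothetical, and the sketch assumes the media type contains no quoted parameter value with a comma in it, so that the first comma always separates the header from the data.

```python
import base64
from urllib.parse import unquote_to_bytes

# Default media type that RFC 2397 assigns when none is given.
DEFAULT_MEDIA_TYPE = "text/plain;charset=US-ASCII"


def parse_data_uri(uri):
    """Split a data URI into (media_type, is_base64, decoded_bytes).

    Hypothetical helper for illustration only.
    """
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() != "data":
        raise ValueError("not a data URI")
    # Everything before the FIRST comma is the header; the rest is the data,
    # which may itself contain commas.
    header, sep, payload = rest.partition(",")
    if not sep:
        raise ValueError("data URI has no comma separator")
    is_base64 = header.lower().endswith(";base64")
    if is_base64:
        header = header[: -len(";base64")]
    raw = unquote_to_bytes(payload)  # undo percent-encoding
    data = base64.b64decode(raw) if is_base64 else raw
    return header or DEFAULT_MEDIA_TYPE, is_base64, data


print(parse_data_uri("data:text/plain;charset=UTF-8;page=21,the%20data:1234,5678"))
# ('text/plain;charset=UTF-8;page=21', False, b'the data:1234,5678')
```

Applied to the minimal URI `data:,`, the sketch returns the default media type and zero bytes of data.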

Examples of use

HTML

An HTML fragment embedding a base64-encoded PNG picture of a small red dot:

<img alt="" src="" style="width:36pt;height:36pt" />

In this example, the lines are broken for formatting purposes. In actual URIs, including data URIs, control characters (ASCII 0 to 31, and 127) and spaces (ASCII 32) are "excluded characters", so whitespace is not permitted in data URIs. However, in the context of HTML 4 and HTML 5, linefeeds within an element attribute value (such as the "src" above) are ignored [citation needed], so the data URI above would be processed without the linefeeds and give the correct result. Note that this is an HTML feature, not a data URI feature; in other contexts, whitespace within the URI cannot be relied upon to be ignored.
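Because whitespace handling depends on the consuming context, a producer can simply emit the URI on a single line. The following Python sketch shows one way to do that; the helper names `make_base64_data_uri` and `strip_uri_whitespace` are hypothetical, and the six-byte payload is only a stand-in for real image data.

```python
import base64


def make_base64_data_uri(data, media_type):
    """Return a single-line base64 data URI for the given bytes."""
    # b64encode never inserts line breaks, unlike base64.encodebytes,
    # which adds a newline after every 76 output characters.
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{media_type};base64,{payload}"


def strip_uri_whitespace(uri):
    """Remove all whitespace, e.g. from a URI that was wrapped for display."""
    return "".join(uri.split())


print(make_base64_data_uri(b"GIF87a", "image/gif"))
# data:image/gif;base64,R0lGODdh
```

The second helper is useful when a data URI has been copied from a listing that was wrapped for readability and must be used somewhere that does not ignore linefeeds.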

An HTML fragment embedding a UTF-8 encoded SVG picture of a small red dot:

<img alt="Red dot" src="data:image/svg+xml;utf8,<svg width='10' height='10' xmlns='http://www.w3.org/2000/svg'> <circle style='fill:red' cx='5' cy='5' r='5'/></svg>" />

In this example, the image data is UTF-8 encoded text, so it can be broken into multiple lines for easy reading. Single quotes have to be used in the SVG data because double quotes delimit the image source attribute.

A favicon can also be made with UTF-8 encoded SVG data; the link element has to appear in the 'head' section of the HTML:

<link rel="icon" href='data:image/svg+xml;utf8,<svg width="10" height="10" xmlns="http://www.w3.org/2000/svg"> <circle style="fill:red" cx="5" cy="5" r="5"/></svg>' />
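The SVG examples above place characters such as "<", ">" and spaces directly in the URI and so depend on the browser tolerating them. A stricter alternative, offered here as an assumption rather than something the examples require, is to percent-encode the markup, which also removes the need to choose between single and double quotes. The following Python sketch uses a hypothetical helper named `svg_to_data_uri`.

```python
from urllib.parse import quote, unquote


def svg_to_data_uri(svg_text):
    """Percent-encode SVG markup into a data URI (illustrative helper)."""
    # safe="" makes quote() escape every reserved character, including
    # "/", "<", ">", quotes, "#" and spaces.
    return "data:image/svg+xml;charset=utf-8," + quote(svg_text, safe="")


svg = ("<svg width='10' height='10' xmlns='http://www.w3.org/2000/svg'>"
       "<circle style='fill:red' cx='5' cy='5' r='5'/></svg>")
uri = svg_to_data_uri(svg)

# Decoding the part after the first comma gives back the original markup.
assert unquote(uri.split(",", 1)[1]) == svg
```

The resulting URI is longer than the unescaped form but contains only characters that are legal in a URI.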

CSS

A Cascading Style Sheets (CSS) rule that includes a background image:

ul.checklist li.complete {
    padding-left: 20px;
    background: white url('data:image/png;base64,iVB\
ORw0KGgoAAAANSUhEUgAAABAAAAAQAQMAAAAlPW0iAAAABlBMVEU\
AAAD///+l2Z/dAAAAM0lEQVR4nGP4/5/h/1+G/58ZDrAz3D/McH8\
yw83NDDeNGe4Ug9C9zwz3gVLMDA/A6P9/AFGGFyjOXZtQAAAAAEl\
FTkSuQmCC') no-repeat scroll left top;
}

In this example, the \ + <linefeed> line terminators are a feature of CSS, indicating continuation on the next line. These would be removed by the CSS stylesheet processor, and the data URI would be reconstituted without whitespace, making it correct, since whitespace is not allowed within the data component of a data: URI.
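The effect of the continuation rule can be imitated outside a CSS processor. The Python sketch below removes each backslash-linefeed pair from a string; the function name `join_css_continuations` is hypothetical and the wrapped value is a shortened, illustrative one, not the full image from the rule above.

```python
def join_css_continuations(css_text):
    """Drop each backslash-newline pair, as a CSS parser does inside strings."""
    return css_text.replace("\\\r\n", "").replace("\\\n", "")


# Shortened, illustrative value wrapped after the first three payload characters.
wrapped = "data:image/png;base64,iVB\\\nORw0KGgo"
print(join_css_continuations(wrapped))
# data:image/png;base64,iVBORw0KGgo
```

After the pairs are removed, the URI contains no whitespace and can be handed to any data URI consumer.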

JavaScript

A JavaScript statement that opens an embedded subwindow, as for a footnote link:

window.open('data:text/html;charset=utf-8,' +
    encodeURIComponent( // Escape for URL formatting
        '<!DOCTYPE html>' +
        '<html lang="en">' +
        '<head><title>Embedded Window</title></head>' +
        '<body><h1>42</h1></body>' +
        '</html>'
    )
);

SVG

Figure: Example of an SVG image with embedded JPEG images (35 mm angle of view vs focal length.svg)

A Scalable Vector Graphics (SVG) image containing an embedded JPEG image encoded in Base64:

<svg><image width="64" height="24" href="" /></svg>

Malware and phishing

Data URIs can be used to construct attack pages that attempt to obtain usernames and passwords from unsuspecting web users. They can also be used to get around cross-site scripting (XSS) restrictions: the attack payload is embedded entirely in the URI shown in the address bar and can be distributed via URL shortening services, so no full website controlled by a third party is needed. [8] As a result, some browsers now block webpages from navigating to data URIs. [9]
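An application that follows user-supplied links can apply a similar restriction itself. The Python sketch below is a hypothetical policy check, not the logic of any particular browser; the names `is_data_uri` and `allow_top_level_navigation` are invented for illustration. It assumes that the URL parser in use drops ASCII tabs and newlines and ignores leading control characters and spaces, so the check does the same before comparing the scheme.

```python
# Assumption: the consuming URL parser removes tabs and newlines anywhere in a URL.
_IGNORED = str.maketrans("", "", "\t\n\r")
# Assumption: leading C0 control characters and spaces are ignored as well.
_LEADING_JUNK = "".join(chr(c) for c in range(0x21))


def is_data_uri(url):
    """Return True if the URL would be treated as a data URI."""
    cleaned = url.translate(_IGNORED).lstrip(_LEADING_JUNK)
    return cleaned[:5].lower() == "data:"  # schemes are case-insensitive


def allow_top_level_navigation(url):
    """Hypothetical policy: refuse to navigate the top-level page to a data URI."""
    return not is_data_uri(url)


print(allow_top_level_navigation("  DATA:text/html,<h1>login</h1>"))
# False
```

Normalizing before the comparison matters because a check on the raw string would miss variants such as an upper-case scheme or a tab inserted into the scheme name.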

References

  1. "Using Data URIs to Speed Up Your Website". Treehouse Blog. 27 March 2014.
  2. "SingleFile - Chrome Web Store". Chrome Web Store. Retrieved 25 August 2018.
  3. "SingleFile – Add-ons for Firefox". Firefox Add-ons. Retrieved 25 August 2018.
  4. Deveria, Alexis (July 2015). "Can I use..." Retrieved 31 August 2015.
  5. Masinter, L. (August 1998). "RFC 2397 - The 'data' URL scheme". Internet Engineering Task Force. Retrieved 12 August 2008.
  6. Freed, Ned; Dürst, Martin, eds. (20 December 2013). "Character Sets". Internet Assigned Numbers Authority. Retrieved 31 August 2015.
  7. Berners-Lee, Tim; Fielding, Roy; Masinter, Larry (January 2005). "Uniform Resource Identifiers (URI): Generic Syntax". Internet Engineering Task Force. Retrieved 31 August 2015.
  8. "Phishing without a webpage – researcher reveals how a link itself can be malicious". Naked Security by Sophos. 31 August 2012. https://nakedsecurity.sophos.com/2012/08/31/phishing-without-a-webpage-researcher-reveals-how-a-link-itself-can-be-malicious/
  9. "Data URLs - HTTP | MDN". MDN Web Docs. Mozilla. Retrieved 11 May 2018.