File URI scheme

Last updated

In programming, a file uniform resource identifier (URI) scheme is a specific format of URI, used to specifically identify a file on a host computer. While URIs can be used to identify anything, there is specific syntax associated with identifying files. [1] [2]

Contents

Format

A file URI has the format

file://host/path

where host is the fully qualified domain name of the system on which the path is accessible, and path is a hierarchical directory path of the form directory/directory/.../name. If host is omitted, it is taken to be "localhost", the machine from which the URL is being interpreted. Note that when omitting host, the slash is not omitted (while "file:///piro.txt" is valid, "file://simpen.txt" is not, although some interpreters manage to handle the latter).

RFC 3986 includes additional information about the treatment of ".." and "." segments in URIs.

Number of slash characters

There are two ways that Windows UNC filenames (such as \\server\folder\data.xml) can be represented. These are both described in RFC 8089, Appendix E as "non-standard". The first way (called here the 2-slash format) is to represent the server name using the Authority part of the URI, which then becomes file://server/folder/data.xml. The second way (called here the 4-slash format) is to represent the server name as part of the Path component, so the URI becomes file:////server/folder/data.xml. Both forms are actively used. Microsoft .NET (for example, the method new Uri(path)) generally uses the 2-slash form; Java (for example, the method new URI(path)) generally uses the 4-slash form. Either form allows the most common operations on URIs (resolving relative URIs, and dereferencing to obtain a connection to the remote file) to be used successfully. However, because these URIs are non-standard, some less common operations fail: an example is the normalize operation (defined in RFC 3986 and implemented in the Java java.net.URI.normalize() method) which reduces file:////server/folder/data.xml to the unusable form file:/server/folder/data.xml. [6]

Examples

Unix

Here are two Unix examples pointing to the same /etc/fstab file:

file://localhost/etc/fstab file:///etc/fstab

The KDE environment uses URIs without an authority field:

file:/etc/fstab

Windows

Here are some examples which may be accepted by some applications on Windows systems, referring to the same, local file c:\WINDOWS\clock.avi

file://localhost/c:/WINDOWS/clock.avi file:///c:/WINDOWS/clock.avi

Here is the URI as understood by the Windows Shell API: [7]

file:///c:/WINDOWS/clock.avi

Note that the drive letter followed by a colon and slash is part of the acceptable file URI.

Implementations

Windows

On Microsoft Windows systems, the normal colon (:) after a device letter has sometimes been replaced by a vertical bar (|) in file URLs. This reflected the original URL syntax, which made the colon a reserved character in a path part.

Since Internet Explorer 4, file URIs have been standardized on Windows, and should follow the following scheme. This applies to all applications which use URLMON or SHLWAPI for parsing, fetching or binding to URIs. To convert a path to a URL, use UrlCreateFromPath, and to convert a URL to a path, use PathCreateFromUrl. [7]

To access a file "the file.txt", the following might be used.

For a network location:

file://hostname/path/to/the%20file.txt

Or for a local file, the hostname is omitted, but the slash is not (note the third slash):

file:///c:/path/to/the%20file.txt

This is not the same as providing the string "localhost" or the dot "." in place of the hostname. The string "localhost" will attempt to access the file as UNC path \\localhost\c:\path\to\the file.txt, which will not work since the colon is not allowed in a share name. The dot "." results in the string being passed as \\.\c:\path\to\the file.txt, which will work for local files, but not shares on the local system. For example file://./sharename/path/to/the%20file.txt will not work, because it will result in sharename being interpreted as part of the DOSDEVICES namespace, not as a network share.

The following outline roughly describes the requirements.

Use the provided functions if possible. If you must create a URL programmatically and cannot access SHLWAPI.dll (for example from script, or another programming environment where the equivalent functions are not available) the above outline will help.

Legacy URLs

To aid the installed base of legacy applications on Win32 PathCreateFromUrl recognizes certain URLs which do not meet these criteria, and treats them uniformly. These are called "legacy" file URLs as opposed to "healthy" file URLs. [8]


In the past, a variety of other applications have used other systems. Some added an additional two slashes. For example, UNC path \\remotehost\share\dir\file.txt would become file:////remotehost/share/dir/file.txt instead of the "healthy" file://remotehost/share/dir/file.txt.

Web pages

File URLs are rarely used in Web pages on the public Internet, since they imply that a file exists on the designated host. The host specifier can be used to retrieve a file from an external source, although no specific file-retrieval protocol is specified; and using it should result in a message that informs the user that no mechanism to access that machine is available.

Related Research Articles

A Uniform Resource Identifier (URI) is a unique sequence of characters that identifies an abstract or physical resource, such as resources on a webpage, mail address, phone number, books, real-world objects such as people and places, concepts. URIs are used to identify anything described using the Resource Description Framework (RDF), for example, concepts that are part of an ontology defined using the Web Ontology Language (OWL), and people who are described using the Friend of a Friend vocabulary would each have an individual URI.

An 8.3 filename is one that obeys the filename convention used by old versions of DOS and versions of Microsoft Windows prior to Windows 95 and Windows NT 3.5. It is also used in modern Microsoft operating systems as an alternate filename to the long filename, to provide compatibility with legacy programs. The filename convention is limited by the FAT file system. Similar 8.3 file naming schemes have also existed on earlier CP/M, TRS-80, Atari, and some Data General and Digital Equipment Corporation minicomputer operating systems.

DICT is a dictionary network protocol created by the DICT Development Group in 1997, described by RFC 2229. Its goal is to surpass the Webster protocol to allow clients to access a variety of dictionaries via a uniform interface.

<span class="mw-page-title-main">Filename</span> Text string used to uniquely identify a computer file

A filename or file name is a name used to uniquely identify a computer file in a file system. Different file systems impose different restrictions on filename lengths.

A path is a string of characters used to uniquely identify a location in a directory structure. It is composed by following the directory tree hierarchy in which components, separated by a delimiting character, represent each directory. The delimiting character is most commonly the slash ("/"), the backslash character ("\"), or colon (":"), though some operating systems may use a different delimiter. Paths are used extensively in computer science to represent the directory/file relationships common in modern operating systems and are essential in the construction of Uniform Resource Locators (URLs). Resources can be represented by either absolute or relative paths.

In computer networking, localhost is a hostname that refers to the current computer used to access it. The name localhost is reserved for loopback purposes. It is used to access the network services that are running on the host via the loopback network interface. Using the loopback interface bypasses any local network interface hardware.

The computer file hosts is an operating system file that maps hostnames to IP addresses. It is a plain text file. Originally a file named HOSTS.TXT was manually maintained and made available via file sharing by Stanford Research Institute for the ARPANET membership, containing the hostnames and address of hosts as contributed for inclusion by member organizations. The Domain Name System, first described in 1983 and implemented in 1984, automated the publication process and provided instantaneous and dynamic hostname resolution in the rapidly growing network. In modern operating systems, the hosts file remains an alternative name resolution mechanism, configurable often as part of facilities such as the Name Service Switch as either the primary method or as a fallback method.

A query string is a part of a uniform resource locator (URL) that assigns values to specified parameters. A query string commonly includes fields added to a base URL by a Web browser or other client application, for example as part of an HTML document, choosing the appearance of a page, or jumping to positions in multimedia content.

In computer science, canonicalization is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

The data URI scheme is a uniform resource identifier (URI) scheme that provides a way to include data in-line in Web pages as if they were external resources. It is a form of file literal or here document. This technique allows normally separate elements such as images and style sheets to be fetched in a single Hypertext Transfer Protocol (HTTP) request, which may be more efficient than multiple HTTP requests, and used by several browser extensions to package images as well as other multimedia content in a single HTML file for page saving. As of 2024, data URIs are fully supported by all major browsers.

In computer hypertext, a URI fragment is a string of characters that refers to a resource that is subordinate to another, primary resource. The primary resource is identified by a Uniform Resource Identifier (URI), and the fragment identifier points to the subordinate resource.

URL encoding, officially known as percent-encoding, is a method to encode arbitrary data in a uniform resource identifier (URI) using only the US-ASCII characters legal within a URI. Although it is known as URL encoding, it is also used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). Consequently, it is also used in the preparation of data of the application/x-www-form-urlencoded media type, as is often used in the submission of HTML form data in HTTP requests.

A directory traversal attack exploits insufficient security validation or sanitization of user-supplied file names, such that characters representing "traverse to parent directory" are passed through to the operating system's file system API. An affected application can be exploited to gain unauthorized access to the file system.

<span class="mw-page-title-main">URI normalization</span> Process by which URIs are standardized

URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.

Clean URLs are web addresses or Uniform Resource Locator (URLs) intended to improve the usability and accessibility of a website, web application, or web service by being immediately and intuitively meaningful to non-expert users. Such URL schemes tend to reflect the conceptual structure of a collection of information and decouple the user interface from a server's internal representation of information. Other reasons for using clean URLs include search engine optimization (SEO), conforming to the representational state transfer (REST) style of software architecture, and ensuring that individual web resources remain consistently at the same URL. This makes the World Wide Web a more stable and useful system, and allows more durable and reliable bookmarking of web resources.

The Open Packaging Conventions (OPC) is a container-file technology initially created by Microsoft to store a combination of XML and non-XML files that together form a single entity such as an Open XML Paper Specification (OpenXPS) document. OPC-based file formats combine the advantages of leaving the independent file entities embedded in the document intact and resulting in much smaller files compared to normal use of XML.

<span class="mw-page-title-main">HTTP location</span> Instruction by web server containing the intended location of a web page.

The HTTP Location header field is returned in responses from an HTTP server under two circumstances:

  1. To ask a web browser to load a different web page. In this circumstance, the Location header should be sent with an HTTP status code of 3xx. It is passed as part of the response by a web server when the requested URI has:
  2. To provide information about the location of a newly created resource. In this circumstance, the Location header should be sent with an HTTP status code of 201 or 202.

BagIt is a set of hierarchical file system conventions designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" and "tags," which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. The name, BagIt, is inspired by the "enclose and deposit" method, sometimes referred to as "bag it and tag it."

Ksar is a BSD-licensed Java-based application that creates graphs of all parameters from data collected by Unix sar utilities. Usually, Unix sar is part of Unix' Sysstat package, and runs sa1, sa2, and sadc through cron to created data files in /var/log/sa/saNN. Characteristics include:

A uniform resource locator (URL), colloquially known as an address on the Web, is a reference to a resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI), although many people use the two terms interchangeably. URLs occur most commonly to reference web pages (HTTP/HTTPS) but are also used for file transfer (FTP), email (mailto), database access (JDBC), and many other applications.

References

  1. Kerwin, Matthew (February 2017). The "file" URI Scheme (Report). Internet Engineering Task Force.
  2. "What is a Uniform Resource Identifier (URI)?". WhatIs.com. Retrieved 2023-09-12.
  3. RFC 8089, Section 2
  4. RFC 3986, Section 3.2.2
  5. RFC 3986, Section 3.3
  6. RFC 8089, Appendix E
  7. 1 2 Risney, Dave (2006). "File URIs in Windows". IEBlog. Microsoft Corporation. Retrieved 2020-10-02.
  8. The Bizarre and Unhappy Story of 'file:' URLs - Free Associations - Site Home - MSDN Blogs. Blogs.msdn.com (2005-05-19). Retrieved on 2014-03-08.