Clean URL

Clean URLs (also known as user-friendly URLs, pretty URLs, search-engine-friendly URLs, or RESTful URLs) are web addresses or Uniform Resource Locators (URLs) intended to improve the usability and accessibility of a website, web application, or web service by being immediately and intuitively meaningful to non-expert users. Such URL schemes tend to reflect the conceptual structure of a collection of information and decouple the user interface from a server's internal representation of information. Other reasons for using clean URLs include search engine optimization (SEO), [1] conforming to the representational state transfer (REST) style of software architecture, and ensuring that individual web resources remain consistently at the same URL. This makes the World Wide Web a more stable and useful system, and allows more durable and reliable bookmarking of web resources. [2]

Clean URLs also do not contain implementation details of the underlying web application, which reduces the difficulty of changing the implementation of the resource at a later date. For example, many URLs include the filename of a server-side script, such as example.php or example.asp, or a script directory such as cgi-bin. If the underlying implementation of a resource changes, such URLs would need to change along with it. Likewise, when URLs are not "clean", moving or restructuring the site's database can break links, both internal and from external sites, the latter of which can lead to removal from search engine listings. The use of clean URLs presents a consistent location for resources to user agents regardless of internal structure. A further potential benefit is that concealing internal server or application information can improve the security of a system. [1]

Structure

A URL will often comprise a path, script name, and query string. The query string parameters dictate the content to show on the page, and frequently include information opaque or irrelevant to users—such as internal numeric identifiers for values in a database, illegibly encoded data, session IDs, implementation details, and so on. Clean URLs, by contrast, contain only the path of a resource, in a hierarchy that reflects some logical structure that users can easily interpret and manipulate.

Original URL → Clean URL
http://example.com/about.html → http://example.com/about
http://example.com/user.php?id=1 → http://example.com/user/1
http://example.com/index.php?page=name → http://example.com/name
http://example.com/kb/index.php?cat=1&id=23 → http://example.com/kb/1/23
http://en.wikipedia.org/w/index.php?title=Clean_URL → http://en.wikipedia.org/wiki/Clean_URL
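
To make the contrast concrete, here is a minimal sketch using Python's standard urllib.parse module to take apart the second pair of example URLs from the table above:

    from urllib.parse import urlparse, parse_qs

    # Both URLs identify the same resource; they differ in where the
    # identifier lives.
    original = urlparse("http://example.com/user.php?id=1")
    clean = urlparse("http://example.com/user/1")

    # The original URL exposes a script name and carries the identifier
    # in the query string...
    print(original.path, parse_qs(original.query))  # /user.php {'id': ['1']}

    # ...while the clean URL carries it as a path segment.
    print(clean.path.split("/"))  # ['', 'user', '1']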

Implementation

The implementation of clean URLs involves URL mapping via pattern matching or transparent rewriting techniques. As this usually takes place on the server side, the clean URL is often the only form seen by the user.
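
As a sketch of that mapping step (the route table and handler names below are hypothetical, not any particular framework's API), a server-side rewriter can pattern-match the clean path and recover the parameters that a script-based URL would otherwise have carried in its query string:

    import re

    # Hypothetical route table mapping clean URL patterns to handler names.
    ROUTES = [
        (re.compile(r"^/user/(?P<id>\d+)$"), "show_user"),
        (re.compile(r"^/kb/(?P<cat>\d+)/(?P<id>\d+)$"), "show_kb_article"),
    ]

    def dispatch(path):
        # Try each pattern in order; on a match, return the handler name
        # and the parameters extracted from the path segments.
        for pattern, handler in ROUTES:
            match = pattern.match(path)
            if match:
                return handler, match.groupdict()
        return None, {}  # no route matched; the server would answer 404

    print(dispatch("/user/1"))   # ('show_user', {'id': '1'})
    print(dispatch("/kb/1/23"))  # ('show_kb_article', {'cat': '1', 'id': '23'})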

For search engine optimization purposes, web developers often take this opportunity to include relevant keywords in the URL and remove irrelevant words. Common words that are removed include articles and conjunctions, while descriptive keywords are added to increase user-friendliness and improve search engine rankings. [1]

A fragment identifier can be included at the end of a clean URL for references within a page, and need not be user-readable. [3]
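
For instance (a small sketch with Python's urllib.parse; the URL is a hypothetical example), the fragment stays separate from the clean path when the URL is parsed:

    from urllib.parse import urlparse

    # Hypothetical clean URL with a fragment identifier appended.
    parts = urlparse("http://example.com/articles/clean-url#history")
    print(parts.path)      # /articles/clean-url
    print(parts.fragment)  # history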

Slug

Some systems define a slug as the part of a URL that identifies a page in human-readable keywords. [4] [5] It is usually the end part of the URL (specifically of the path / pathinfo part), which can be interpreted as the name of the resource, similar to the basename in a filename or the title of a page. The name is based on the use of the word slug in the news media to indicate a short name given to an article for internal use.

Slugs are typically generated automatically from the page title, but they can also be entered or altered manually. While the page title remains designed for display and human readability, its slug may be optimized for brevity or for consumption by search engines, and it gives recipients of a shared bare URL a rough idea of the page's topic. Long page titles may also be truncated to keep the final URL to a reasonable length.

Slugs may be entirely lowercase, with accented characters replaced by letters from the Latin script and whitespace replaced by a hyphen or an underscore to avoid being percent-encoded. Punctuation marks are generally removed, and some systems also remove short, common words such as conjunctions. For example, the title This, That, and the Other! An Outré Collection could have a generated slug of this-that-other-outre-collection.
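
A minimal slug generator along these lines could be sketched in Python as follows; the stop-word list is an illustrative assumption, since the words removed vary from system to system:

    import re
    import unicodedata

    STOPWORDS = {"a", "an", "and", "the"}  # assumed stop list; real systems vary

    def slugify(title):
        # Replace accented characters with their closest Latin-script letters.
        normalized = unicodedata.normalize("NFKD", title)
        ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
        # Lowercase, then strip everything except letters, digits, and spaces.
        cleaned = re.sub(r"[^a-z0-9\s]", "", ascii_text.lower())
        # Drop short common words and join the rest with hyphens.
        return "-".join(w for w in cleaned.split() if w not in STOPWORDS)

    print(slugify("This, That, and the Other! An Outré Collection"))
    # this-that-other-outre-collection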

Slugs also make it easier to pick out a desired page from a long list of URLs without page titles, such as a minimal list of open tabs exported by a browser extension, and they let the reader preview the approximate title of a target page when a hyperlink carries no title of its own.

When a tool that saves web pages locally uses the string after the last slash as the default file name, as wget does, a slug makes that file name more descriptive.
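
For example (a sketch; the URL is hypothetical), such a tool would derive the file name as follows, so a slugged URL yields a readable name rather than something like index.php:

    # Hypothetical URL of a saved page.
    url = "http://example.com/blog/this-that-other-outre-collection"
    # Take the string after the last slash as the default file name,
    # as wget does when saving a page locally.
    filename = url.rstrip("/").rsplit("/", 1)[-1]
    print(filename)  # this-that-other-outre-collection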

Websites that make use of slugs include the Stack Exchange Network, which places the question title after a slash, and Instagram, which uses the ?taken-by=username URL parameter. [6] [7]

See also

  HTTP
  Meta element
  Semantic Web
  Uniform Resource Identifier
  World Wide Web
  HTTP 404
  Spamdexing
  Rewrite engine
  URL redirection
  Query string
  Permalink
  Canonicalization
  URI fragment
  Web resource
  Search engine
  HTTP referer
  Bookmark (digital)
  File URI scheme
  Duplicate content
  URL

References

  1. Opitz, Pascal (28 February 2006). "Clean URLs for better search engine ranking". Content with Style. Archived from the original on 6 January 2012. Retrieved 9 September 2010.
  2. Berners-Lee, Tim (1998). "Cool URIs don't change". Style Guide for online hypertext. W3C. Retrieved 6 March 2011.
  3. "Uniform Resource Identifier (URI): Generic Syntax". RFC 3986. Internet Engineering Task Force. Retrieved 2 May 2014.
  4. "Slug". WordPress Glossary.
  5. "Slug". Django Glossary.
  6. "Question URL slugs based on title". Meta Stack Exchange. 10 October 2011.
  7. "16 Best Instagram Tricks And Hidden Features You Must Know". Fossbytes. 4 August 2017.