Automated Content Access Protocol ("ACAP") was proposed in 2006 as a method of providing machine-readable permissions information for content, in the hope that it would have allowed automated processes (such as search-engine web crawling) to be compliant with publishers' policies without the need for human interpretation of legal terms. ACAP was developed by organisations that claimed to represent sections of the publishing industry (World Association of Newspapers, European Publishers Council, International Publishers Association). [1] It was intended to provide support for more sophisticated online publishing business models, but was criticised for being biased towards the fears of publishers who see search and aggregation as a threat [2] rather than as a source of traffic and new readers.
In November 2007 ACAP announced that the first version of the standard was ready. No non-ACAP members, whether publishers or search engines, have adopted it so far. A Google spokesman appeared to have ruled out adoption. [3] In March 2008, Google's CEO Eric Schmidt stated that "At present it does not fit with the way our systems operate". [4] No progress has been announced since the remarks in March 2008 and Google, [5] along with Yahoo! and MSN, have since reaffirmed their commitment to the use of robots.txt and sitemaps.
In 2011 management of ACAP was turned over to the International Press Telecommunications Council and announced that ACAP 2.0 would be based on Open Digital Rights Language 2.0. [6]
In April 2007 ACAP commenced a pilot project in which the participants and technical partners undertook to specify and agree various use cases for ACAP to address. A technical workshop, attended by the participants and invited experts, has been held in London to discuss the use cases and agree next steps.
By February 2007 the pilot project was launched and participants announced.
By October 2006, ACAP had completed a feasibility stage and was formally announced [7] at the Frankfurt Book Fair on 6 October 2006. A pilot program commenced in January 2007 involving a group of major publishers and media groups working alongside search engines and other technical partners.
ACAP rules can be considered as an extension to the Robots Exclusion Standard (or "robots.txt") for communicating website access information to automated web crawlers.
It has been suggested [8] that ACAP is unnecessary, since the robots.txt protocol already exists for the purpose of managing search engine access to websites. However, others [9] support ACAP’s view [10] that robots.txt is no longer sufficient. ACAP argues that robots.txt was devised at a time when both search engines and online publishing were in their infancy and as a result is insufficiently nuanced to support today’s much more sophisticated business models of search and online publishing. ACAP aims to make it possible to express more complex permissions than the simple binary choice of “inclusion” or “exclusion”.
As an early priority, ACAP is intended to provide a practical and consensual solution to some of the rights-related issues which in some cases have led to litigation [11] [12] between publishers and search engines.
The Robots Exclusion Standard has always been implemented voluntarily by both content providers and search engines, and ACAP implementation is similarly voluntary for both parties. [13] However, Beth Noveck has expressed concern that the emphasis on communicating access permissions in legal terms will lead to lawsuits if search engines do not comply with ACAP permissions. [14]
No public search engines recognise ACAP. Only one, Exalead, ever confirmed that they will be adopting the standard, [15] but they have since ceased functioning as a search portal to focus on the software side of their business.
The project has generated considerable online debate, in the search, [16] content [17] and intellectual property [18] communities. If there are any common themes in commentary, they are
Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page's head
section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head
elements and attributes.
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.
In the context of the World Wide Web, deep linking is the use of a hyperlink that links to a specific, generally searchable or indexed, piece of web content on a website, rather than the website's home page. The URL contains all the information needed to point to a particular item. Deep linking is different from mobile deep linking, which refers to directly linking to in-app content using a non-HTTP URI.
The deep web, invisible web, or hidden web are parts of the World Wide Web whose contents are not indexed by standard web search-engine programs. This is in contrast to the "surface web", which is accessible to anyone using the Internet. Computer scientist Michael K. Bergman is credited with inventing the term in 2001 as a search-indexing term.
Google News is a news aggregator service developed by Google. It presents a continuous flow of links to articles organized from thousands of publishers and magazines. Google News is available as an app on Android, iOS, and the Web.
In computing, the User-Agent header is an HTTP header intended to identify the user agent responsible for making a given HTTP request. Whereas the character sequence User-Agent
comprises the name of the header itself, the header value that a given user agent uses to identify itself is colloquially known as its user agent string. The user agent for the operator of a computer used to access the Web has encoded within the rules that govern its behavior the knowledge of how to negotiate its half of a request-response transaction; the user agent thus plays the role of the client in a client–server system. Often considered useful in networks is the ability to identify and distinguish the software facilitating a network session. For this reason, the User-Agent HTTP header exists to identify the client software to the responding server.
The noindex value of an HTML robots meta tag requests that automated Internet bots avoid indexing a web page. Reasons why one might want to use this meta tag include advising robots not to index a very large database, web pages that are very transitory, web pages that are under development, web pages that one wishes to keep slightly more private, or the printer and mobile-friendly versions of pages. Since the burden of honoring a website's noindex tag lies with the author of the search robot, sometimes these tags are ignored. Also the interpretation of the noindex tag is sometimes slightly different from one search engine company to the next.
A sitemap is a list of pages of a web site within a domain.
Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt
, a URL exclusion protocol.
nofollow is a setting on a web page hyperlink that directs search engines not to use the link for page ranking calculations. It is specified in the page as a type of link relation; that is: <a rel="nofollow" ...>
. Because search engines often calculate a site's importance according to the number of hyperlinks from other sites, the nofollow
setting allows website authors to indicate that the presence of a link is not an endorsement of the target site's importance.
A search engine is a software system that provides hyperlinks to web pages and other relevant information on the Web in response to a user's query. The user inputs a query within a web browser or a mobile app, and the search results are often a list of hyperlinks, accompanied by textual summaries and images. Users also have the option of limiting the search to a specific type of results, such as images, videos, or news.
Google Search Console is a web service by Google which allows webmasters to check indexing status, search queries, crawling errors and optimize visibility of their websites.
Bing Webmaster Tools is a free service as part of Microsoft's Bing search engine which allows webmasters to add their websites to the Bing index crawler, see their site's performance in Bing and a lot more. The service also offers tools for webmasters to troubleshoot the crawling and indexing of their website, submission of new URLs, Sitemap creation, submission and ping tools, website statistics, consolidation of content submission, and new content and community resources.
A Biositemap is a way for a biomedical research institution of organisation to show how biological information is distributed throughout their Information Technology systems and networks. This information may be shared with other organisations and researchers.
BotSeer was a Web-based information system and search tool used for research on Web robots and trends in Robot Exclusion Protocol deployment and adherence. It was created and designed by Yang Sun, Isaac G. Councill, Ziming Zhuang and C. Lee Giles. BotSeer was in operation from 2007 to 2010, approximately.
The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, an American nonprofit organization based in San Francisco, California. Created in 1996 and launched to the public in 2001, it allows the user to go "back in time" to see how websites looked in the past. Its founders, Brewster Kahle and Bruce Gilliat, developed the Wayback Machine to provide "universal access to all knowledge" by preserving archived copies of defunct web pages.
archive.today is a web archiving website founded in 2012 that saves snapshots on demand, and has support for JavaScript-heavy sites such as Google Maps, and Twitter. archive.today records two snapshots: one replicates the original webpage including any functional live links; the other is a screenshot of the page.
Perplexity AI is an AI-powered research and conversational search engine that answers queries using natural language predictive text. Launched in 2022, Perplexity generates answers using sources from the web and cites links within the text response. Perplexity works on a freemium model; the free product uses the company's standalone large language model (LLM) that incorporates natural language processing (NLP) capabilities, while the paid version Perplexity Pro has access to GPT-4, Claude 3.5, Mistral Large, Llama 3 and an Experimental Perplexity Model. In Q1 2024, it had reached 15 million monthly users.