Automated Content Access Protocol ("ACAP") was proposed in 2006 as a method of providing machine-readable permissions information for content, in the hope that it would have allowed automated processes (such as search-engine web crawling) to be compliant with publishers' policies without the need for human interpretation of legal terms. ACAP was developed by organisations that claimed to represent sections of the publishing industry (World Association of Newspapers, European Publishers Council, International Publishers Association). [1] It was intended to provide support for more sophisticated online publishing business models, but was criticised for being biased towards the fears of publishers who see search and aggregation as a threat [2] rather than as a source of traffic and new readers.
In November 2007 ACAP announced that the first version of the standard was ready. No non-ACAP members, whether publishers or search engines, have adopted it so far. A Google spokesman appeared to have ruled out adoption. [3] In March 2008, Google's CEO Eric Schmidt stated that "At present it does not fit with the way our systems operate". [4] No progress has been announced since the remarks in March 2008 and Google, [5] along with Yahoo! and MSN, have since reaffirmed their commitment to the use of robots.txt and sitemaps.
In 2011 management of ACAP was turned over to the International Press Telecommunications Council and announced that ACAP 2.0 would be based on Open Digital Rights Language 2.0. [6]
In April 2007 ACAP commenced a pilot project in which the participants and technical partners undertook to specify and agree various use cases for ACAP to address. A technical workshop, attended by the participants and invited experts, has been held in London to discuss the use cases and agree next steps.
By February 2007 the pilot project was launched and participants announced.
By October 2006, ACAP had completed a feasibility stage and was formally announced [7] at the Frankfurt Book Fair on 6 October 2006. A pilot program commenced in January 2007 involving a group of major publishers and media groups working alongside search engines and other technical partners.
ACAP rules can be considered as an extension to the Robots Exclusion Standard (or "robots.txt") for communicating website access information to automated web crawlers.
It has been suggested [8] that ACAP is unnecessary, since the robots.txt protocol already exists for the purpose of managing search engine access to websites. However, others [9] support ACAP’s view [10] that robots.txt is no longer sufficient. ACAP argues that robots.txt was devised at a time when both search engines and online publishing were in their infancy and as a result is insufficiently nuanced to support today’s much more sophisticated business models of search and online publishing. ACAP aims to make it possible to express more complex permissions than the simple binary choice of “inclusion” or “exclusion”.
As an early priority, ACAP is intended to provide a practical and consensual solution to some of the rights-related issues which in some cases have led to litigation [11] [12] between publishers and search engines.
The Robots Exclusion Standard has always been implemented voluntarily by both content providers and search engines, and ACAP implementation is similarly voluntary for both parties. [13] However, Beth Noveck has expressed concern that the emphasis on communicating access permissions in legal terms will lead to lawsuits if search engines do not comply with ACAP permissions. [14]
No public search engines recognise ACAP. Only one, Exalead, ever confirmed that they will be adopting the standard, [15] but they have since ceased functioning as a search portal to focus on the software side of their business.
The project has generated considerable online debate, in the search, [16] content [17] and intellectual property [18] communities. If there are any common themes in commentary, they are
Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page's head
section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head
elements and attributes.
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites. Not all robots cooperate with the standard; email harvesters, spambots, malware and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard can be used in conjunction with Sitemaps, a robot inclusion standard for websites.
Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.
In the context of the World Wide Web, deep linking is the use of a hyperlink that links to a specific, generally searchable or indexed, piece of web content on a website, rather than the website's home page. The URL contains all the information needed to point to a particular item. Deep linking is different from mobile deep linking, which refers to directly linking to in-app content using a non-HTTP URI.
The deep web, invisible web, or hidden web are parts of the World Wide Web whose contents are not indexed by standard web search-engines. This is in contrast to the "surface web", which is accessible to anyone using the Internet. Computer-scientist Michael K. Bergman is credited with coining the term in 2001 as a search-indexing term.
In computing, a user agent is any software, acting on behalf of a user, which "retrieves, renders and facilitates end-user interaction with Web content". A user agent is therefore a special kind of software agent.
The noindex value of an HTML robots meta tag requests that automated Internet bots avoid indexing a web page. Reasons why one might want to use this meta tag include advising robots not to index a very large database, web pages that are very transitory, web pages that are under development, web pages that one wishes to keep slightly more private, or the printer and mobile-friendly versions of pages. Since the burden of honoring a website's noindex tag lies with the author of the search robot, sometimes these tags are ignored. Also the interpretation of the noindex tag is sometimes slightly different from one search engine company to the next.
A site map is a list of pages of a web site within a domain.
The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt
, a URL exclusion protocol.
nofollow is a setting on a web page hyperlink that directs search engines not to use the link for page ranking calculations. It is specified in the page as a type of link relation; that is: <a rel="nofollow" ...>
. Because search engines often calculate a site's importance according to the number of hyperlinks from other sites, the nofollow
setting allows web site authors to indicate that the presence of a link is not an endorsement of the target site's importance.
WebCite is an on-demand archive site, designed to digitally preserve scientific and educationally important material on the web by taking snapshots of Internet contents as they existed at the time when a blogger or a scholar cited or quoted from it. The preservation service enables verifiability of claims supported by the cited sources even when the original web pages are being revised, removed, or disappear for other reasons, an effect known as link rot.
Google Search Console is a web service by Google which allows webmasters to check indexing status and optimize visibility of their websites.
Bing Webmaster Tools is a free service as part of Microsoft's Bing search engine which allows webmasters to add their websites to the Bing index crawler, see their site's performance in Bing and a lot more. The service also offers tools for webmasters to troubleshoot the crawling and indexing of their website, submission of new URLs, Sitemap creation, submission and ping tools, website statistics, consolidation of content submission, and new content and community resources.
A Biositemap is a way for a biomedical research institution of organisation to show how biological information is distributed throughout their Information Technology systems and networks. This information may be shared with other organisations and researchers.
BotSeer was a Web-based information system and search tool used for research on Web robots and trends in Robot Exclusion Protocol deployment and adherence. It was created and designed by Yang Sun, Isaac G. Councill, Ziming Zhuang and C. Lee Giles. BotSeer is now inactive; the original URL was https://web.archive.org/web/20100208214818/http://botseer.ist.psu.edu/
The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, a nonprofit based in San Francisco, California. Created in 1996 and launched to the public in 2001, it allows the user to go "back in time" and see how websites looked in the past. Its founders, Brewster Kahle and Bruce Gilliat, developed the Wayback Machine to provide "universal access to all knowledge" by preserving archived copies of defunct web pages.
Blekko, trademarked as blekko (lowercase), was a company that provided a web search engine with the stated goal of providing better search results than those offered by Google Search, with results gathered from a set of 3 billion trusted webpages and excluding such sites as content farms. The company's site, launched to the public on November 1, 2010, used slashtags to provide results for common searches. Blekko also offered a downloadable search bar. It was acquired by IBM in March 2015, and the service was discontinued.
CORE is a service provided by the Knowledge Media Institute based at The Open University, United Kingdom. The goal of the project is to aggregate all open access content distributed across different systems, such as repositories and open access journals, enrich this content using text mining and data mining, and provide free access to it through a set of services. The CORE project also aims to promote open access to scholarly outputs. CORE works closely with digital libraries and institutional repositories.
Search engine cache is a cache of web pages that shows the page as it was when it was indexed by a web crawler. Cached versions of web pages can be used to view the contents of a page when the live version cannot be reached, has been altered or taken down.