Automated Content Access Protocol


Automated Content Access Protocol ("ACAP") was proposed in 2006 as a method of providing machine-readable permissions information for content, in the hope that it would allow automated processes (such as search-engine web crawling) to comply with publishers' policies without the need for human interpretation of legal terms. ACAP was developed by organisations that claimed to represent sections of the publishing industry (World Association of Newspapers, European Publishers Council, International Publishers Association). [1] It was intended to provide support for more sophisticated online publishing business models, but was criticised for being biased towards the fears of publishers who see search and aggregation as a threat [2] rather than as a source of traffic and new readers.


Status

In November 2007 ACAP announced that the first version of the standard was ready. No non-ACAP members, whether publishers or search engines, have adopted it so far. A Google spokesman appeared to have ruled out adoption. [3] In March 2008, Google's CEO Eric Schmidt stated that "At present it does not fit with the way our systems operate". [4] No progress has been announced since the remarks in March 2008 and Google, [5] along with Yahoo! and MSN, have since reaffirmed their commitment to the use of robots.txt and sitemaps.

In 2011 management of ACAP was turned over to the International Press Telecommunications Council, which announced that ACAP 2.0 would be based on Open Digital Rights Language 2.0. [6]

Previous milestones

In April 2007 ACAP commenced a pilot project in which the participants and technical partners undertook to specify and agree various use cases for ACAP to address. A technical workshop, attended by the participants and invited experts, was held in London to discuss the use cases and agree next steps.

By February 2007 the pilot project was launched and participants announced.

By October 2006, ACAP had completed a feasibility stage and was formally announced [7] at the Frankfurt Book Fair on 6 October 2006. A pilot program commenced in January 2007 involving a group of major publishers and media groups working alongside search engines and other technical partners.

ACAP and search engines

ACAP rules can be considered as an extension to the Robots Exclusion Standard (or "robots.txt") for communicating website access information to automated web crawlers.

It has been suggested [8] that ACAP is unnecessary, since the robots.txt protocol already exists for the purpose of managing search engine access to websites. However, others [9] support ACAP’s view [10] that robots.txt is no longer sufficient. ACAP argues that robots.txt was devised at a time when both search engines and online publishing were in their infancy and as a result is insufficiently nuanced to support today’s much more sophisticated business models of search and online publishing. ACAP aims to make it possible to express more complex permissions than the simple binary choice of “inclusion” or “exclusion”.
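The "simple binary choice" that ACAP's proponents describe can be illustrated with a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content, the host publisher.example, the crawler name ExampleBot and the helper can_crawl are all hypothetical and chosen only for illustration; the sketch shows robots.txt behaviour, not ACAP syntax.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site (an assumption,
# not taken from any real site): everything is crawlable except /archive/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return the yes/no answer robots.txt gives for one crawler and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/news/today.html", "/archive/2006/story.html"):
        url = "https://publisher.example" + path
        # The only information available is a boolean: fetch or do not fetch.
        print(path, can_crawl(ROBOTS_TXT, "ExampleBot", url))
```

The answer for each URL is a single boolean, with no way to attach conditions to a permitted fetch. That is the limitation ACAP's proponents pointed to when arguing for more expressive permissions.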

As an early priority, ACAP is intended to provide a practical and consensual solution to some of the rights-related issues which in some cases have led to litigation [11] [12] between publishers and search engines.

The Robots Exclusion Standard has always been implemented voluntarily by both content providers and search engines, and ACAP implementation is similarly voluntary for both parties. [13] However, Beth Noveck has expressed concern that the emphasis on communicating access permissions in legal terms will lead to lawsuits if search engines do not comply with ACAP permissions. [14]

No public search engines recognise ACAP. Only one, Exalead, ever confirmed that it would adopt the standard, [15] but it has since ceased functioning as a search portal to focus on the software side of its business.

Comment and debate

The project has generated considerable online debate in the search, [16] content [17] and intellectual property [18] communities. If there are any common themes in the commentary, they are:

  1. that keeping the specification simple will be critical to its successful implementation, and
  2. that the aims of the project are focussed on the needs of publishers, rather than readers. Many have seen this as a flaw. [2] [19]


References

  1. ACAP FAQ: Where is the driving force behind ACAP?
  2. Douglas, Ian (3 December 2007). "Acap: a shot in the foot for publishing". The Daily Telegraph. Archived from the original on 14 November 2009. Retrieved 3 May 2012.
  3. Search Engine Watch report of Rob Jonas' comments on ACAP Archived 18 March 2008 at the Wayback Machine
  4. Corner, Stuart (18 March 2008). "ACAP content protection protocol "doesn't work" says Google CEO". iTWire. Retrieved 11 March 2018.
  5. Improving on Robots Exclusion Protocol: Official Google Webmaster Central Blog
  6. IPTC Media Release: News syndication version of ACAP ready for launch and management handed over to the IPTC Archived 15 July 2011 at the Wayback Machine
  7. Official ACAP press release announcing project launch Archived 10 June 2007 at the Wayback Machine
  8. News Publishers Want Full Control of the Search Results
  9. "Why you should care about Automated Content Access Protocol". yelvington.com. 16 October 2006. Archived from the original on 11 November 2006. Retrieved 11 March 2018.
  10. "FAQ: What about existing technology, robots.txt and why?". ACAP. Archived from the original on 8 March 2018. Retrieved 11 March 2018.
  11. "Is Google Legal?" OutLaw article about Copiepresse litigation
  12. Guardian article about Google's failed appeal in Copiepresse case
  13. Paul, Ryan (14 January 2008). "A skeptical look at the Automated Content Access Protocol". Ars Technica. Retrieved 9 January 2018.
  14. Noveck, Beth Simone (1 December 2007). "Automated Content Access Protocol". Cairns Blog. Retrieved 9 January 2018.
  15. Exalead Joins Pilot Project on Automated Content Access
  16. Search Engine Watch article Archived 27 January 2007 at the Wayback Machine
  17. Shore.com article about ACAP Archived 21 October 2006 at the Wayback Machine
  18. IP Watch article about ACAP
  19. Douglas, Ian (23 December 2007). "Acap shoots back". The Daily Telegraph. Archived from the original on 7 September 2008.
