BotSeer


BotSeer was a Web-based information system and search tool used for research on Web robots and trends in Robot Exclusion Protocol deployment and adherence. It was created and designed by Yang Sun, [1] Isaac G. Councill, [2] Ziming Zhuang [3] and C. Lee Giles. BotSeer is now inactive; an archived copy of the original site (http://botseer.ist.psu.edu/) is available at https://web.archive.org/web/20100208214818/http://botseer.ist.psu.edu/.


History

BotSeer served as a resource for studying the regulation and behavior of Web robots, as well as a source of information about writing effective robots.txt files and crawler implementations. It was publicly available on the World Wide Web and was hosted by the College of Information Sciences and Technology at the Pennsylvania State University.

BotSeer provided three major services: robots.txt search, robot bias analysis, [4] [5] and robot-generated log analysis. The prototype of BotSeer also allowed users to search six thousand documentation and source code files from 18 open-source crawler projects.
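As an illustration of the robot bias idea (not BotSeer's actual method), the following Python sketch counts Allow and Disallow rules per user-agent in a robots.txt file and compares a named crawler against the default "*" rules; the parsing is simplified and the scoring function is a made-up stand-in.

```python
# Illustrative sketch: estimate whether a robots.txt file treats a named crawler
# more restrictively than the default ("*") rules. This is NOT BotSeer's actual
# metric, just a simplified stand-in for the idea of robot bias analysis.

from collections import defaultdict

def rule_counts(robots_txt: str) -> dict:
    """Count Allow/Disallow rules per user-agent group in a robots.txt body."""
    counts = defaultdict(lambda: {"allow": 0, "disallow": 0})
    current_agents, last_was_rule = [], False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()          # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if last_was_rule:                         # a rule ended the previous group
                current_agents = []
            current_agents.append(value.lower())
            last_was_rule = False
        elif field in ("allow", "disallow") and current_agents:
            for agent in current_agents:
                counts[agent][field] += 1
            last_was_rule = True
    return dict(counts)

def bias_score(robots_txt: str, agent: str) -> int:
    """Positive: the agent is disallowed more than the default; zero or negative: not."""
    counts = rule_counts(robots_txt)
    default = counts.get("*", {"allow": 0, "disallow": 0})
    specific = counts.get(agent.lower(), default)
    return specific["disallow"] - default["disallow"]

example = """
User-agent: *
Disallow: /private/

User-agent: badbot
Disallow: /
Disallow: /images/
"""
print(bias_score(example, "badbot"))   # 1 -> treated more restrictively than "*"
print(bias_score(example, "goodbot"))  # 0 -> falls back to the default rules
```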

BotSeer had indexed and analyzed 2.2 million robots.txt files obtained from 13.2 million websites, as well as a large Web server log of real-world robot behavior and related analysis. BotSeer's goals were to assist researchers, webmasters, web crawler developers and others with robot-related research and information needs. However, BotSeer was received negatively by some, who argued that it contradicted the purpose of the robots.txt convention. [6]

BotSeer also set up a honeypot [7] to test the ethics, performance and behavior of web crawlers.
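The honeypot approach can be sketched as follows: a path is linked from the site but disallowed in robots.txt, and the server log is then scanned for crawlers that requested it anyway. The trap path, log format (Apache/Nginx "combined" style) and sample entries below are assumptions for illustration, not BotSeer's actual setup.

```python
# Sketch of the honeypot idea: a directory is disallowed in robots.txt but still
# linked from the site; any crawler that requests it ignored the exclusion rules.
# The trap path, log format and sample lines are assumptions for illustration.

import re

TRAP_PATH = "/trap/do-not-crawl/"   # hypothetical path listed under Disallow

# Minimal matcher for Apache/Nginx "combined" log lines.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def offending_agents(log_lines):
    """Return the user-agent strings that fetched the disallowed trap path."""
    offenders = set()
    for line in log_lines:
        match = LOG_RE.match(line)
        if match and match.group(2).startswith(TRAP_PATH):
            offenders.add(match.group(3))   # the User-Agent field
    return offenders

sample_log = [
    '203.0.113.7 - - [22/Dec/2008:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "PoliteBot/1.0"',
    '198.51.100.9 - - [22/Dec/2008:10:00:05 +0000] "GET /trap/do-not-crawl/page.html HTTP/1.1" 200 128 "-" "RudeBot/0.9"',
]
print(offending_agents(sample_log))   # {'RudeBot/0.9'}
```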

Related Research Articles

Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page's head section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head elements and attributes.
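For illustration, a short Python example (standard library only, made-up sample page) that extracts the name/content pairs of a page's meta elements:

```python
# Collect the name/content pairs of meta elements using Python's standard
# HTML parser; the sample document is invented for this example.

from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name")
            if name:
                self.meta[name.lower()] = attrs.get("content", "")

page = """<html><head>
  <meta name="description" content="A study of web robot behavior">
  <meta name="keywords" content="robots.txt, crawler, BotSeer">
  <meta name="robots" content="noindex, nofollow">
</head><body>...</body></html>"""

collector = MetaCollector()
collector.feed(page)
print(collector.meta["description"])   # A study of web robot behavior
print(collector.meta["robots"])        # noindex, nofollow
```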

Web crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.
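A minimal crawler can be sketched in a few lines: fetch a page, extract its links, and enqueue unseen URLs. The Python sketch below uses only the standard library; the seed URL is a placeholder, and a production crawler would additionally honor robots.txt, rate-limit itself, and send a descriptive User-Agent.

```python
# Minimal breadth-first crawler sketch (standard library only). The seed URL is
# a placeholder; real crawlers also check robots.txt and throttle requests.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed, max_pages=10):
    seen, queue, fetched = {seed}, deque([seed]), []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                        # skip unreachable or non-HTTP errors
        fetched.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))   # resolve and drop #fragments
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return fetched

# print(crawl("https://example.com/"))   # placeholder seed URL
```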

The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites. Not all robots cooperate with the standard; email harvesters, spambots, malware and robots that scan for security vulnerabilities may even start with the portions of the website that they have been told to stay out of. The standard can be used in conjunction with Sitemaps, a robot inclusion standard for websites.
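Python's standard library includes a parser for this standard; the sketch below checks whether a given user agent may fetch a URL before requesting it (the site and agent name are placeholders).

```python
# Check a robots.txt file before fetching, using the standard library's
# RobotFileParser; the site URL and user-agent name are placeholders.

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()                                   # fetches and parses the file

if parser.can_fetch("MyResearchBot/1.0", "https://example.com/private/data.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```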

CiteSeerx is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science. CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search. CiteSeer-like engines and archives usually harvest documents only from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.

Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.

Googlebot

Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. This name is actually used to refer to two different types of web crawlers: a desktop crawler and a mobile crawler.

In computing, a user agent is any software, acting on behalf of a user, which "retrieves, renders and facilitates end-user interaction with Web content". A user agent is therefore a special kind of software agent.
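For example, a client can identify itself by sending an explicit User-Agent header; in the Python sketch below the agent string and URL are placeholders.

```python
# Send an explicit User-Agent header so the server (and its logs) can identify
# the client; the agent string and URL are placeholders.

from urllib.request import Request, urlopen

request = Request(
    "https://example.com/",
    headers={"User-Agent": "MyResearchBot/1.0 (+https://example.com/bot-info)"},
)
with urlopen(request, timeout=10) as response:
    print(response.status, response.headers.get("Content-Type"))
```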

ALIWEB is considered the first Web search engine, as its predecessors were either built with different purposes or were only indexers.

The noindex value of an HTML robots meta tag requests that automated Internet bots avoid indexing a web page. Reasons why one might want to use this meta tag include advising robots not to index a very large database, web pages that are very transitory, web pages that are under development, web pages that one wishes to keep slightly more private, or the printer and mobile-friendly versions of pages. Since the burden of honoring a website's noindex tag lies with the author of the search robot, sometimes these tags are ignored. Also the interpretation of the noindex tag is sometimes slightly different from one search engine company to the next.
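A crawler-side check for this directive might look like the following simplified Python sketch (regex-based parsing, illustrative only): if the robots meta tag contains noindex (or none), the page is not added to the index.

```python
# Simplified check of the robots meta tag before indexing a page. Regex-based
# parsing is used only for brevity; a real crawler would use a proper HTML parser.

import re

def allows_indexing(html_text: str) -> bool:
    match = re.search(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        html_text, re.IGNORECASE)
    if not match:
        return True                        # no robots meta tag: index by default
    directives = {d.strip().lower() for d in match.group(1).split(",")}
    return "noindex" not in directives and "none" not in directives

print(allows_indexing('<meta name="robots" content="noindex, follow">'))  # False
print(allows_indexing('<html><head></head></html>'))                      # True
```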

Clyde Lee Giles is an American computer scientist and the David Reese Professor at the College of Information Sciences and Technology at the Pennsylvania State University. He is also Graduate Faculty Professor of Computer Science and Engineering, Courtesy Professor of Supply Chain and Information Systems, and Director of the Intelligent Systems Research Laboratory. He was Interim Associate Dean of Research. His graduate degrees are from the University of Michigan and the University of Arizona and his undergraduate degrees are from Rhodes College and the University of Tennessee. His PhD is in optical sciences with advisor Harrison H. Barrett. His academic genealogy includes two Nobel laureates and prominent mathematicians.

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
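As an example of the format, the following Python snippet parses a small, made-up sitemap with the standard library's XML module and reads each URL's location, last-modification date and priority:

```python
# Read URL entries from a sitemap with the standard library's XML parser;
# the sitemap content below is a made-up example.

import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2009-11-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    priority = url.findtext("sm:priority", default="0.5", namespaces=NS)
    print(loc, lastmod, priority)   # https://example.com/ 2009-11-07 0.8
```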

A spider trap is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally by calendars that use dynamic pages with links that continually point to the next day or year.
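Two common, simple defenses against such traps are capping crawl depth and capping the number of pages fetched per host; the Python sketch below illustrates both, with arbitrary threshold values chosen only for demonstration.

```python
# Two simple crawler defenses against spider traps: cap the crawl depth and cap
# the number of pages fetched per host. Thresholds are illustrative only.

from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 8
MAX_PAGES_PER_HOST = 500

pages_per_host = Counter()

def should_fetch(url: str, depth: int) -> bool:
    host = urlparse(url).netloc
    if depth > MAX_DEPTH:
        return False                       # e.g. an endless calendar of "next day" links
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False                       # one site is absorbing too much of the crawl
    pages_per_host[host] += 1
    return True

print(should_fetch("https://example.com/calendar/2009/01/01", depth=3))   # True
print(should_fetch("https://example.com/calendar/2099/12/31", depth=42))  # False
```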

nofollow is a setting on a web page hyperlink that directs search engines not to use the link for page ranking calculations. It is specified in the page as a type of link relation; that is: <a rel="nofollow" ...>. Because search engines often calculate a site's importance according to the number of hyperlinks from other sites, the nofollow setting allows web site authors to indicate that the presence of a link is not an endorsement of the target site's importance.
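The Python sketch below illustrates the crawler-side view: links are extracted from made-up markup, and those carrying rel="nofollow" are kept separate so a ranking step could ignore them.

```python
# Separate links marked rel="nofollow" from the ones a ranking pipeline would
# count; the sample markup is invented for this example.

from html.parser import HTMLParser

class RankableLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counted, self.ignored = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        rel_values = (attrs.get("rel") or "").lower().split()
        (self.ignored if "nofollow" in rel_values else self.counted).append(href)

parser = RankableLinks()
parser.feed('<a href="/about">About</a> <a rel="nofollow" href="https://ads.example/">Ad</a>')
print(parser.counted)   # ['/about']
print(parser.ignored)   # ['https://ads.example/']
```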

Search engine

A search engine is a software system that is designed to carry out web searches. It searches the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs). The information may be a mix of links to web pages, images, videos, infographics, articles, research papers, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler. Internet content that is not capable of being searched by a web search engine is generally described as the deep web.

A search engine is an information retrieval software program that discovers, crawls, transforms and stores information for retrieval and presentation in response to user queries.
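The "transform and store" step can be illustrated with a toy inverted index in Python, mapping each term to the documents that contain it and answering a conjunctive (AND) query; the documents are made up.

```python
# Toy inverted index: map terms to the documents that contain them, then answer
# a conjunctive (AND) query. The documents are invented for this example.

from collections import defaultdict

documents = {
    1: "web robots read the robots exclusion protocol",
    2: "search engines send crawlers across the web",
    3: "the sitemaps protocol lists urls for crawlers",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> set:
    """Return the documents containing every query term."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("crawlers protocol"))   # {3}
print(search("web"))                 # {1, 2}
```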

Automated Content Access Protocol ("ACAP") was proposed in 2006 as a method of providing machine-readable permissions information for content, in the hope that it would have allowed automated processes to be compliant with publishers' policies without the need for human interpretation of legal terms. ACAP was developed by organisations that claimed to represent sections of the publishing industry. It was intended to provide support for more sophisticated online publishing business models, but was criticised for being biased towards the fears of publishers who see search and aggregation as a threat rather than as a source of traffic and new readers.

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.
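Frontier prioritization can be sketched as a priority queue ordered by an estimated relevance score. In the Python example below, the score is a crude keyword heuristic standing in for a real topic classifier, and the URLs and topic terms are invented.

```python
# Focused-crawler frontier sketch: order URLs by an estimated topic relevance so
# on-topic pages are fetched first. The scoring heuristic is illustrative only.

import heapq

TOPIC_TERMS = {"solar", "photovoltaic", "renewable", "energy"}

def relevance(url: str, anchor_text: str) -> float:
    """Fraction of topic terms appearing in the URL or the link's anchor text."""
    text = (url + " " + anchor_text).lower()
    return sum(term in text for term in TOPIC_TERMS) / len(TOPIC_TERMS)

frontier = []   # max-priority queue via negated scores (heapq is a min-heap)

def push(url, anchor_text):
    heapq.heappush(frontier, (-relevance(url, anchor_text), url))

def pop():
    score, url = heapq.heappop(frontier)
    return url, -score

push("https://example.com/solar-energy-basics", "solar energy primer")
push("https://example.com/celebrity-gossip", "latest gossip")
print(pop())   # ('https://example.com/solar-energy-basics', 0.5)
```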

Wayback Machine

The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, a nonprofit based in San Francisco, California. Created in 1996 and launched to the public in 2001, it allows the user to go "back in time" and see how websites looked in the past. Its founders, Brewster Kahle and Bruce Gilliat, developed the Wayback Machine to provide "universal access to all knowledge" by preserving archived copies of defunct web pages.

Bingbot is a web-crawling robot deployed by Microsoft in October 2010 to supply Bing. It collects documents from the web to build a searchable index for Bing. It performs the same function as Google's Googlebot.

Search engine cache

Search engine cache is a cache of web pages that shows the page as it was when it was indexed by a web crawler. Cached versions of web pages can be used to view the contents of a page when the live version cannot be reached, has been altered or taken down.
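A toy version of this idea in Python: keep the body and fetch time of each page, and fall back to the cached copy when the live page cannot be reached (URLs and cache policy are illustrative only).

```python
# Toy page cache: store the body and fetch time of each page, and serve the
# cached copy when the live page cannot be reached. Illustrative only.

import time
from urllib.request import urlopen

cache = {}   # url -> (timestamp, html)

def fetch(url: str) -> str:
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        cache[url] = (time.time(), html)
        return html
    except OSError:
        if url in cache:
            cached_at, html = cache[url]
            print(f"serving copy cached at {time.ctime(cached_at)}")
            return html
        raise

# fetch("https://example.com/")   # placeholder URL
```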

References

Notes

  1. "Yang Sun". Archived from the original on 2014-01-04. Retrieved 2019-06-13.
  2. Isaac G. Councill Archived May 17, 2014, at the Wayback Machine
  3. Ziming Zhuang Archived December 28, 2007, at the Wayback Machine
  4. Yang Sun, Z. Zhuang, I. Councill, C.L. Giles, Determining Bias to Search Engines from Robots.txt, Archived 2015-04-02 at the Wayback Machine, Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), 149-155, 2007.
  5. http://www.zoomwebmedia.com/search-engine-optimization.php [dead link]
  6. "BotSeer?" - SEO Best Practices, Search Engine Forums.
  7. Web Robot Behavior and Performance Test, archived December 22, 2008, at the Wayback Machine.
