CiteSeerX

Type of site: Bibliographic database
Owner: Pennsylvania State University College of Information Sciences and Technology
Website: citeseerx.ist.psu.edu
Registration: Optional
Launched: 2007
Current status: Active
Content license: Creative Commons BY-NC-SA license [1]

CiteSeerx (originally called CiteSeer) is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science. The technology behind CiteSeer is covered by United States patent 6289342, titled "Autonomous citation indexing and literature browsing using citation context," which was filed on May 20, 1998 (with a priority date of January 5, 1998) and granted on September 11, 2001. The inventors are Stephen R. Lawrence, C. Lee Giles, and Kurt D. Bollacker, and the patent is assigned to NEC Laboratories America, Inc. A continuation, US patent 6738780, was filed on May 16, 2001 and granted to the same inventors on May 18, 2004; it is also assigned to NEC Labs. CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search.[citation needed] CiteSeer-like engines and archives usually harvest documents only from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.

A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information which must be consulted, akin to other techniques for managing information overload.

A digital library, digital repository, or digital collection, is an online database of digital objects that can include text, still images, audio, video, or other digital media formats. Objects can consist of digitized content like print or photographs, as well as originally produced digital content like word processor files or social media posts. In addition to storing content, digital libraries provide means for organizing, searching, and retrieving the content contained in the collection.

Computer science: study of the theoretical foundations of information and computation

Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. It enables the use of algorithms to manipulate, store, and communicate digital information. A computer scientist studies the theory of computation and the practice of designing software systems.

CiteSeer's goal is to improve the dissemination of, and access to, academic and scientific literature. As a non-profit service that anyone can use freely, it has been considered part of the open access movement, which is attempting to change academic and scientific publishing to allow greater access to scientific literature. CiteSeer freely provided Open Archives Initiative metadata for all indexed documents and, where possible, linked indexed documents to other sources of metadata such as DBLP and the ACM Portal. To promote open data, CiteSeerx shares its data for non-commercial purposes under a Creative Commons license. [1]

Academic publishing is the subfield of publishing which distributes academic research and scholarship. Most academic work is published in academic journal article, book or thesis form. The part of academic written output that is not formally published but merely printed up or posted on the Internet is often called "grey literature". Most scientific and scholarly journals, and many academic and scholarly books, though not all, are based on some form of peer review or editorial refereeing to qualify texts for publication. Peer review quality and selectivity standards vary greatly from journal to journal, publisher to publisher, and field to field.

The Open Archives Initiative (OAI) is an organization to develop and apply technical interoperability standards for archives to share catalog information (metadata). It attempts to build a "low-barrier interoperability framework" for archives containing digital content. It allows people to harvest metadata. This metadata is used to provide "value-added services", often by combining different data sets.

Metadata: data about data

Metadata is "data [information] that provides information about other data". Many distinct types of metadata exist, among these descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.

The name can be read in at least two ways. As a pun, a 'sightseer' is a tourist who looks at the sights, so a 'cite seer' would be a researcher who looks at cited papers. Alternatively, a 'seer' is a prophet, so a 'cite seer' is a prophet of citations. CiteSeer changed its name to ResearchIndex at one point and then changed it back.

History

CiteSeer and CiteSeer.IST

CiteSeer was created by researchers Lee Giles, Kurt Bollacker and Steve Lawrence in 1997 while they were at the NEC Research Institute (now NEC Labs), Princeton, New Jersey, USA. CiteSeer's goal was to actively crawl and harvest academic and scientific documents on the web and use autonomous citation indexing to permit querying by citation or by document, ranking them by citation impact. At one point, it was called ResearchIndex.
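The idea of indexing by citation and ranking by citation impact can be illustrated with a small sketch. This is not CiteSeer's implementation: the function names, the document identifiers, and the use of a raw citation count as the "impact" measure are all assumptions made for illustration.

```python
from collections import defaultdict


def build_citation_index(references):
    """Invert {citing_doc: [cited_doc, ...]} into {cited_doc: {citing_doc, ...}}."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            if cited != citing:  # ignore self-references
                cited_by[cited].add(citing)
    return cited_by


def rank_by_citations(references):
    """Return (doc, citation_count) pairs, most cited first; ties sorted by doc id."""
    cited_by = build_citation_index(references)
    docs = set(references) | set(cited_by)
    counts = {doc: len(cited_by.get(doc, ())) for doc in docs}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus: each key cites the documents in its list.
references = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_a", "paper_c"],
    "paper_c": [],
}
ranking = rank_by_citations(references)
```

Here the inverted mapping answers "which documents cite this one?", and sorting by the size of each entry gives a simple citation-based ranking.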

Clyde Lee Giles is an American computer scientist and the David Reese Professor at the College of Information Sciences and Technology at the Pennsylvania State University. He is also Graduate Faculty Professor of Computer Science and Engineering, Courtesy Professor of Supply Chain and Information Systems, and Director of the Intelligent Systems Research Laboratory. He was Interim Associate Dean of Research. His graduate degrees are from the University of Michigan and the University of Arizona and his undergraduate degrees are from Rhodes College and the University of Tennessee. His PhD is in optical sciences with advisor Harrison H. Barrett. His academic genealogy includes two Nobel laureates and prominent mathematicians.

Kurt Bollacker: American computer scientist

Kurt Bollacker is an American computer scientist with a research background in the areas of machine learning, digital libraries, semantic networks, and electro-cardiographic modeling.

Steve Lawrence is an Australian computer scientist. He was among the group at NEC Research that was responsible for the creation of the search engine and digital library CiteSeer. He was an employee at Google. He is currently co-founder and CTO at Xoo.

CiteSeer became public in 1998 and had many new features unavailable in academic search engines at that time. These included:

In 2004, after leaving NEC, it was hosted as CiteSeer.IST on the World Wide Web at the College of Information Sciences and Technology, The Pennsylvania State University, and had over 700,000 documents. For enhanced access, performance and research, similar versions of CiteSeer were supported at universities such as the Massachusetts Institute of Technology, University of Zürich and the National University of Singapore. However, these versions of CiteSeer proved difficult to maintain and are no longer available. Because CiteSeer only indexes freely available papers on the web and does not have access to publisher metadata, it returns lower citation counts than sites that do have publisher metadata, such as Google Scholar.

World Wide Web: system of interlinked hypertext documents accessed over the Internet

The World Wide Web (WWW), commonly known as the Web, is an information space where documents and other web resources are identified by Uniform Resource Locators, which may be interlinked by hypertext, and are accessible over the Internet. The resources of the WWW may be accessed by users by a software application called a web browser.

Pennsylvania State University: public university with multiple campuses in Pennsylvania, United States

The Pennsylvania State University is a state-related, land-grant, doctoral university with campuses and facilities throughout Pennsylvania. Founded in 1855 as the Farmers’ High School of Pennsylvania, the university conducts teaching, research, and public service. Its instructional mission includes undergraduate, graduate, professional and continuing education offered through resident instruction and online delivery. Its University Park campus, the flagship campus, lies within the Borough of State College and College Township. It has two law schools: Penn State Law, on the school's University Park campus, and Dickinson Law, located in Carlisle, 90 miles south of State College. The College of Medicine is located in Hershey. Penn State has another 19 commonwealth campuses and 5 special mission campuses located across the state. Penn State has been labeled one of the "Public Ivies," a publicly funded university considered as providing a quality of education comparable to that of the Ivy League.

Massachusetts Institute of Technology: research university in Cambridge, Massachusetts, United States

The Massachusetts Institute of Technology (MIT) is a private research university in Cambridge, Massachusetts. Founded in 1861 in response to the increasing industrialization of the United States, MIT adopted a European polytechnic university model and stressed laboratory instruction in applied science and engineering. The institute is a land-grant, sea-grant and space-grant university, with a campus that extends more than a mile alongside the Charles River. The institute is traditionally known for its research and education in the physical sciences, engineering and architecture, but more recently in biology, economics, linguistics, management, and social science and art as well. MIT is often ranked among the world's top five universities.

CiteSeer had not been comprehensively updated since 2005 because of limitations in its architecture. It held a representative sample of research documents in computer and information science, but its coverage was restricted to papers that were publicly available, usually on an author's homepage, or that were submitted by an author. To overcome some of these limitations, a modular and open source architecture for CiteSeer was designed: CiteSeerx.

CiteSeerx

CiteSeerx replaced CiteSeer and all queries to CiteSeer were redirected. CiteSeerx is a public search engine, digital library and repository for scientific and academic papers, with a primary focus on computer and information science. [2] More recently, however, CiteSeerx has been expanding into other scholarly domains such as economics and physics. Released in 2008, it was loosely based on the previous CiteSeer search engine and digital library and is built with a new open source infrastructure, SeerSuite, and new algorithms and their implementations. It was developed by researchers Dr. Isaac Councill and Dr. C. Lee Giles at the College of Information Sciences and Technology, Pennsylvania State University. It continues to support the goals outlined by CiteSeer: to actively crawl and harvest academic and scientific documents on the public web, and to use citation indexing to permit querying by citation and ranking of documents by citation impact. Lee Giles, Prasenjit Mitra, Susan Gauch, Min-Yen Kan, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Pucktada Treeratpituk, Jian Wu, Douglas Jordan, Steve Carman, Jack Carroll, Jim Jansen, and Shuyi Zheng are or have been actively involved in its development. Recently, a table search feature was introduced. [3] It has been funded by the National Science Foundation, NASA, and Microsoft Research.

A disciplinary repository is an online archive containing works or data associated with these works of scholars in a particular subject area. Disciplinary repositories can accept work from scholars from any institution. A disciplinary repository shares the roles of collecting, disseminating, and archiving work with other repositories, but is focused on a particular subject area. These collections can include academic and research papers.

Information science: field primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval and dissemination of information

Information science is a field primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information. Practitioners within and outside the field study application and usage of knowledge in organizations along with the interaction between people, organizations, and any existing information systems with the aim of creating, replacing, improving, or understanding information systems. Historically, information science is associated with computer science, psychology, and technology. However, information science also incorporates aspects of diverse fields such as archival science, cognitive science, commerce, law, linguistics, museology, management, mathematics, philosophy, public policy, and social sciences.

Open-source software: software licensed to ensure source code usage rights

Open-source software (OSS) is a type of computer software in which source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Open-source software may be developed in a collaborative public manner. Open-source software is a prominent example of open collaboration.

CiteSeerx continues to be rated as one of the world's top repositories and was rated number 1 in July 2010. [4] It currently has over 6 million documents with nearly 6 million unique authors and 120 million citations.

CiteSeerx also shares its software, data, databases and metadata with other researchers, currently via Amazon S3 and rsync. [5] Its new modular open source architecture and software (previously available on SourceForge but now on GitHub) are built on Apache Solr and other Apache and open source tools, which allows it to serve as a testbed for new algorithms in document harvesting, ranking, indexing, and information extraction.

CiteSeerx caches some PDF files that it has scanned. As such, each page includes a DMCA link which can be used to report copyright violations. [6]

Current features

Automated information extraction

CiteSeerx uses automated information extraction tools, usually built on machine learning methods, such as ParsCit, to extract scholarly document metadata such as title, authors, abstract, and citations. As such, there are sometimes errors in authors and titles. Other academic search engines have similar errors.
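To make the extraction task and its failure modes concrete, here is a toy sketch that pulls fields out of a citation string. It deliberately swaps in a hand-written regular expression for the learned models that tools such as ParsCit use, so it is not how CiteSeerx works; the citation layout ("Authors (Year). Title. Venue."), the pattern, and the sample strings are all assumptions.

```python
import re

# Assumed layout: "Authors (Year). Title. Venue."
CITATION_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+(?P<venue>.+?)\.?$"
)


def parse_citation(text):
    """Return a dict of fields, or None if the string does not fit the assumed layout."""
    match = CITATION_PATTERN.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Split the author list on ";" or " and " and drop empty pieces.
    fields["authors"] = [
        name.strip()
        for name in re.split(r";| and ", fields["authors"])
        if name.strip()
    ]
    fields["year"] = int(fields["year"])
    return fields


# Hypothetical citation string, invented for this example.
parsed = parse_citation("Doe, J.; Roe, R. (2001). An example title. Example Journal.")
```

Even this small example shows why extracted metadata can be wrong: a title containing an abbreviation such as "Vol." would be cut at the first period, because the pattern cannot tell a sentence boundary from an abbreviation.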

Focused crawling

CiteSeerx crawls publicly available scholarly documents, primarily from author webpages and other open resources, and does not have access to publisher metadata. As such, citation counts in CiteSeerx are usually lower than those in Google Scholar and Microsoft Academic Search, which have access to publisher metadata.
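One step of a focused crawl, picking out likely document links from an author's web page, might look roughly like the sketch below. It uses only the Python standard library; the list of file extensions, the function names, and the sample page are assumptions, and a real crawler would also need fetching, politeness rules, and content-based filtering.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed file types that are likely to be scholarly documents.
DOCUMENT_EXTENSIONS = (".pdf", ".ps", ".ps.gz")


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def candidate_documents(page_url, html):
    """Return absolute URLs on the page that look like documents, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        path = urlparse(absolute).path.lower()
        if path.endswith(DOCUMENT_EXTENSIONS) and absolute not in found:
            found.append(absolute)
    return found


# Hypothetical author homepage.
page = (
    '<a href="papers/a.pdf">A</a>'
    '<a href="/cv.html">CV</a>'
    '<a href="papers/a.pdf">A again</a>'
    '<a href="http://other.example/b.PS">B</a>'
)
links = candidate_documents("http://example.edu/~author/index.html", page)
```

Filtering by extension is the simplest possible focus criterion; it is used here only to show where such a decision sits in the crawl loop.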

Usage

CiteSeerx has nearly 1 million users worldwide, based on unique IP addresses, and receives millions of hits daily. Annual downloads of document PDFs were nearly 200 million for 2015.

Data

CiteSeerx data is regularly shared under a Creative Commons BY-NC-SA license with researchers worldwide, and has been, and continues to be, used in many experiments and competitions.

Other SeerSuite-based search engines

The CiteSeer model had been extended to cover academic documents in business with SmealSearch and in e-business with eBizSearch. However, these were not maintained by their sponsors. An older version of both could once be found at BizSeer.IST, but it is no longer in service.

Other Seer-like search and repository systems have been built for chemistry (ChemXSeer) and for archaeology (ArchSeer). Another, BotSeer, had been built for robots.txt file search. All of these are built on the open source tool SeerSuite, which uses the open source indexer Lucene.

Related Research Articles

Web crawler: Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering)

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.

Open access: the availability of scientific and scholarly literature that is digital, online, free of charge, and free of most copyright and licensing restrictions

Open access (OA) is a mechanism by which research outputs are distributed online, free of cost or other barriers, and, in its most precise meaning, with the addition of an open license applied to promote reuse.

A citation index is a kind of bibliographic index, an index of citations between publications, allowing the user to easily establish which later documents cite which earlier documents. A form of citation index is first found in 12th-century Hebrew religious literature. Legal citation indexes are found in the 18th century and were made popular by citators such as Shepard's Citations (1873). In 1960, Eugene Garfield's Institute for Scientific Information (ISI) introduced the first citation index for papers published in academic journals, first the Science Citation Index (SCI), and later the Social Sciences Citation Index (SSCI) and the Arts and Humanities Citation Index (AHCI). The first automated citation indexing was done by CiteSeer in 1997. Other sources for such data include Google Scholar and Elsevier's Scopus.

Astrophysics Data System: digital library portal for researchers in astronomy and physics, operated by the Smithsonian Astrophysical Observatory

The Astrophysics Data System (ADS) is an online database of over eight million astronomy and physics papers from both peer reviewed and non-peer reviewed sources. Abstracts are available free online for almost all articles, and full scanned articles are available in Graphics Interchange Format (GIF) and Portable Document Format (PDF) for older articles. It was developed by the National Aeronautics and Space Administration (NASA), and is managed by the Harvard–Smithsonian Center for Astrophysics.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations.
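As a rough illustration of what a harvesting request looks like, the sketch below builds the URL for an OAI-PMH `ListRecords` request asking for Dublin Core (`oai_dc`) metadata. The verb and argument names come from the protocol; the helper function name and the base URL are assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token is not None:
        # When continuing a partial list, resumptionToken is sent on its own
        # with the verb; metadataPrefix is not repeated.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint.
first_page = list_records_url("http://archive.example.org/oai")
next_page = list_records_url("http://archive.example.org/oai", resumption_token="abc")
```

A harvester would fetch `first_page`, read any `resumptionToken` element in the XML response, and keep requesting follow-up pages until no token is returned.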

SmealSearch was a web portal, search engine and digital library for academic business documents that was originally hosted at the defunct eBusiness Research Center at the Pennsylvania State University. It was based on the CiteSeer digital library and search engine technology. Due to lack of support, it moved to the College of Information Sciences and Technology and became BizSeer. It was enhanced and modified by many including Lee Giles, Yang Sun, Sandip Debnath, Isaac Councill, Arvind Rangaswamy, Nirmal Pal, Yves Petinot and Pradeep Teregowda.

Citation analysis is the examination of the frequency, patterns, and graphs of citations in documents. It uses the pattern of citations, links from one document to another document, to reveal properties of the documents. A typical aim would be to identify the most important documents in a collection. A classic example is that of the citations between academic articles and books. For another example, judges of law support their judgements by referring back to judgements made in earlier cases. An additional example is provided by patents which contain prior art, citation of earlier patents relevant to the current claim.

Google Scholar: academic search service by Google

Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. Released in beta in November 2004, the Google Scholar index includes most peer-reviewed online academic journals and books, conference papers, theses and dissertations, preprints, abstracts, technical reports, and other scholarly literature, including court opinions and patents. While Google does not publish the size of Google Scholar's database, scientometric researchers estimated it to contain roughly 389 million documents, including articles, citations and patents, making it the world's largest academic search engine in January 2018. Previously, the size was estimated at 160 million documents as of May 2014. An earlier statistical estimate published in PLOS ONE, using a mark and recapture method, estimated approximately 80–90% coverage of all articles published in English, with an estimate of 100 million. This estimate also determined how many documents were freely available on the web.

An acknowledgement index or acknowledgment index is a method for indexing and analyzing acknowledgments in the scientific literature and, thus, quantifies the impact of acknowledgements. Typically, a scholarly article has a section in which the authors acknowledge entities such as funding, technical staff, colleagues, etc. that have contributed materials or knowledge or have influenced or inspired their work. Like a citation index, it measures influences on scientific work, but in a different sense; it measures institutional and economic influences as well as informal influences of individual people, ideas, and artifacts. Unlike the impact factor, it does not produce a single overall metric, but analyses the components separately. However, the total number of acknowledgements to an acknowledged entity can be measured and so can the number of citations to the papers in which the acknowledgement appears. The ratio of this total number of citations to the total number of papers in which the acknowledged entity appears can be construed as the impact of that acknowledged entity.
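The ratio described above is simple arithmetic, sketched here for one acknowledged entity. The function name and the citation counts are made up for illustration.

```python
def acknowledgement_impact(citation_counts):
    """Total citations divided by the number of papers acknowledging the entity.

    citation_counts holds one number per acknowledging paper: the citations
    that paper has received.
    """
    if not citation_counts:
        return 0.0  # entity is never acknowledged
    return sum(citation_counts) / len(citation_counts)


# Hypothetical entity acknowledged in three papers with 10, 0 and 5 citations.
impact = acknowledgement_impact([10, 0, 5])
```

With these numbers the entity has 15 citations across 3 acknowledging papers, giving an impact of 5.0.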

Grey literature consists of materials and research produced by organizations outside of the traditional commercial or academic publishing and distribution channels. Common grey literature publication types include reports, working papers, government documents, white papers and evaluations. Organizations that produce grey literature include government departments and agencies, civil society or non-governmental organisations, academic centres and departments, and private companies and consultants.

ScientificCommons is a project of the University of St. Gallen Institute for Media and Communications Management. The major aim of the project is to develop the world’s largest archive of scientific knowledge with fulltexts freely accessible to the public.

BASE (search engine): academic search engine

BASE is a multi-disciplinary search engine to scholarly internet resources, created by Bielefeld University Library in Bielefeld, Germany. It is based on free and open-source software such as Apache Solr and VuFind. It harvests OAI metadata from institutional repositories and other academic digital libraries that implement the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and then normalizes and indexes the data for searching. In addition to OAI metadata, the library indexes selected web sites and local data collections, all of which can be searched via a single search interface.

Nature Precedings was an open access electronic preprint repository of scholarly work in the fields of biomedical sciences, chemistry, and earth sciences. It ceased accepting new submissions as of April 3, 2012.

Web of Science: online subscription index of citations

Web of Science is an online subscription-based scientific citation indexing service originally produced by the Institute for Scientific Information (ISI), later maintained by Clarivate Analytics, that provides a comprehensive citation search. It gives access to multiple databases that reference cross-disciplinary research, which allows for in-depth exploration of specialized sub-fields within an academic or scientific discipline.

The ChemXSeer project, funded by the National Science Foundation, is a public integrated digital library, database, and search engine for scientific papers in chemistry. It is being developed by a multidisciplinary team of researchers at the Pennsylvania State University. ChemXSeer was conceived by Dr. Prasenjit Mitra, Dr. Lee Giles and Dr. Karl Mueller as a way to integrate the chemical scientific literature with experimental, analytical, and simulation data from different types of experimental systems. The goal of the project is to create an intelligent search and database which will provide access to relevant data to a diverse community of users who have a need for chemical information. It is hosted on the World Wide Web at the College of Information Sciences and Technology, The Pennsylvania State University.

The OpenSIGLE repository provides open access to the bibliographic records of the former SIGLE database. The creation of the OpenSIGLE archive was decided by some major European STI centres, members of the former European network EAGLE for the collection and dissemination of grey literature. OpenSIGLE was developed by the French INIST-CNRS, with assistance from the German FIZ Karlsruhe and the Dutch Grey Literature Network Service (GreyNet). OpenSIGLE is hosted on an INIST-CNRS server at Nancy. Part of the open access movement, OpenSIGLE is referenced by the international Directory of Open Access Repositories.

Data publishing is the act of releasing research data in published form for (re)use by others. It is the practice of preparing certain data or data sets for public use so that they are available to everyone to use as they wish. This practice is an integral part of the open science movement. There is a large and multidisciplinary consensus on the benefits resulting from this practice.

References

  1. "CiteSeerX Data Policy". Retrieved 2015-11-10.
  2. "About CiteSeerX". Retrieved 2010-05-07.
  3. "The CiteSeerX Team". Pennsylvania State University. Retrieved 2018-05-01. [dead link]
  4. "Ranking Web of World Repositories: Top 800 Repositories". Cybermetrics Lab. July 2010. Archived from the original on 2010-07-24. Retrieved 2010-07-24.
  5. "About CiteSeerX Data". Pennsylvania State University. Retrieved 2012-01-25.
  6. For example, "CiteSeerx – DMCA Notice". CiteSeerX 10.1.1.604.4916. The document with the identifier "10.1.1.604.4916" has been removed due to a DMCA takedown notice. If you believe the removal has been in error, please contact us through the feedback page, along with the identifier mentioned in this page.

Further reading