CiteSeerX (identifier)
   HOME

TheInfoList



OR:

CiteSeerX (formerly called CiteSeer) is a public search engine and
digital library A digital library, also called an online library, an internet library, a digital repository, or a digital collection is an online database of digital objects that can include text, still images, audio, video, digital documents, or other digital ...
for scientific and academic papers, primarily in the fields of computer and
information science Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of informatio ...
. CiteSeer's goal is to improve the dissemination and access of academic and scientific literature. As a non-profit service that can be freely used by anyone, it has been considered as part of the open access movement that is attempting to change academic and scientific publishing to allow greater access to scientific literature. CiteSeer freely provided
Open Archives Initiative The Open Archives Initiative (OAI) was an informal organization, in the circle around the colleagues Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson and Simeon Warner, to develop and apply technical interoperability standards for archives to ...
metadata of all indexed documents and links indexed documents when possible to other sources of metadata such as
DBLP DBLP is a computer science bibliography website. Starting in 1993 at Universität Trier in Germany, it grew from a small collection of HTML files and became an organization hosting a database and logic programming bibliography site. Since Novem ...
and the ACM Portal. To promote
open data Open data is data that is openly accessible, exploitable, editable and shared by anyone for any purpose. Open data is licensed under an open license. The goals of the open data movement are similar to those of other "open(-source)" movement ...
, CiteSeerX shares its data for non-commercial purposes under a
Creative Commons license A Creative Commons (CC) license is one of several public copyright licenses that enable the free distribution of an otherwise copyrighted "work".A "work" is any creative material made by a person. A painting, a graphic, a book, a song/lyric ...
. CiteSeer is considered as a predecessor of academic search tools such as
Google Scholar Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. Released in beta in November 2004, the Google Scholar index includes ...
and
Microsoft Academic Search Microsoft Academic Search was a research project and academic search engine retired in 2012. It relaunched in 2016 as Academic. History Microsoft launched a search tool called Windows Live Academic Search in 2006 to directly compete with Google ...
. CiteSeer-like engines and archives usually only harvest documents from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index. CiteSeer changed its name to ResearchIndex at one point and then changed it back.


History


CiteSeer and CiteSeer.IST

CiteSeer was created by researchers
Lee Giles Clyde Lee Giles is an American computer scientist and the David Reese Professor at the College of Information Sciences and Technology at the Pennsylvania State University. He is also Graduate Faculty Professor of Computer Science and Engineering ...
,
Kurt Bollacker Kurt Bollacker is an American computer scientist with a research background in the areas of machine learning, digital libraries, semantic networks, and electro-cardiographic modeling. He received a Ph.D. in Computer Engineering from The Universi ...
and
Steve Lawrence Steve Lawrence (born Sidney Liebowitz; July 8, 1935) is an American singer, comedian and actor, best known as a member of a duo with his wife Eydie Gormé, billed as " Steve and Eydie", and for his performance as Maury Sline, the manager and f ...
in 1997 while they were at the
NEC Research Institute NEC Corporation of America (NECAM) is the principal subsidiary of the multinational IT company NEC in the United States. NEC Corporation of America was formed on July 1, 2006, from the combined operations of NEC America, NEC Solutions (America) ...
(now NEC Labs),
Princeton, New Jersey Princeton is a municipality with a borough form of government in Mercer County, in the U.S. state of New Jersey. It was established on January 1, 2013, through the consolidation of the Borough of Princeton and Princeton Township, both of whi ...
, USA. CiteSeer's goal was to actively crawl and harvest academic and scientific documents on the web and use autonomous
citation index A citation index is a kind of bibliographic index, an index of citations between publications, allowing the user to easily establish which later documents cite which earlier documents. A form of citation index is first found in 12th-century Hebre ...
ing to permit querying by citation or by document, ranking them by
citation impact Citation impact is a measure of how many times an academic journal article or book or author is cited by other articles, books or authors. Citation counts are interpreted as measures of the impact or influence of academic work and have given ris ...
. At one point, it was called ResearchIndex. CiteSeer became public in 1998 and had many new features unavailable in academic search engines at that time. These included: * Autonomous Citation Indexing automatically created a citation index that can be used for literature search and evaluation. * Citation statistics and related documents were computed for all articles cited in the database, not just the indexed articles. * Reference linking allowing browsing of the database using citation links. * Citation context showed the context of citations to a given paper, allowing a researcher to quickly and easily see what other researchers have to say about an article of interest. * Related documents were shown using citation and word based measures and an active and continuously updated bibliography is shown for each document. CiteSeer was granted a United States
patent A patent is a type of intellectual property that gives its owner the legal right to exclude others from making, using, or selling an invention for a limited period of time in exchange for publishing an enabling disclosure of the invention."A ...
# 6289342, titled "''Autonomous citation indexing and literature browsing using citation context''", on September 11, 2001. The patent was filed on May 20, 1998, and has priority to January 5, 1998. A continuation patent (US Patent # 6738780) was filed on May 16, 2001, and granted on May 18, 2004. After NEC, in 2004 it was hosted as CiteSeer.IST on the
World Wide Web The World Wide Web (WWW), commonly known as the Web, is an information system enabling documents and other web resources to be accessed over the Internet. Documents and downloadable media are made available to the network through web ...
at the College of Information Sciences and Technology, The Pennsylvania State University, and had over 700,000 documents. For enhanced access, performance and research, similar versions of CiteSeer were supported at universities such as the
Massachusetts Institute of Technology The Massachusetts Institute of Technology (MIT) is a private land-grant research university in Cambridge, Massachusetts. Established in 1861, MIT has played a key role in the development of modern technology and science, and is one of the ...
,
University of Zürich The University of Zürich (UZH, german: Universität Zürich) is a public research university located in the city of Zürich, Switzerland. It is the largest university in Switzerland, with its 28,000 enrolled students. It was founded in 1833 f ...
and the National University of Singapore. However, these versions of CiteSeer proved difficult to maintain and are no longer available. Because CiteSeer only indexes freely available papers on the web and does not have access to publisher metadata, it returns fewer citation counts than sites, such as
Google Scholar Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. Released in beta in November 2004, the Google Scholar index includes ...
, that have publisher metadata. CiteSeer had not been comprehensively updated since 2005 due to limitations in its architecture design. It had a representative sampling of research documents in computer and information science but was limited in coverage because it was limited to papers that are publicly available, usually at an author's homepage, or those submitted by an author. To overcome some of these limitations, a modular and open source architecture for CiteSeer was designed – CiteSeerX.


CiteSeerX

CiteSeerX replaced CiteSeer and all queries to CiteSeer were redirected. CiteSeerX is a public search engine and
digital library A digital library, also called an online library, an internet library, a digital repository, or a digital collection is an online database of digital objects that can include text, still images, audio, video, digital documents, or other digital ...
and
repository Repository may refer to: Archives and online databases * Content repository, a database with an associated set of data management tools, allowing application-independent access to the content * Disciplinary repository (or subject repository), an ...
for scientific and academic papers primarily with a focus on computer and
information science Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of informatio ...
. However, recently CiteSeerX has been expanding into other scholarly domains such as economics, physics and others. Released in 2008, it was loosely based on the previous CiteSeer search engine and digital library and is built with a new open source infrastructure, SeerSuite, and new algorithms and their implementations. It was developed by researchers Dr. Isaac Councill and Dr. C.
Lee Giles Clyde Lee Giles is an American computer scientist and the David Reese Professor at the College of Information Sciences and Technology at the Pennsylvania State University. He is also Graduate Faculty Professor of Computer Science and Engineering ...
at the College of Information Sciences and Technology, Pennsylvania State University. It continues to support the goals outlined by CiteSeer to actively crawl and harvest academic and scientific documents on the public web and to use a citation inquiry by citations and ranking of documents by the impact of citations. Currently, Lee Giles, Prasenjit Mitra, Susan Gauch, Min-Yen Kan, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Pucktada Treeratpituk, Jian Wu, Douglas Jordan, Steve Carman, Jack Carroll, Jim Jansen, and Shuyi Zheng are or have been actively involved in its development. Recently, a table search feature was introduced. It has been funded by the
National Science Foundation The National Science Foundation (NSF) is an independent agency of the United States government that supports fundamental research and education in all the non-medical fields of science and engineering. Its medical counterpart is the National ...
,
NASA The National Aeronautics and Space Administration (NASA ) is an independent agencies of the United States government, independent agency of the US federal government responsible for the civil List of government space agencies, space program ...
, and Microsoft Research. CiteSeerX continues to be rated as one of the world's top repositories and was rated number 1 in July 2010. It currently has over 6 million documents with nearly 6 million unique authors and 120 million citations. CiteSeerX also shares its software, data, databases and metadata with other researchers, currently by
Amazon S3 Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its ...
and by
rsync rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like opera ...
. Its new modular open source architecture and software (available previously on
SourceForge SourceForge is a web service that offers software consumers a centralized online location to control and manage open-source software projects and research business software. It provides source code repository hosting, bug tracking, mirroring ...
but now on
GitHub GitHub, Inc. () is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continu ...
) is built on
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features a ...
and other Apache and open source tools which allows it to be a testbed for new algorithms in document harvesting, ranking, indexing, and information extraction. CiteSeerX caches some PDF files that it has scanned. As such, each page include a
DMCA The Digital Millennium Copyright Act (DMCA) is a 1998 United States copyright law that implements two 1996 treaties of the World Intellectual Property Organization (WIPO). It criminalizes production and dissemination of technology, devices, or ...
link which can be used to report copyright violations.


Current features


Automated information extraction

CiteSeerX uses automated
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
tools, usually built on machine learning methods such ParsCit, to extract scholarly document metadata such as title, authors, abstract, citations, etc. As such, there are sometime errors in authors and titles. Other academic search engines have similar errors.


Focused crawling

CiteSeerX crawls publicly available scholarly documents primarily from author webpages and other open resources, and does not have access to publisher metadata. As such citation counts in CiteSeerX are usually less than those in Google Scholar and Microsoft Academic Search who have access to publisher metadata.


Usage

CiteSeerX has nearly 1 million users worldwide based on unique IP addresses and has millions of hits daily. Annual downloads of document PDFs was nearly 200 million for 2015.


Data

CiteSeerX data is regularly shared under a Creative Commons BY-NC-SA license with researchers worldwide and has been and is used in many experiments and competitions. Thanks to its
OAI-PMH The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI- ...
endpoint, CiteSeerX is an
open archive An open repository or open-access repository is a digital platform that holds research output and provides free, immediate and permanent access to research results for anyone to use, download and distribute. To facilitate open access such repositori ...
and its content is indexed like an
institutional repository An institutional repository is an archive for collecting, preserving, and disseminating digital copies of the intellectual output of an institution, particularly a research institution. Academics also utilize their IRs for archiving published work ...
in
academic search engines This article contains a representative list of notable databases and search engines useful in an academic setting for finding and accessing articles in academic journals, institutional repositories, archives, or other collections of scientific and ...
, for instance BASE and
Unpaywall OurResearch, formerly known as ImpactStory, is a nonprofit organization which creates and distributes tools and services for libraries, institutions and researchers. The organization follows open practices with their data (to the extent allowed by ...
consumers.


Other SeerSuite-based search engines

The CiteSeer model had been extended to cover academic documents in business with SmealSearch and in e-business with eBizSearch. However, these were not maintained by their sponsors. An older version of both of these could be once found at BizSeer.IST but is no longer in service. Other Seer-like search and repository systems have been built for chemistry, ChemXSeer and for archaeology, ArchSeer. Another had been built for robots.txt file search, BotSeer. All of these are built on the open source tool SeerSuite, which uses the open source indexer
Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as ...
.


See also

*
Arnetminer AMiner (formerly ArnetMiner) is a free online service used to index, search, and mine big scientific data. Overview AMiner (ArnetMiner) is designed to search and perform data mining operations against academic publications on the Internet, using ...
*
arXiv arXiv (pronounced "archive"—the X represents the Greek letter chi ⟨χ⟩) is an open-access repository of electronic preprints and postprints (known as e-prints) approved for posting after moderation, but not peer review. It consists of ...
* Collection of Computer Science Bibliographies *
DBLP DBLP is a computer science bibliography website. Starting in 1993 at Universität Trier in Germany, it grew from a small collection of HTML files and became an organization hosting a database and logic programming bibliography site. Since Novem ...
(Digital Bibliography & Library Project) *
Disciplinary repository A disciplinary repository (or subject repository) is an online archive containing works or data associated with these works of scholars in a particular subject area. Disciplinary repositories can accept work from scholars from any institution. A ...
*
Google Scholar Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. Released in beta in November 2004, the Google Scholar index includes ...
* List of academic databases and search engines * Microsoft Academic *
Research Papers in Economics Research Papers in Economics (RePEc) is a collaborative effort of hundreds of volunteers in many countries to enhance the dissemination of research in economics. The heart of the project is a decentralized database of working papers, preprints, ...
(RePEc) * Semantic Scholar


References


Further reading

*


External links

* {{DEFAULTSORT:Citeseer Bibliographic databases in computer science Eprint archives Internet search engines Library 2.0 Online databases Open-access archives Pennsylvania State University Scholarly search services American digital libraries