CiteSeer
CiteSeerX (formerly called CiteSeer) is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science.
CiteSeer's goal is to improve the dissemination of and access to academic and scientific literature. As a non-profit service that can be freely used by anyone, it has been considered part of the open access movement, which is attempting to change academic and scientific publishing to allow greater access to scientific literature. CiteSeer freely provided Open Archives Initiative metadata for all indexed documents and, when possible, linked indexed documents to other sources of metadata such as DBLP and the ACM Portal. To promote open data, CiteSeerX shares its data for non-commercial purposes under a Creative Commons license.
CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search. CiteSeer-like engines and archives usually only harvest documents from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.
CiteSeer changed its name to ResearchIndex at one point and then changed it back.
History
CiteSeer and CiteSeer.IST
CiteSeer was created by researchers Lee Giles, Kurt Bollacker and Steve Lawrence in 1997 while they were at the NEC Research Institute (now NEC Labs), Princeton, New Jersey, US. CiteSeer's goal was to actively crawl and harvest academic and scientific documents on the web and to use autonomous citation indexing to permit querying by citation or by document, ranking them by citation impact. At one point, it was called ResearchIndex.
CiteSeer became public in 1998 and had many new features unavailable in academic search engines at that time. These included:
* Autonomous Citation Indexing automatically created a citation index that can be used for literature search and evaluation.
* Citation statistics and related documents were computed for all articles cited in the database, not just the indexed articles.
* Reference linking, allowing browsing of the database using citation links.
* Citation context showed the context of citations to a given paper, allowing a researcher to quickly and easily see what other researchers have to say about an article of interest.
* Related documents were shown using citation and word-based measures, and an active, continuously updated bibliography was shown for each document.
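The features listed above can be illustrated with a minimal sketch of a citation index. This is not CiteSeer's actual implementation: the `Document` class, the `normalize` matching rule, and the sample data are all hypothetical, and real systems match citation strings far more robustly. The sketch only shows the idea that citation counts and citation contexts can be computed for every cited work, including works that are not themselves in the collection.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record for an indexed paper."""
    doc_id: str
    title: str
    # Each reference is a (cited title, sentence containing the citation) pair.
    references: list = field(default_factory=list)


def normalize(title):
    """Crude matching key (an assumption): lowercase, alphanumerics only."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def build_citation_index(documents):
    """Map each cited work to the (citing doc_id, context) pairs that cite it."""
    index = defaultdict(list)
    for doc in documents:
        for cited_title, context in doc.references:
            index[normalize(cited_title)].append((doc.doc_id, context))
    return index


def citation_count(index, title):
    """Number of citations found for a title, whether or not it is indexed."""
    return len(index.get(normalize(title), []))


def citing_contexts(index, title):
    """Citing documents and the sentences in which they cite the title."""
    return list(index.get(normalize(title), []))


# Made-up sample data, for illustration only.
docs = [
    Document("d1", "Indexing the Web", [
        ("Digital Libraries: A Survey", "As surveyed in [1], coverage varies."),
        ("An Unindexed Book", "See [2] for background."),
    ]),
    Document("d2", "Crawling Papers", [
        ("digital libraries - a survey", "We follow [3]."),
        ("Indexing the Web", "Unlike [2], we crawl only the public web."),
    ]),
]
index = build_citation_index(docs)
print(citation_count(index, "Digital Libraries: A Survey"))
print(citing_contexts(index, "Indexing the Web"))
```

Because the index is keyed on the cited work rather than on the citing document, "An Unindexed Book" receives a citation count even though no copy of it is in the collection.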
CiteSeer was granted United States patent # 6289342, titled "Autonomous citation indexing and literature browsing using citation context", on September 11, 2001. The patent was filed on May 20, 1998, and has priority to January 5, 1998. A continuation patent (US Patent # 6738780) was filed on May 16, 2001, and granted on May 18, 2004.
After NEC, CiteSeer was hosted in 2004 as CiteSeer.IST on the World Wide Web at the College of Information Sciences and Technology, The Pennsylvania State University, and had over 700,000 documents. For enhanced access, performance and research, similar versions of CiteSeer were supported at universities such as the Massachusetts Institute of Technology, the University of Zürich and the National University of Singapore. However, these versions of CiteSeer proved difficult to maintain and are no longer available. Because CiteSeer only indexes freely available papers on the web and does not have access to publisher metadata, it returns lower citation counts than sites such as Google Scholar that have publisher metadata.
CiteSeer had not been comprehensively updated since 2005 due to limitations in its architecture design. It had a representative sampling of research documents in computer and information science, but its coverage was restricted to papers that were publicly available, usually at an author's homepage, or that were submitted by an author. To overcome some of these limitations, a modular and open source architecture for CiteSeer was designed: CiteSeerX.
CiteSeerX
CiteSeerX replaced CiteSeer and all queries to CiteSeer were redirected. CiteSeerX is a public search engine, digital library and repository for scientific and academic papers, primarily with a focus on computer and information science.
However, recently CiteSeerX has been expanding into other scholarly domains such as economics and physics. Released in 2008, it was loosely based on the previous CiteSeer search engine and digital library and is built with a new open source infrastructure, SeerSuite, and new algorithms and their implementations. It was developed by researchers Isaac Councill and C. Lee Giles at the College of Information Sciences and Technology, Pennsylvania State University. It continues to support the goals outlined by CiteSeer: to actively crawl and harvest academic and scientific documents on the public web, and to use a citation index to permit querying by citation and ranking of documents by citation impact. Currently, Lee Giles, Prasenjit Mitra, Susan Gauch, Min-Yen Kan, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Pucktada Treeratpituk, Jian Wu, Douglas Jordan, Steve Carman, Jack Carroll, Jim Jansen, and Shuyi Zheng are or have been actively involved in its development. Recently, a table search feature was introduced. It has been funded by the National Science Foundation, NASA, and Microsoft Research.
CiteSeerX continues to be rated as one of the world's top repositories, and was rated number 1 in July 2010. It currently has over 6 million documents with nearly 6 million unique authors and 120 million citations.
CiteSeerX also shares its software, data, databases and metadata with other researchers, currently by Amazon S3 and by rsync. Its new modular open source architecture and software (available previously on SourceForge but now on GitHub) is built on Apache Solr and other Apache and open source tools, which allows it to be a testbed for new algorithms in document harvesting, ranking, indexing, and information extraction.
CiteSeerX caches some PDF files that it has scanned. As such, each page includes a DMCA link which can be used to report copyright violations.
Current features
Automated information extraction
CiteSeerX uses automated information extraction tools, usually built on machine learning methods such as ParsCit, to extract scholarly document metadata such as title, authors, abstract and citations. As such, there are sometimes errors in authors and titles. Other academic search engines have similar errors.
Focused crawling
CiteSeerX crawls publicly available scholarly documents primarily from author webpages and other open resources, and does not have access to publisher metadata. As such, citation counts in CiteSeerX are usually lower than those in Google Scholar and Microsoft Academic Search, which have access to publisher metadata.
Usage
CiteSeerX has nearly one million users worldwide, based on unique IP addresses, and receives millions of hits daily. Annual downloads of document PDFs were nearly 200 million in 2015.
Data
CiteSeerX data is regularly shared under a Creative Commons BY-NC-SA license with researchers worldwide and has been used in many experiments and competitions.
Thanks to its OAI-PMH endpoint, CiteSeerX is an open archive and its content is indexed like an institutional repository in academic search engines, for instance BASE and Unpaywall consumers.
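As a rough illustration of how a harvester might consume an OAI-PMH endpoint, the sketch below builds a `ListRecords` request URL and parses Dublin Core (`oai_dc`) records from a response. The `verb`, `metadataPrefix` and `resumptionToken` parameters and the XML namespaces come from the OAI-PMH 2.0 protocol; everything else is an assumption. In particular, `https://example.org/oai` is a placeholder and not the CiteSeerX endpoint, the function names are hypothetical, and the sample response is made up so that the code runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and the Dublin Core element set.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) parsed from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        titles = [e.text for e in rec.iterfind(".//dc:title", NS)]
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": titles[0] if titles else None,
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default=None, namespaces=NS)
    return records, (token or None)  # an empty token means the list is complete


# Made-up response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Paper</dc:title>
          <dc:creator>A. Author</dc:creator>
          <dc:creator>B. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE_RESPONSE)
print(list_records_url("https://example.org/oai"))
print(records, token)
```

A real harvester would fetch each URL, call `parse_records` on the response body, and repeat with the returned resumption token until none is returned.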
Other SeerSuite-based search engines
The CiteSeer model had been extended to cover academic documents in business with SmealSearch and in e-business with eBizSearch. However, these were not maintained by their sponsors. An older version of both of these could once be found at BizSeer.IST but is no longer in service.
Other Seer-like search and repository systems have been built for chemistry (ChemXSeer) and for archaeology (ArchSeer). Another, BotSeer, had been built for robots.txt file search. All of these are built on the open source tool SeerSuite, which uses the open source indexer Lucene.
See also
* Arnetminer
* arXiv
* Collection of Computer Science Bibliographies
* DBLP (Digital Bibliography & Library Project)
* Disciplinary repository
* Google Scholar
* List of academic databases and search engines
* Microsoft Academic
* Research Papers in Economics (RePEc)
* Semantic Scholar