Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for non-research search applications.
Lucene has been ported to other programming languages including Object Pascal, Perl, C#, C++, Python, Ruby and PHP.
History
Doug Cutting originally wrote Lucene in 1999. Lucene was his fifth search engine; he had previously written two while at Xerox PARC, one at Apple, and a fourth at Excite. It was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. The name Lucene is Doug Cutting's wife's middle name and her maternal grandmother's first name.
Lucene formerly included a number of sub-projects, such as Lucene.NET, Mahout, Tika and Nutch. The latter three are now independent top-level projects.
In March 2010, the Apache Solr search server joined as a Lucene sub-project, merging the developer communities.
Version 4.0 was released on October 12, 2012.
In March 2021, Lucene changed its logo, and Apache Solr became a top-level Apache project again, independent from Lucene.
Features and common use
While suitable for any application that requires full-text indexing and searching capability, Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching.
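Full-text indexing of this kind is commonly built on an inverted index, a mapping from each term to the documents that contain it. The following is a minimal sketch in plain Python, not Lucene's API or implementation; the function names (`tokenize`, `build_index`, `search`), the sample documents, and the AND-only query semantics are all assumptions made for illustration.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
    3: "Nutch is a web crawler",
}
index = build_index(docs)
print(sorted(search(index, "search lucene")))  # prints [1, 2]
```

Looking up a term is a dictionary access rather than a scan over every document, which is the property that makes this structure attractive for search. A production library adds analysis, ranking, and on-disk storage, none of which this sketch attempts.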
Lucene includes a feature to perform a fuzzy search based on edit distance.
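The idea of edit-distance matching can be illustrated with a short sketch. The code below is plain Python rather than Lucene's own implementation: it assumes the Levenshtein metric (insertions, deletions, and substitutions), and the `fuzzy_match` helper, its `max_edits` threshold, and the sample vocabulary are hypothetical names chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(query, vocabulary, max_edits=2):
    """Return the terms within max_edits edits of the query (illustrative threshold)."""
    return [term for term in vocabulary if edit_distance(query, term) <= max_edits]


print(edit_distance("kitten", "sitting"))  # prints 3
print(fuzzy_match("lucine", ["lucene", "lucerne", "solr"]))  # prints ['lucene', 'lucerne']
```

Comparing the query against every term, as this sketch does, is only practical for small vocabularies; it is meant to show what the metric measures, not how a search library evaluates fuzzy queries efficiently.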
Lucene has also been used to implement recommendation systems. For example, Lucene's 'MoreLikeThis' class can generate recommendations for similar documents. In a comparison of the term vector-based similarity approach of 'MoreLikeThis' with citation-based document similarity measures, such as co-citation and co-citation proximity analysis, Lucene's approach excelled at recommending documents with very similar structural characteristics and more narrow relatedness. [M. Schwarzer, M. Schubotz, N. Meuschke, C. Breitinger, V. Markl, and B. Gipp, "Evaluating Link-based Recommendations for Wikipedia," in Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), New York, NY, USA, 2016, pp. 191-200. https://www.gipp.com/wp-content/papercite-data/pdf/schwarzer2016.pdf] In contrast, citation-based document similarity measures tended to be more suitable for recommending more broadly related documents, meaning citation-based approaches may be more suitable for generating serendipitous recommendations, as long as documents to be recommended contain in-text citations.
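The two families of similarity measure compared above can be contrasted in a small sketch. It is plain Python and does not reproduce how 'MoreLikeThis' works internally: cosine similarity over raw term counts stands in for term-vector similarity, and a simple count of shared citing documents stands in for co-citation. All function names and the sample citation data are assumptions made for illustration.

```python
from collections import Counter
import math
import re


def term_vector(text):
    """Count how often each lowercase alphanumeric term occurs in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def co_citation_count(doc_a, doc_b, citations):
    """Number of citing documents whose reference sets contain both doc_a and doc_b."""
    return sum(1 for refs in citations.values() if doc_a in refs and doc_b in refs)


# Text-based similarity depends only on the words the two documents share.
a = term_vector("open source search library")
b = term_vector("open source search server")
print(round(cosine_similarity(a, b), 2))  # prints 0.75

# Hypothetical citation data: citing document -> set of documents it cites.
citations = {"p1": {"a", "b"}, "p2": {"a", "b", "c"}, "p3": {"a"}}
print(co_citation_count("a", "b", citations))  # prints 2
```

The text-based score requires no citation data, whereas the co-citation count is zero for any document that nothing cites, which reflects the condition noted above that citation-based measures apply only where citations are available.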
Lucene-based projects
Lucene itself is just an indexing and search library and does not contain crawling and HTML parsing functionality. However, several projects extend Lucene's capability:
* Apache Nutch – provides web crawling and HTML parsing
* Apache Solr – an enterprise search server
* CrateDB – open source, distributed SQL database built on Lucene
* DocFetcher – a multiplatform desktop search application
* Elasticsearch – an enterprise search server released in 2010
* Kinosearch – a search engine written in Perl and C and a loose port of Lucene. The Socialtext wiki software uses this search engine, and so does the MojoMojo wiki. It is also used by the Human Metabolome Database (HMDB) and the Toxin and Toxin-Target Database (T3DB).
* MongoDB Atlas Search – a cloud-native enterprise search application based on MongoDB and Apache Lucene
* OpenSearch – an open source enterprise search server based on a fork of Elasticsearch 7
* Swiftype – an enterprise search startup based on Lucene
See also
* Enterprise search
* Information extraction
* List of information retrieval libraries
* Text mining