Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License. Lucene is widely used as a standard foundation for production search applications.
Lucene has been ported to other programming languages including Object Pascal, Perl, C#, C++, Python, Ruby and PHP.
History
Doug Cutting originally wrote Lucene in 1999. Lucene was his fifth search engine. He had previously written two while at Xerox PARC, one at Apple, and a fourth at Excite. It was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. The name Lucene is Doug Cutting's wife's middle name and her maternal grandmother's first name.
Lucene formerly included a number of sub-projects, such as Lucene.NET, Mahout, Tika and Nutch. These are now independent top-level projects.
In March 2010, the Apache Solr search server joined as a Lucene sub-project, merging the developer communities.
Version 4.0 was released on October 12, 2012.
In March 2021, Lucene changed its logo, and Apache Solr became a top-level Apache project again, independent from Lucene.
Features and common use
While suitable for any application that requires full-text indexing and searching capability, Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching.
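The idea behind full-text indexing and searching can be illustrated with a toy inverted index. The following is a minimal sketch in plain Java and does not use or reproduce Lucene's API: the class name InvertedIndexSketch, its methods, and the simple lowercase tokenizer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index. Hypothetical illustration only, not Lucene's API.
final class InvertedIndexSketch {
    // Maps each term to the sorted ids of the documents that contain it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Splits text into lowercase terms on runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Indexes a document and returns the id assigned to it.
    int addDocument(String text) {
        int id = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Returns the ids of documents that contain every query term (AND semantics).
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Lucene is a search library");              // id 0
        index.addDocument("Solr is a search server built on Lucene"); // id 1
        index.addDocument("Nutch provides web crawling");             // id 2
        System.out.println(index.search("search lucene")); // prints [0, 1]
        System.out.println(index.search("crawling"));      // prints [2]
    }
}
```

A production library adds much more on top of this idea, such as text analysis, relevance scoring, and on-disk index structures. The sketch only shows why looking up a term is cheap once the index has been built.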
Lucene includes a feature to perform a fuzzy search based on edit distance.
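To make the notion of edit distance concrete, the following sketch computes the Levenshtein distance between two strings with the textbook dynamic-programming method and uses it to filter a list of candidate terms. It is an illustration in plain Java, not Lucene's implementation: the class EditDistanceSketch, the helper fuzzyMatch, and the maxEdits parameter are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of edit-distance matching, not Lucene's implementation.
final class EditDistanceSketch {
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions that turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate terms within maxEdits of the query, in input order.
    static List<String> fuzzyMatch(String query, List<String> terms, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (levenshtein(query, term) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

For example, under this sketch levenshtein("lucene", "lucine") is 1, so a query for "lucene" with maxEdits set to 1 would also match the misspelling "lucine". Comparing the query against every term, as done here, is only practical for small vocabularies.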
Lucene has also been used to implement recommendation systems. For example, Lucene's 'MoreLikeThis' class can generate recommendations for similar documents. In a comparison of the term vector-based similarity approach of 'MoreLikeThis' with citation-based document similarity measures, such as co-citation and co-citation proximity analysis, Lucene's approach excelled at recommending documents with very similar structural characteristics and narrower relatedness. [M. Schwarzer, M. Schubotz, N. Meuschke, C. Breitinger, V. Markl, and B. Gipp, "Evaluating Link-based Recommendations for Wikipedia", in Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), New York, NY, USA, 2016, pp. 191-200. https://www.gipp.com/wp-content/papercite-data/pdf/schwarzer2016.pdf] In contrast, citation-based document similarity measures tended to be better suited to recommending more broadly related documents, meaning citation-based approaches may be more suitable for generating serendipitous recommendations, as long as the documents to be recommended contain in-text citations.
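As a rough illustration of what a term vector-based similarity measure involves, the sketch below builds term-frequency vectors for two texts and compares them with cosine similarity. This is a simplified stand-in in plain Java and makes no claim to reproduce how 'MoreLikeThis' selects or weights terms: the class TermVectorSimilaritySketch and its methods are hypothetical, and raw term counts are assumed as weights.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of term-vector similarity, not Lucene's MoreLikeThis.
final class TermVectorSimilaritySketch {
    // Builds a term-frequency vector: lowercase term -> number of occurrences.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tf.merge(t, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity of two term vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Documents that share many terms score close to 1 and documents with no terms in common score 0, which fits the observation above that a term-based approach favors closely related documents over broadly related ones.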
Lucene-based projects
Lucene itself is just an indexing and search library and does not contain crawling and HTML parsing functionality. However, several projects extend Lucene's capability:
* Apache Nutch – provides web crawling and HTML parsing
* Apache Solr – an enterprise search server
* CrateDB – open source, distributed SQL database built on Lucene
* DocFetcher – a multiplatform desktop search application
* Elasticsearch – an enterprise search server released in 2010
* Kinosearch – a search engine written in Perl and C and a loose port of Lucene. The Socialtext wiki software uses this search engine, and so does the MojoMojo wiki. It is also used by the Human Metabolome Database (HMDB) and the Toxin and Toxin-Target Database (T3DB).
* MongoDB Atlas Search – a cloud-native enterprise search application based on MongoDB and Apache Lucene
* OpenSearch – an open source enterprise search server based on a fork of Elasticsearch 7
* Swiftype – an enterprise search startup based on Lucene
See also
* Enterprise search
* Information extraction
* Information retrieval
* Text mining