HOME

TheInfoList



OR:

Apache Lucene is a
free and open-source Free and open-source software (FOSS) is a term used to refer to groups of software consisting of both free software and open-source software where anyone is freely licensed to use, copy, study, and change the software in any way, and the source ...
search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
software library In computer science, a library is a collection of non-volatile resources used by computer programs, often for software development. These may include configuration data, documentation, help data, message templates, pre-written code and subr ...
, originally written in
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
by
Doug Cutting Douglass Read Cutting is a software designer, advocate, and creator of open-source search technology. He founded two technology projects, Lucene, and Nutch, with Mike Cafarella. Both projects are now managed through the Apache Software Foundation ...
. It is supported by the
Apache Software Foundation The Apache Software Foundation (ASF) is an American nonprofit corporation (classified as a 501(c)(3) organization in the United States) to support a number of open source software projects. The ASF was formed from a group of developers of the A ...
and is released under the Apache Software License. Lucene is widely used as a standard foundation for non-research search applications. Lucene has been ported to other programming languages including
Object Pascal Object Pascal is an extension to the programming language Pascal (programming language), Pascal that provides object-oriented programming (OOP) features such as Class (computer programming), classes and Method (computer programming), methods. ...
,
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offici ...
, C#,
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
,
Ruby A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sa ...
and
PHP PHP is a general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by The PHP Group ...
.


History

Doug Cutting Douglass Read Cutting is a software designer, advocate, and creator of open-source search technology. He founded two technology projects, Lucene, and Nutch, with Mike Cafarella. Both projects are now managed through the Apache Software Foundation ...
originally wrote Lucene in 1999. Lucene was his fifth search engine, having previously written two while at Xerox PARC, one at Apple, and a fourth at Excite. It was initially available for download from its home at the
SourceForge SourceForge is a web service that offers software consumers a centralized online location to control and manage open-source software projects and research business software. It provides source code repository hosting, bug tracking, mirrorin ...
web site. It joined the Apache Software Foundation's
Jakarta Jakarta (; , bew, Jakarte), officially the Special Capital Region of Jakarta ( id, Daerah Khusus Ibukota Jakarta) is the capital and largest city of Indonesia. Lying on the northwest coast of Java, the world's most populous island, Jakarta ...
family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. The name Lucene is Doug Cutting's wife's middle name and her maternal grandmother's first name. Lucene formerly included a number of sub-projects, such as Lucene.NET,
Mahout A mahout is an elephant rider, trainer, or keeper. Mahouts were used since antiquity for both civilian and military use. Traditionally, mahouts came from ethnic groups with generations of elephant keeping experience, with a mahout retaining h ...
, Tika and
Nutch Apache Nutch is a highly extensible and scalable open source web crawler software project. Features Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular archite ...
. These three are now independent top-level projects. In March 2010, the
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features a ...
search server joined as a Lucene sub-project, merging the developer communities. Version 4.0 was released on October 12, 2012. In March 2021, Lucene changed its logo, and
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features a ...
became a top level Apache project again, independent from Lucene.


Features and common use

While suitable for any application that requires full text indexing and searching capability, Lucene is recognized for its utility in the implementation of
Internet search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a l ...
s and local, single-site searching. Lucene includes a feature to perform a fuzzy search based on
edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to tr ...
. Lucene has also been used to implement recommendation systems. For example, Lucene's 'MoreLikeThis' Class can generate recommendations for similar documents. In a comparison of the term vector-based similarity approach of 'MoreLikeThis' with citation-based document similarity measures, such as
co-citation Co-citation is the frequency with which two documents are ''cited'' together by other documents.. If at least one other document cites two documents in common, these documents are said to be ''co-cited''. The more co-citations two documents recei ...
and co-citation proximity analysis, Lucene's approach excelled at recommending documents with very similar structural characteristics and more narrow relatedness.M. Schwarzer, M. Schubotz, N. Meuschke, C. Breitinger, V. Markl, and B. Gipp, https://www.gipp.com/wp-content/papercite-data/pdf/schwarzer2016.pdf "Evaluating Link-based Recommendations for Wikipedia" in Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), New York, NY, USA, 2016, pp. 191-200. In contrast, citation-based document similarity measures tended to be more suitable for recommending more broadly related documents, meaning citation-based approaches may be more suitable for generating
serendipitous Serendipity is an unplanned fortunate discovery. Serendipity is a common occurrence throughout the history of product invention and scientific discovery. Etymology The first noted use of "serendipity" was by Horace Walpole on 28 January 1754. ...
recommendations, as long as documents to be recommended contain in-text citations.


Lucene-based projects

Lucene itself is just an indexing and search library and does not contain crawling and HTML
parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
functionality. However, several projects extend Lucene's capability: *
Apache Nutch Apache Nutch is a highly extensible and scalable open source web crawler software project. Features Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architec ...
– provides
web crawling A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (''web spid ...
and HTML parsing *
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features a ...
– an enterprise search server *
CrateDB CrateDB is a distributed SQL database management system that integrates a fully searchable document-oriented data store. It is open-source, written in Java, based on a shared-nothing architecture, and designed for high scalability. CrateDB inc ...
– open source, distributed SQL database built on Lucene *
DocFetcher DocFetcher is a free and open source desktop search application. It runs on Windows, Mac OS X and Linux and is written in Java. The application has a graphical user interface, which is written using the Standard Widget Toolkits. The program i ...
– a
multiplatform In computing, cross-platform software (also called multi-platform software, platform-agnostic software, or platform-independent software) is computer software that is designed to work in several computing platforms. Some cross-platform software r ...
desktop search application *
Elasticsearch Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-l ...
– an enterprise search server released in 2010 * Kinosearch – a search engine written in
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offici ...
and C and a loose
port A port is a maritime facility comprising one or more wharves or loading areas, where ships load and discharge cargo and passengers. Although usually situated on a sea coast or estuary, ports can also be found far inland, such as Ham ...
of Lucene. The
Socialtext Socialtext Incorporated was a company based in Palo Alto, California, that produced enterprise social software for companies. It offered an integrated suite of wiki tools and social software applications, including microblogging, user profiles, ...
wiki software uses this search engine, and so does the
MojoMojo Catalyst is an open source web application framework written in Perl, that closely follows the model–view–controller (MVC) architecture, and supports a number of experimental web patterns. It is written using Moose, a modern object system fo ...
wiki. It is also used by the
Human Metabolome Database The Human Metabolome Database (HMDB) is a comprehensive, high-quality, freely accessible, online database of small molecule metabolites found in the human body. Created by the Human Metabolome Project funded by Genome Canada. One of the first d ...
(HMDB) and the
Toxin and Toxin-Target Database The Toxin and Toxin-Target Database (T3DB), also known as the Toxic Exposome Database, is a freely accessible online database of common substances that are toxic to humans, along with their protein, DNA or organ Biological target, targets. The dat ...
(T3DB). *
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Serve ...
Atlas Search – a cloud-native enterprise search application based on MongoDB and Apache Lucene *
OpenSearch OpenSearch is a collection of technologies that allow the publishing of search results in a format suitable for syndication and aggregation. Introduced in 2005, it is a way for websites and search engines to publish search results in a standard ...
– an open source enterprise search server based on a fork of Elasticsearch 7 *
Swiftype Swiftype is a search and index company based in San Francisco, California, that provides search software for organizations, websites, and computer programs. Notable customers include AT&T, Dr. Pepper, Hubspot and TechCrunch. History Swiftype was ...
– an enterprise search startup based on Lucene


See also

*
Enterprise search Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience. "Enterprise search" is used to describe the software of search information within an ente ...
*
Information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
*
List of information retrieval libraries This is a list of free information retrieval libraries, which are libraries used in software development for performing it retrieval functions. It is not a complete list of such libraries, but is instead a list of free information retrieval lib ...
*
Text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...


References


Bibliography

* *


External links

* {{Authority control
Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as ...
Free search engine software Java (programming language) libraries C Sharp libraries Cross-platform software Software using the Apache license Search engine software Pascal (programming language) software 1999 software