
The Lemur Project is a collaboration between the Center for Intelligent Information Retrieval at the University of Massachusetts Amherst and the Language Technologies Institute at Carnegie Mellon University. The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. The project is best known for its Indri and Galago search engines, the ClueWeb09 and ClueWeb12 datasets, and the RankLib learning-to-rank library. The software and datasets are widely used in scientific and research applications, as well as in some commercial applications.

The Lemur Project's software development philosophy emphasizes state-of-the-art accuracy, flexibility, and efficiency. For example, the Indri search engine provides accurate search for large text collections 'out of the box', and data is stored in an accessible manner to support development of new retrieval strategies. Software from the Lemur Project is distributed under open-source licenses that provide flexibility to scientists and software developers. Lemur is written in C, C++, and Java, and it is distributed with its source files and build instructions. The source code can be modified to develop new libraries. Lemur runs on several operating systems, including Linux and Windows.


Features

Lemur supports the following features:
* Indexing:
** English, Chinese, and Arabic text
** Word stemming
** Stop words
** Tokenization
** Passage and incremental indexing
* Retrieval:
** Ad hoc retrieval (TF-IDF and InQuery)
** Passage and cross-lingual retrieval
** Language modeling
*** Query model updating
*** Two-stage smoothing
** Relevance feedback
** Structured query language
** Wildcard term matching
* Distributed IR:
** Query-based sampling
** Database-based ranking (CORI)
** Results merging
* Document clustering
* Summarization
* Simple text processing
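
The indexing and ad hoc retrieval features listed above can be illustrated with a minimal Python sketch. It does not use any Lemur code or API: the stop list, the function names, and the sample documents are all hypothetical, stemming is left out, and the scoring is a plain textbook TF-IDF sum rather than Lemur's own implementation.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration only (not Lemur's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to"}

# Hypothetical sample collection.
DOCS = {
    "d1": "The lemur is a primate of Madagascar",
    "d2": "Search engines index the text of documents",
    "d3": "Indri is a search engine for large text collections",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text):
    """Tokenize and drop stop words (stemming is omitted in this sketch)."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


def tf_idf_rank(query, docs):
    """Return document ids ordered by the sum of tf * idf over query terms."""
    doc_terms = {doc_id: Counter(index_terms(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    query_terms = index_terms(query)
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in query_terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in doc_terms.values() if term in c)
            if df == 0:
                continue  # term not in the collection; contributes nothing
            score += counts[term] * math.log(n_docs / df)
        scores[doc_id] = score
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Calling `tf_idf_rank("lemur primate", DOCS)` places `d1` first, since it is the only document that contains either query term.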


Components

The Lemur Project has the following components:
* Indri search engine in C++
* Galago search engine research framework in Java
* RankLib learning-to-rank library
* Sifaka data mining application
* ClueWeb09 and ClueWeb12 datasets
* Query Log Toolbar


Latest Version

Updates to the Lemur Project components are made twice a year, in June and December. The latest version of the Indri search engine is 5.17. The latest version of the Galago search engine is 3.18. The latest version of the RankLib learning-to-rank library is 2.14. The latest version of the Sifaka data mining application is 1.8.


Indri Search Engine

The Indri search engine is one of the components developed by the Lemur Project. It is open source. Researchers can index data and work with structured documents using Indri's query language and simple command-line instructions. Indri is flexible enough to be adapted to a range of applications. It can also be distributed across a cluster of nodes for high performance. The Indri search engine can handle large collections of data and can parse various data formats such as HTML and XML. The Indri API supports various programming and scripting languages such as C++, Java, C#, and PHP.


Features of Indri Search Engine

* Can make use of multiple document representations
* Explicit term weighting
* Robust query language
* Formally well-grounded
* Highly effective
* Can be efficiently implemented
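
As a rough illustration of explicit term weighting in a structured query language, the following Python sketch assembles a weighted query string. It assumes the `#weight( ... )` and `#N( ... )` ordered-window operators of the Indri query language. The helper names are hypothetical, and the code only builds text, without calling Indri or any of its APIs.

```python
def ordered_window(terms, width=1):
    """Build an ordered-window expression such as '#1(search engine)'.

    Assumes the '#N( ... )' operator form; 'width' is the N in '#N'.
    """
    return "#{}({})".format(width, " ".join(terms))


def weighted_query(weighted_parts):
    """Build '#weight( w1 e1 w2 e2 ... )' from (weight, expression) pairs."""
    body = " ".join("{} {}".format(w, e) for w, e in weighted_parts)
    return "#weight( {} )".format(body)


# Hypothetical example: weight the single term twice as heavily as the phrase.
QUERY = weighted_query([
    (2.0, "lemur"),
    (1.0, ordered_window(["search", "engine"])),
])
```

Here `QUERY` holds the string `#weight( 2.0 lemur 1.0 #1(search engine) )`, in which the term `lemur` carries twice the weight of the exact phrase.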


See also

* List of information retrieval libraries


External links


The Lemur Project website