Lemur Project
The Lemur Project is a collaboration between the Center for Intelligent Information Retrieval at the University of Massachusetts Amherst and the Language Technologies Institute at Carnegie Mellon University. The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. The project is best known for its Indri and Galago search engines, the ClueWeb09 and ClueWeb12 datasets, and the RankLib learning-to-rank library. The software and datasets are used widely in scientific and research applications, as well as in some commercial applications. The Lemur Project's software development philosophy emphasizes state-of-the-art accuracy, flexibility, and efficiency. For example, the Indri search engine provides accurate search for large text collections 'out of the box', and data is stored in an accessible manner to support development of new retrieval strategies. S ...


University Of Massachusetts Amherst
The University of Massachusetts Amherst (UMass Amherst, UMass) is a public research university in Amherst, Massachusetts and the sole public land-grant university in the Commonwealth of Massachusetts. Founded in 1863 as an agricultural college, it is the flagship and the largest campus in the University of Massachusetts system, as well as the first established. It is also a member of the Five College Consortium, along with four other colleges in the Pioneer Valley: Amherst College, Smith College, Mount Holyoke College, and Hampshire College. As of Fall 2022, UMass Amherst has an annual enrollment of more than 32,000 students, along with approximately 1,900 faculty members. It is the largest university in Massachusetts by campus size and the second largest by enrollment, after Boston University. The university offers academic degrees in 109 undergraduate, 77 master's and 48 doctoral programs. Programs are coordinated in nine schools and colleges. The Universit ...


Language Technologies Institute
The Language Technologies Institute (LTI) is a research institute at Carnegie Mellon University in Pittsburgh, Pennsylvania, United States, that focuses on language technologies. The institute is home to 33 faculty, and its primary scholarly research covers machine translation, speech recognition, speech synthesis, information retrieval, parsing, information extraction, and multimodal machine learning. Until 1996, the institute existed as the Center for Machine Translation, which was established in 1986. From 1996 onwards, it started awarding degrees, and the name was changed to the Language Technologies Institute. The institute was founded by Professor Jaime Carbonell, who served as director until his death in February 2020. He was followed by Jamie Callan and then Carolyn Rosé, as interim directors.

Academic programs
The institute currently offers two Ph.D. programs, four different types of master's degrees and an undergraduate minor. ...


Carnegie Mellon University
Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. One of its predecessors was established in 1900 by Andrew Carnegie as the Carnegie Technical Schools; it became the Carnegie Institute of Technology in 1912 and began granting four-year degrees in the same year. In 1967, the Carnegie Institute of Technology merged with the Mellon Institute of Industrial Research, founded in 1913 by Andrew Mellon and Richard B. Mellon and formerly a part of the University of Pittsburgh. Carnegie Mellon University has operated as a single institution since the merger. The university consists of seven colleges and independent schools: The College of Engineering, College of Fine Arts, Dietrich College of Humanities and Social Sciences, Mellon College of Science, Tepper School of Business, Heinz College of Information Systems and Public Policy, and the School of Computer Science. The university has its main campus located 5 miles (8 km) from Downto ...


C (programming Language)
C (''pronounced like the letter c'') is a general-purpose computer programming language. It was created in the 1970s by Dennis Ritchie, and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems, device drivers, and protocol stacks, though decreasingly for application software. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems. A successor to the programming language B, C was originally developed at Bell Labs by Ritchie between 1972 and 1973 to construct utilities running on Unix. It was applied to re-implementing the kernel of the Unix operating system. During the 1980s, C gradually gained popularity. It has become one of the most widely used programming languages, with C compilers avail ...


Java (programming Language)
Java is a high-level, class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose programming language intended to let programmers ''write once, run anywhere'' (WORA), meaning that compiled Java code can run on all platforms that support Java without the need to recompile. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of the underlying computer architecture. The syntax of Java is similar to C and C++, but has fewer low-level facilities than either of them. The Java runtime provides dynamic capabilities (such as reflection and runtime code modification) that are typically not available in traditional compiled languages. Java was one of the most popular programming languages in use according to GitHub, particularly for client–server web applications, with a reported 9 million developers. Java was originally developed ...


Stemming
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation. A computer program or subroutine that stems words may be called a ''stemming program'', ''stemming algorithm'', or ''stemmer''.

Examples
A stemmer for English operating on the stem ''cat'' should identify such strings as ''cats'', ''catlike'', and ''catty''. A stemming algorithm might also reduce the words ''fishing'', ''fished'', and ''fisher'' to the stem ''fish''. The stem need not be a word, for examp ...
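
The examples above can be sketched with a deliberately naive suffix-stripping stemmer. This is an illustration only, not the algorithm used by any particular search engine: the suffix list, the minimum stem length, and the function name `naive_stem` are all assumptions chosen so that the ''cat'' and ''fish'' examples work.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first.
SUFFIXES = ("like", "ing", "ed", "er", "y", "s")
MIN_STEM_LENGTH = 3  # assumed lower bound so short words are left alone


def naive_stem(word: str) -> str:
    """Strip one known suffix and collapse a trailing doubled consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            stem = word[: -len(suffix)]
            # "catty" -> "catt" -> "cat": drop one of a doubled final consonant.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word  # no suffix matched; the word is treated as its own stem


if __name__ == "__main__":
    for w in ("cats", "catlike", "catty", "fishing", "fished", "fisher"):
        print(w, "->", naive_stem(w))
```

A rule set this small over-stems and under-stems many real words; practical stemmers use far more carefully ordered rules and exception handling.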


Stop Words
Stop words are the words in a stop list (or ''stoplist'' or ''negative dictionary'') which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant. There is no single universal list of stop words used by all natural language processing tools, nor any agreed upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose. The "general trend in information retrieval systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever".

History of stop words
A predecessor concept was used in creating some concordances. For example, the first Hebrew concordance, Isaac Nathan ben Kalonymus's Me’ir Nativ, contained a one-page list of unindexed words, with nonsubstantive prepositions and conjunctions w ...
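
As a minimal sketch of the filtering step described above, the following removes tokens that appear in a stop list. The eight-word list and the name `remove_stop_words` are assumptions made for illustration; as the text notes, there is no universal list, so the stop list is passed in as a parameter.

```python
from typing import Iterable

# Hypothetical small stop list; real systems choose their own (or none at all).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def remove_stop_words(
    tokens: Iterable[str], stop_words: frozenset = STOP_WORDS
) -> list:
    """Return the tokens whose lower-cased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


if __name__ == "__main__":
    sentence = "The index of the collection is large"
    print(remove_stop_words(sentence.split()))
```

Passing an empty `frozenset()` as `stop_words` reproduces the "no stop list whatsoever" configuration mentioned in the quotation.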




Tokenization (lexical Analysis)
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a ''lexer'', ''tokenizer'', or ''scanner'', although ''scanner'' is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

Applications
A lexer forms the first phase of a compiler frontend in modern processing. Analysis generally occurs in one pass. In older languages such as ALGOL, the initial stage was instead line reconstruction, which performed unstropping and removed whitespace and comments (and had scannerless parsers, with no separate lexer). These steps are now done as part of the lexer. Lexers and parsers are most often used for compilers, but ...
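
To make the idea of converting characters into tokens with an assigned meaning concrete, here is a small one-pass lexer sketch for simple arithmetic-style input. The token kinds (`NUMBER`, `IDENT`, `OP`) and the function name `tokenize` are assumptions for this example and do not come from any specific compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # the assigned meaning, e.g. "NUMBER"
    text: str  # the matched characters


# Hypothetical token specification; order matters because earlier entries win.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
MASTER_PATTERN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens in a single left-to-right pass, dropping whitespace."""
    for match in MASTER_PATTERN.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected character {match.group()!r}")
        yield Token(kind, match.group())


if __name__ == "__main__":
    for token in tokenize("x = 3.5 + y1"):
        print(token)
```

The resulting token stream is what a parser would consume next; whitespace is discarded inside the lexer, in line with the note above that such steps are now part of lexing.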


Index (search Engine)
Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is ''web indexing''. Popular engines focus on the full-text indexing of online, natural language documents. Media types such as pictures, video, audio, and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines inde ...
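
One common way to store parsed text for fast lookup is an inverted index that maps each term to the documents containing it. The sketch below assumes that structure (the passage above does not prescribe one) and uses whitespace splitting as a stand-in for real parsing; `build_index` and `search` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lower-cased term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {
        1: "Indri search engine",
        2: "Galago search engine",
        3: "learning to rank library",
    }
    index = build_index(docs)
    print(search(index, "search engine"))
```

Building the index once and answering queries by intersecting term sets avoids rescanning every document per query, which is the speed benefit that indexing is meant to provide.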


Relevance Feedback
Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.

Explicit feedback
Explicit feedback is obtained from assessors of relevance indicating the relevance of a document retrieved for a query. This type of feedback is defined as explicit only when the assessors (or other users of a system) know that the feedback provided is interpreted as relevance judgments. Users may indicate relevance explicitly using a ''binary'' or ''graded'' relevance system. Binary relevance feedback indicates that a document is either relevant or irrelevant for a given query. Graded relevance feedback i ...
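
The feedback loop can be illustrated with a very small blind ("pseudo") feedback sketch: the top-ranked results are assumed to be relevant, and their most frequent new terms are appended to the query before it is run again. The term-counting heuristic, the parameter defaults, and the name `expand_query` are assumptions for illustration, not a description of any particular system's method.

```python
from collections import Counter
from typing import List, Sequence


def expand_query(
    query_terms: Sequence[str],
    ranked_docs: Sequence[str],
    top_k: int = 2,
    extra_terms: int = 2,
) -> List[str]:
    """Append the most frequent new terms from the top_k ranked documents."""
    seen = {t.lower() for t in query_terms}
    counts: Counter = Counter()
    for text in ranked_docs[:top_k]:  # blind assumption: these are relevant
        counts.update(t for t in text.lower().split() if t not in seen)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    best = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in best[:extra_terms]]


if __name__ == "__main__":
    initial_results = [
        "lemur toolkit indri retrieval",
        "indri retrieval engine",
        "cooking recipes",
    ]
    print(expand_query(["lemur"], initial_results))
```

With explicit feedback, `ranked_docs[:top_k]` would be replaced by the documents a user actually marked as relevant; the expansion step itself stays the same.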




Wildcard Character
In software, a wildcard character is a kind of placeholder represented by a single character, such as an asterisk (*), which can be interpreted as a number of literal characters or an empty string. It is often used in file searches so the full name need not be typed.

Telecommunication
In telecommunications, a wildcard is a character that may be substituted for any of a defined subset of all possible characters.
* In high-frequency (HF) radio automatic link establishment, the wildcard character may be substituted for any one of the 36 upper-case alphanumeric characters.
* Whether the wildcard character represents a single character or a string of characters must be specified.

Computing
In computer (software) technology, a wildcard is a symbol used to replace or represent one or more characters. Algorithms for matching wildcards have been developed in a number of recursive and non-recursive varieties.

File and directory patterns
When specifying file names (or paths) in CP/M, D ...
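
A non-recursive matcher for the asterisk wildcard can be sketched as follows. This example assumes the asterisk stands for any run of characters, including the empty string, and supports no other wildcard; the function name `wildcard_match` is hypothetical.

```python
def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if text matches pattern, where '*' matches any string."""
    p = t = 0
    star = -1  # position of the most recent '*' in pattern, or -1 if none
    mark = 0   # position in text where that '*' started matching
    while t < len(text):
        if p < len(pattern) and pattern[p] == "*":
            star, mark = p, t  # remember the star; first try matching nothing
            p += 1
        elif p < len(pattern) and pattern[p] == text[t]:
            p += 1
            t += 1
        elif star != -1:
            # Backtrack: let the last '*' absorb one more character of text.
            p = star + 1
            mark += 1
            t = mark
        else:
            return False
    # Any trailing '*' in the pattern can match the empty string.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


if __name__ == "__main__":
    print(wildcard_match("*.txt", "notes.txt"))
    print(wildcard_match("a*b", "acd"))
```

The loop keeps only the most recent asterisk as a backtracking point, which is what lets it avoid recursion.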