Vocabulary Mismatch

	Vocabulary Mismatch Vocabulary mismatch is a common phenomenon in the use of natural languages, occurring when different people name the same thing or concept differently. Furnas et al. (1987) were perhaps the first to study the vocabulary mismatch problem quantitatively. Their results show that, on average, 80% of the time different people (experts in the same field) will name the same thing differently. There are usually tens of possible names that can be attributed to the same thing. This research motivated the work on latent semantic indexing. The vocabulary mismatch between user-created queries and relevant documents in a corpus causes the term mismatch problem in information retrieval. Zhao and Callan (2010) (Zhao, L. and Callan, J., Term Necessity Prediction, Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010), Toronto, Canada, 2010) were perhaps the first to quantitatively study the vocabulary mismatch problem in a retrieval setting. Their results sho ...
	Latent Semantic Indexing Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent documents) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents, while values close to 0 represent very dissimilar documents. An information retrieval technique using latent semantic structure was patented in 1988 (US Patent 4,839, ...)
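
The SVD-and-cosine procedure described in this entry can be sketched roughly as follows. This is a minimal illustration, assuming NumPy is available; the function name `lsa_document_similarity`, the toy term list, and the choice of `k = 2` are all hypothetical and not taken from the source.

```python
import numpy as np

def lsa_document_similarity(term_doc, k):
    """Cosine similarity between documents in a rank-k latent space (sketch)."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each row of docs_k is one document
    # expressed in k "concept" dimensions instead of one dimension per term.
    docs_k = (s[:k, None] * Vt[:k, :]).T          # shape: (n_docs, k)
    norms = np.linalg.norm(docs_k, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                       # avoid division by zero
    unit = docs_k / norms
    return unit @ unit.T                          # pairwise cosine similarities

# Toy term-document count matrix: rows are terms, columns are documents.
# Rows: car, automobile, engine, flower, petal (hypothetical vocabulary).
term_doc = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)

sim = lsa_document_similarity(term_doc, k=2)
```

In this toy matrix, document 0 uses "car" and document 1 uses "automobile"; their cosine similarity on the raw counts is 0.5, whereas in the two-dimensional latent space `sim[0, 1]` comes out close to 1 and `sim[0, 3]` close to 0.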
	Information Retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds. Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals and other documents, and that stores and manages those documents. Web search engines are the most visible IR applications. Overview An information retrieval process begins when a user or searcher enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In inf ...
	Binary Independence Model The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make estimating the probability of document/query similarity feasible. Definitions The Binary Independence Assumption is that documents are binary vectors. That is, only the presence or absence of terms in documents is recorded. Terms are independently distributed in the set of relevant documents and they are also independently distributed in the set of irrelevant documents. The representation is an ordered set of Boolean variables. That is, the representation of a document or query is a vector with one Boolean element for each term under consideration. More specifically, a document is represented by a vector d = (x_1, ..., x_m), where x_t = 1 if term ''t'' is present in the document ''d'' and x_t = 0 if it is not. Many documents can have the same vector representation with this simplification. Queries are represented in a similar way. "Independ ...
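
The binary representation described in this entry can be illustrated with a short sketch. The helper `binary_vector`, the whitespace tokenization, and the three-term vocabulary are assumptions made for the example only.

```python
def binary_vector(text, vocabulary):
    """Return one 0/1 entry per vocabulary term: 1 if present, else 0."""
    # Whitespace tokenization is a simplifying assumption for this sketch.
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

vocabulary = ["cat", "dog", "fish"]      # hypothetical terms under consideration

d1 = binary_vector("the cat chased the cat", vocabulary)
d2 = binary_vector("a cat sat", vocabulary)
q = binary_vector("dog fish", vocabulary)

print(d1, d2, q)   # [1, 0, 0] [1, 0, 0] [0, 1, 1]
```

Because term counts and word order are discarded, `d1` and `d2` end up with the same vector even though the underlying texts differ, which is the simplification the entry refers to.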
	Stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation. A computer program or subroutine that stems words may be called a ''stemming program'', ''stemming algorithm'', or ''stemmer''. Examples A stemmer for English operating on the stem ''cat'' should identify such strings as ''cats'', ''catlike'', and ''catty''. A stemming algorithm might also reduce the words ''fishing'', ''fished'', and ''fisher'' to the stem ''fish''. The stem need not be a word, for examp ...
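
As a rough illustration of the examples in this entry, a toy suffix-stripping stemmer might look like the sketch below. The suffix list and the name `toy_stem` are invented for this example; this is not the Porter algorithm or any published stemmer, and it will mis-stem many real English words.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("like", "ing", "ed", "er", "ty", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word   # no suffix matched, so the word is returned unchanged

for w in ["cats", "catlike", "catty", "fishing", "fished", "fisher"]:
    print(w, "->", toy_stem(w))
```

With this list, ''cats'', ''catlike'', and ''catty'' all map to ''cat'', and ''fishing'', ''fished'', and ''fisher'' all map to ''fish'', so a search engine conflating on stems would treat each group as equivalent.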
	Full-text Indexing In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques became common in online bibliographic databases in the 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems. Indexing When dealing with a small number of documents, it is possible for the full-text-search engine ...
	Folksonomy Folksonomy is a classification system in which end users apply public tags to online items, typically to make those items easier for themselves or others to find later. Over time, this can give rise to a classification system based on those tags and how often they are applied or searched for, in contrast to a taxonomic classification designed by the owners of the content and specified when it is published. This practice is also known as collaborative tagging, social classification, social indexing, and social tagging. Folksonomy was originally "the result of personal free tagging of information ... for one's own retrieval", but online sharing and interaction expanded it into collaborative forms. ''Social tagging'' is the application of tags in an open online environment where each user's tags are visible to other users. ''Collaborative tagging'' (also known as group tagging) is tagging performed by a group of users. This type of folksonomy is commonly used in cooperative and col ...
	Query Expansion Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input (what words were typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents. Query expansion involves techniques such as: finding synonyms of words, and searching for the synonyms as well; finding semantically related words (e.g. antonyms, meronyms, hyponyms, hypernyms); finding all the various morphological forms of words by stemming each word in the search query; fixing spelling errors and automatically searching for the corrected form or suggesting it in the results; and re-weighting the terms in the original query. Query expansion is a methodology studied in the field of computer science, particularly within the realm of natural langu ...
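
The synonym-based technique mentioned in this entry can be sketched as follows. The `SYNONYMS` table and the function `expand_query` are hypothetical; a real system would typically draw synonyms from a thesaurus or from corpus statistics rather than a hand-written dictionary.

```python
# Hypothetical hand-written synonym table, for illustration only.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return one group of alternatives per query term (term plus synonyms)."""
    expanded = []
    for term in query.lower().split():
        # Terms without an entry in the table are kept unchanged.
        expanded.append([term] + synonyms.get(term, []))
    return expanded

groups = expand_query("cheap car rental")
print(groups)
# [['cheap', 'inexpensive'], ['car', 'automobile', 'vehicle'], ['rental']]
```

One way to read the result is as an AND of ORs: a document matches if it contains at least one alternative from every group, so a page mentioning "inexpensive automobile rental" would now be retrieved for the original query.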
	Conjunctive Normal Form In Boolean logic, a formula is in conjunctive normal form (CNF) or clausal normal form if it is a conjunction of one or more clauses, where a clause is a disjunction of literals; otherwise put, it is a product of sums or an AND of ORs. As a canonical normal form, it is useful in automated theorem proving and circuit theory. All conjunctions of literals and all disjunctions of literals are in CNF, as they can be seen as conjunctions of one-literal clauses and conjunctions of a single clause, respectively. As in the disjunctive normal form (DNF), the only propositional connectives a formula in CNF can contain are and, or, and not. The not operator can only be used as part of a literal, which means that it can only precede a propositional variable or a predicate symbol. In automated theorem proving, the notion "''clausal normal form''" is often used in a narrower sense, meaning a particular representation of a CNF formula as a set of sets of literals. Examples and non-examples ...
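
The "AND of ORs" structure described in this entry maps directly onto nested collections of literals. The sketch below assumes a hypothetical encoding in which a literal is a `(name, is_positive)` pair, a clause is a list of literals, and a formula is a list of clauses; the function name `evaluate_cnf` is likewise an assumption.

```python
def evaluate_cnf(clauses, assignment):
    """Evaluate a CNF formula under a truth assignment (dict: name -> bool)."""
    # A clause holds if ANY of its literals is satisfied;
    # the formula holds if ALL of its clauses hold.
    return all(
        any(assignment[name] == is_positive for name, is_positive in clause)
        for clause in clauses
    )

# (A or not B) and (C)
formula = [
    [("A", True), ("B", False)],
    [("C", True)],
]

print(evaluate_cnf(formula, {"A": False, "B": False, "C": True}))   # True
print(evaluate_cnf(formula, {"A": False, "B": True, "C": True}))    # False
```

Negation appears only inside a literal (the `False` flag), which mirrors the restriction that the not operator may only precede a propositional variable.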
	Rocchio Algorithm The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant (Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: ''An Introduction to Information Retrieval'', pages 163-167. Cambridge University Press, 2009). Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the search engine's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a query is dictated by the weights of the a, b, c variables listed below in the Algorithm section. Algorithm The formula and variabl ...
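
Since the entry is cut off before its formula, the following is only a sketch of the commonly cited form of the Rocchio update, in which the revised query is a times the original query, plus b times the mean of the relevant document vectors, minus c times the mean of the irrelevant ones. It assumes NumPy; the function name `rocchio` and the default weights (1.0, 0.75, 0.15) are illustrative assumptions, not values given in the source.

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, a=1.0, b=0.75, c=0.15):
    """Return a revised query vector using the usual Rocchio update (sketch)."""
    q = a * np.asarray(query, dtype=float)
    if len(relevant):
        # Move the query toward the centroid of the relevant documents.
        q = q + b * np.mean(relevant, axis=0)
    if len(nonrelevant):
        # Move the query away from the centroid of the irrelevant documents.
        q = q - c * np.mean(nonrelevant, axis=0)
    return q

query = [1, 0, 0]
relevant = [[1, 1, 0], [1, 0, 0]]     # vectors the user marked relevant
nonrelevant = [[0, 0, 1]]             # vectors the user marked irrelevant

revised = rocchio(query, relevant, nonrelevant)
```

With these toy vectors the revised query gains weight on the second term, which occurs in a relevant document, and takes a small negative weight on the third term, which occurs only in the irrelevant one; implementations often clip such negative weights to zero.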
	Linguistic Research Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguistics is concerned with both the cognitive and social aspects of language. It is considered a scientific field as well as an academic discipline; it has been classified as a social science, natural science, cognitive science (Thagard, Paul, Cognitive Science, The Stanford Encyclopedia of Philosophy (Fall 2008 Edition), Edward N. Zalta (ed.)), or part of the humanities. Traditional areas of linguistic analysis correspond to phenomena found in human linguistic systems, such as syntax (rules governing the structure of sentences); semantics (meaning); morphology (structure of words); phonetics (speech sounds and equivalent gestures in sign languages); phonology (the abstract sound system of a particular language); and pragmatics (how social contex ...
	Information Retrieval Techniques Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, and any observable pattern in any medium can be said to convey some amount of information. Whereas digital signals and other data use discrete signs to convey information, other phenomena and artifacts such as analog signals, poems, pictures, music or other sounds, and currents convey information in a more continuous form. Information is not knowledge itself, but the meaning that may be derived from a representation through interpretation. Information is often processed iteratively: Data available at one step are processed into information to be interpreted and processed at the next step. For example, in written text each symbol or letter conveys information relevant to the word it is part of, each word conveys information relevan ...