Sketch Engine

	Sketch Engine Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour ( lexicographers, researchers in corpus linguistics, translators or language learners) to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages. History of development Sketch Engine is a product of Lexical Computing Limited, a company founded in 2003 by the lexicographer and research scientist Adam Kilgarriff. He started a collaboration with Pavel Rychlý, a computer scientist working at the Natural Language Processing Centre, Masaryk University, and the developer of Manatee and Bonito (two major parts of the software suite), and introduce ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Adam Kilgarriff Adam Kilgarriff (12 February 1960 – 16 May 2015) was a corpus linguist, lexicographer, and co-author of Sketch Engine. Life His parents were booksellers. He spent one year as a volunteer in Kenya 1978–1979 then began studying at Cambridge University, graduating with a first class BA degree in philosophy and engineering in 1982. His first job was as a Housing Officer for the London and Quadrant Housing Trust. At the same time he studied at the South West London College. In 1987, he left his job and started an MSc in intelligent knowledge-based systems at the University of Sussex, from where he graduated the following year, continuing a DPhil in computational linguistics with thesis Polysemy (1992). In 2008 he made a return trip to Kenya with his old friend Raphael. He was also a participant in the Hastings Half Marathon for many years. In November 2014, he was diagnosed with stage 4 bowel cancer which he succumbed to in May 2015. After the diagnosis he started his own blo ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Text Analysis Content analysis is the study of documents and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner. One of the key advantages of using content analysis to analyse social phenomena is its non-invasive nature, in contrast to simulating social experiences or collecting survey answers. Practices and philosophies of content analysis vary between academic disciplines. They all involve systematic reading or observation of texts or artifacts which are assigned labels (sometimes called codes) to indicate the presence of interesting, meaningful pieces of content. By systematically labeling the content of a set of texts, researchers can analyse patterns of content quantitatively using statistical methods, or use qualitative methods to analyse meanings of content within texts. Computers are increasingly used in content analysis to autom ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Part-of-speech Tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, by a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms. Principle Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Trend Analysis Trend analysis is the widespread practice of collecting information and attempting to spot a pattern. In some fields of study, the term has more formally defined meanings. Although trend analysis is often used to predict future events, it could be used to estimate uncertain events in the past, such as how many ancient kings probably ruled between two dates, based on data such as the average years which other known kings reigned. Project management In project management, trend analysis is a mathematical technique that uses historical results to predict future outcome. This is achieved by tracking variances in cost and schedule performance. In this context, it is a project management quality control tool. Statistics In statistics, trend analysis often refers to techniques for extracting an underlying pattern of behavior in a time series which would otherwise be partly or nearly completely hidden by noise. If the trend can be assumed to be linear, trend analysis can be undertaken w ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Keyword (linguistics) In corpus linguistics a key word is a word which occurs in a text more often than we would expect to occur by chance alone.Scott, M. & Tribble, C., 2006, ''Textual Patterns: keyword and corpus analysis in language education'', Amsterdam: Benjamins, 55. Key words are calculated by carrying out a statistical test (e.g., loglinear or chi-squared) which compares the word frequencies in a text against their expected frequencies derived in a much larger corpus, which acts as a reference for general language use. Keyness is then the quality a word or phrase has of being "key" in its context. Compare this with collocation In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words ..., the quality linking two words or phrases usually assumed to be within a given span of each other. Keyness is a ''textu ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Terminology Extraction Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus. In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers, web services, recommender systems, etc. The development of terminology extraction is also essential to the language industry. One of the first steps to model a knowledge domain is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature. Typically, approaches to ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	N-gram In the fields of computational linguistics and probability, an ''n''-gram (sometimes also called Q-gram) is a contiguous sequence of ''n'' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The ''n''-grams typically are collected from a text or speech corpus. When the items are words, -grams may also be called ''shingles''. Using Latin numerical prefixes, an ''n''-gram of size 1 is referred to as a "unigram"; size 2 is a " bigram" (or, less commonly, a "digram"); size 3 is a " trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on. In computational biology, a polymer or oligomer of a known size is called a ''k''-mer instead of an ''n''-gram, with specific names using Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc. App ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Collocation In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated. An example of a phraseological collocation is the expression ''strong tea''. While the same meaning could be conveyed by the roughly equivalent ''powerful tea'', this adjective does not modify ''tea'' frequently enough for English speakers to become accustomed to its co-occurrence and regard it as idiomatic or unmarked. (By way of counterexample, ''powerful'' is idiomatically preferred to ''strong'' when modifying a ''computer'' or a ''car''.) There are about six main types of collocations: adjective + noun, noun + noun (such as collective nouns), verb + noun, ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Concordance (publishing) A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre- computer era. A concordance is more than an index, with additional material such as commentary, definitions and topical cross-indexing which makes producing one a labor-intensive process even when assisted by computers. In the precomputing era, search technology was unavailable, and a concordance offered readers of long works such as the Bible something comparable to search results for every word that they would have been likely to search for. Today, the ability to combine the result of queries concerning multiple terms (such as searching for words ne ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Thesaurus A thesaurus (plural ''thesauri'' or ''thesauruses'') or synonym dictionary is a reference work for finding synonyms and sometimes antonyms of words. They are often used by writers to help find the best word to express an idea: Synonym dictionaries have a long history. The word 'thesaurus' was used in 1852 by Peter Mark Roget for his '' Roget's Thesaurus''. While some thesauri, such as ''Roget's Thesaurus'', group words in a hierarchical hypernymic taxonomy of concepts, others are organized alphabetically or in some other way. Most thesauri do not include definitions, but many dictionaries include listings of synonyms. Some thesauri and dictionary synonym notes characterize the distinctions between similar words, with notes on their "connotations and varying shades of meaning".''American Heritage Dictionary of the English Language'', 5th edition, Houghton Mifflin Harcourt 2011, , p. xxvii Some synonym dictionaries are primarily concerned with differentiating synonyms by m ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Word Sketch A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam KilgarriffKilgarriff, Adam; Rychlý, Pavel; Smrž, Pavel; Tugwell, David (2004) The Sketch Engine. Information Technology, 2004 and exploited within the Sketch Engine corpus management system. They are an extension of the general collocation concept used in corpus linguistics in that they group collocations according to particular grammatical relations (e.g. subject, object, modifier etc.). The collocation candidates in a word sketch are sorted either by their frequency or using a lexicographic association score like Dice, T-score or MI-score. Since the introduction, word sketches have been used by lexicographers to develop modern corpus-based dictionaries by major publishing houses including Oxford English Dictionary, Macmillan English Dictionary and comprising dozens of languages including En ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Masaryk University Masaryk University (MU) ( cs, Masarykova univerzita; la, Universitas Masarykiana Brunensis) is the second largest university in the Czech Republic, a member of the Compostela Group and the Utrecht Network. Founded in 1919 in Brno as the second Czech university (after Charles University established in 1348 and Palacký University existent in 1573–1860), it now consists of ten faculties and 35,115 students. It is named after Tomáš Garrigue Masaryk, the first president of an independent Czechoslovakia as well as the leader of the movement for a second Czech university. In 1960 the university was renamed ''Jan Evangelista Purkyně University'' after Jan Evangelista Purkyně, a Czech biologist. In 1990, following the Velvet Revolution it regained its original name. Since 1922, over 171,000 students have graduated from the university. History Masaryk University was founded on 28 January 1919 with four faculties: Law, Medicine, Science, and Arts. Tomáš Garrigue Masaryk, ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]