Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour (lexicographers, researchers in corpus linguistics, translators or language learners) to search large text collections according to complex and linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.


History of development

Sketch Engine is a product of Lexical Computing, a company founded in 2003 by the lexicographer and research scientist Adam Kilgarriff. He started a collaboration with Pavel Rychlý, a computer scientist working at the Natural Language Processing Centre, Masaryk University, and the developer of Manatee and Bonito (two major parts of the software suite). Kilgarriff also introduced the concept of word sketches. Since then, Sketch Engine has been commercial software; however, all the core features of Manatee and Bonito that were developed by 2003 (and extended since then) are freely available under the GPL license within the NoSketch Engine suite.


Features

A list of tools available in Sketch Engine:
* Word sketches – a one-page, automatically derived summary of a word's grammatical and collocational behaviour
* Word sketch difference – compares and contrasts two words by analysing their collocations
* Distributional thesaurus – automated thesaurus for finding words with similar meaning or appearing in the same or similar context
* Concordance search – finds occurrences of a word form, lemma, phrase, tag or complex structure
* Collocation search – word co-occurrence analysis displaying the most frequent words (for a search word) which can be regarded as collocation candidates
* Word lists – generates frequency lists which can be filtered with complex criteria
* n-grams – generates frequency lists of multi-word expressions
* Terminology / Keyword extraction (both monolingual and bilingual) – automatic extraction of key words and multi-word terms from texts (based on frequency count and linguistic criteria)
* Diachronic analysis (Trends) – detects words whose frequency of use changes over time (shows trending words)
* Corpus building and management – creates corpora from the Web or uploaded texts, including part-of-speech tagging and lemmatization, which can be used as data mining software
* Parallel corpus (bilingual) facilities – looking up translation examples (EUR-Lex corpus, Europarl corpus, OPUS corpus, etc.) or building a parallel corpus from the user's own aligned texts
* Text type analysis – statistics of metadata in the corpus
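To make two of the listed tools more concrete, the following is a minimal Python sketch of a concordance (keyword-in-context) search and an n-gram frequency list over a toy text. It is an illustration only and does not reflect how Sketch Engine implements these features; the function names `tokenize`, `concordance` and `ngram_counts` and the naive whitespace tokenizer are assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, strip punctuation."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t]


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context rows as (left context, keyword, right context)."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def ngram_counts(tokens, n=2):
    """Count every run of n adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


text = "The cat sat on the mat. The cat saw the dog."
tokens = tokenize(text)

# Print each hit with its context aligned around the keyword.
for left, kw, right in concordance(tokens, "cat", width=2):
    print(f"{left:>12} [{kw}] {right}")

# Most frequent bigram in the toy text.
print(ngram_counts(tokens, 2).most_common(1))
```

A real corpus manager would search over lemmas and part-of-speech tags as well as word forms, which this toy version does not attempt.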


Keywords and terminology extraction

Sketch Engine can perform automatic term extraction by identifying words typical of a particular corpus, document, or text. Single words and multi-word units can be extracted from monolingual or bilingual texts. The terminology extraction feature provides a list of relevant terms based on comparison with a large corpus of general language. This functionality is also available as a separate service called OneClick Terms with a dedicated interface.
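The idea of comparing a focus text with a general-language reference corpus can be sketched in Python as follows. The scoring formula used here, a ratio of per-million frequencies with an additive smoothing constant, is an assumption chosen for illustration; the text above does not specify which statistic Sketch Engine applies. The names `per_million` and `keyness` and the toy counts are hypothetical.

```python
from collections import Counter


def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}


def keyness(focus_counts, reference_counts, smoothing=1.0):
    """Rank focus-corpus words by smoothed relative-frequency ratio (assumed formula)."""
    focus = per_million(focus_counts)
    ref = per_million(reference_counts)
    scores = {
        # Smoothing avoids division by zero for words absent from the reference.
        w: (f + smoothing) / (ref.get(w, 0.0) + smoothing)
        for w, f in focus.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy counts: a small specialised text against a small "general" corpus.
focus = Counter({"corpus": 30, "the": 60, "lemma": 10})
reference = Counter({"the": 700, "corpus": 1, "house": 299})

ranked = keyness(focus, reference)
for word, score in ranked:
    print(f"{word:10s} {score:12.2f}")
```

Words that are frequent in the focus text but rare in the reference corpus rise to the top, while function words such as "the" score below 1 and sink to the bottom.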


SKELL

A free web service based on Sketch Engine and aimed at language learners and teachers is SKELL (formerly ''SkELL''). It exploits Sketch Engine's proprietary GDEX (Good Dictionary Examples) scoring function to provide authentic example sentences for specific target words. Results are drawn from a special corpus of high-quality texts covering everyday, standard, formal, and professional language and displayed as a concordance. SKELL also includes simplified versions of Sketch Engine's word sketch and thesaurus functions. It has been suggested that SKELL can be used, for instance, to help students understand the meaning and/or usage of a word or phrase; to help teachers wanting to use example sentences in a class; to discover and explore collocates; to create gap-fill exercises; and to teach various kinds of homonyms and polysemous words. SKELL was first presented in 2014, when only English was supported. Later, support was added for Russian, Czech, German, Italian and Estonian.


List of text corpora

Sketch Engine provides access to more than 700 text corpora. There are monolingual as well as multilingual corpora of different sizes (from one thousand words up to 60 billion words) and various sources (e.g. web, books, subtitles, legal documents). The list of corpora includes the British National Corpus, the Brown Corpus, the Cambridge Academic English Corpus and Cambridge Learner Corpus, the CHILDES corpora of child language, OpenSubtitles (a set of 60 parallel corpora), 24 multilingual corpora of EUR-Lex documents, the TenTen Corpus Family (multi-billion-word web corpora), and Trends corpora (monitor corpora with daily updates).


Architecture

Sketch Engine consists of three main components: an underlying database management system called Manatee, a web interface search front-end called Bonito, and a web interface for corpus building and management called Corpus Architect.


Manatee

Manatee is a database management system specifically devised for effective indexing of large text corpora. It is based on the idea of inverted indexing (keeping an index of all positions of a given word in the text). It has been used to index text corpora comprising tens of billions of words. Searching corpora indexed by Manatee is performed by formulating queries in the Corpus Query Language (CQL). Manatee is written in C++ and offers an API for a number of other programming languages including Python, Java, Perl and Ruby. Recently, it was rewritten into Go for faster processing of corpus queries.
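The inverted-index idea described above can be illustrated with a minimal Python sketch: each word maps to the list of positions at which it occurs, and a phrase query intersects those position lists. This is a toy model under simplifying assumptions (an in-memory dictionary, whitespace tokens), not Manatee's actual data structures or API; `build_index` and `find_phrase` are hypothetical names.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each token to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index


def find_phrase(index, phrase):
    """Return start positions where the words of phrase occur consecutively."""
    if not phrase:
        return []
    starts = set(index.get(phrase[0], []))
    for offset, word in enumerate(phrase[1:], start=1):
        # A word at position p supports a phrase starting at p - offset.
        starts &= {p - offset for p in index.get(word, [])}
    return sorted(starts)


tokens = "the cat sat on the mat and the cat slept".split()
index = build_index(tokens)

print(index["cat"])                        # positions of "cat"
print(find_phrase(index, ["the", "cat"]))  # starts of the phrase "the cat"
```

Because a lookup touches only the position lists of the queried words, the cost of a search depends on how often those words occur rather than on the size of the whole corpus, which is what makes the approach suitable for very large text collections.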


Bonito

Bonito is a web interface for Manatee providing access to corpus search. In the client–server model, Manatee is the server and Bonito plays the client part. It is written in Python.


Corpus Architect

Corpus Architect is a web interface providing corpus building and management features. It is also written in Python.


Applications

Sketch Engine has been used by major British and other publishing houses for producing dictionaries, including Macmillan English Dictionary, Dictionnaires Le Robert, Oxford University Press and Shogakukan. Four of the United Kingdom's five biggest dictionary publishers use Sketch Engine.



External links


Sketch Engine website

List of corpora available in Sketch Engine

OneClick terms – online term extractor with term extraction technology from Sketch Engine

SKELL – Sketch Engine for language learning