German Reference Corpus

The German Reference Corpus (original: Deutsches Referenzkorpus; short: DeReKo) is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Leibniz Institute for the German Language (IDS) in Mannheim, Germany. The corpus archive is continuously updated and expanded. As of August 2010, it comprised more than 4.0 billion word tokens and constituted the largest linguistically motivated collection of contemporary German texts. It is one of the major resources worldwide for the study of written German.


Alternative names

The German Reference Corpus is often referred to by other names, such as ''Mannheim corpora'', ''IDS corpora'' and ''COSMAS corpora'', and by the corresponding German translations. The name ''Deutsches Referenzkorpus (DeReKo)'' was originally used for a specific portion of the current archive, which was collected between 1999 and 2002 by a number of institutions in a joint project of the same name. Since 2004, ''Deutsches Referenzkorpus (DeReKo)'' has been the official name of the full corpus archive.


Conception and composition

The German Reference Corpus comprises fictional and academic texts, a large number of newspaper texts and several other text types. The texts cover the period from around 1950 to the present. In contrast to other well-known corpora and corpus archives (such as the British National Corpus), the German Reference Corpus is explicitly not designed as a ''balanced corpus'': the distribution of DeReKo texts across time or text types does not follow predefined percentages. This design reflects the fact that whether a given corpus constitutes a balanced or even representative language sample can only be assessed with respect to a specific language domain (i.e., the statistical population). Because different linguistic investigations generally target different language domains, the declared purpose of the German Reference Corpus is to serve as a versatile superordinate sample, or ''primordial sample'' (German: ''Ur-Stichprobe''), of contemporary written German, from which corpus users may draw a specialised subsample (a so-called ''virtual corpus'') to represent the language domain they wish to investigate.
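
The idea of drawing a ''virtual corpus'' from a superordinate sample can be illustrated with a minimal Python sketch. It assumes that each text carries simple metadata (year and text type). The `Text` class, its field names, the `virtual_corpus` function and all entries in the toy archive are hypothetical and serve illustration only; they do not reflect DeReKo's actual metadata schema or any interface of the archive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    # Hypothetical metadata fields; not DeReKo's actual schema.
    title: str
    year: int
    text_type: str
    tokens: int


def virtual_corpus(archive, text_types, first_year, last_year):
    """Select the texts whose metadata fall inside the requested language domain."""
    return [
        t for t in archive
        if t.text_type in text_types and first_year <= t.year <= last_year
    ]


# Toy archive with invented entries; deliberately not balanced across text types.
archive = [
    Text("news-a", 1995, "newspaper", 1200),
    Text("news-b", 2003, "newspaper", 900),
    Text("news-c", 2008, "newspaper", 1500),
    Text("novel-a", 1987, "fiction", 80000),
    Text("paper-a", 2005, "academic", 6000),
]

# Subsample meant to represent newspaper language from 2000 to 2010.
subsample = virtual_corpus(archive, {"newspaper"}, 2000, 2010)
total_tokens = sum(t.tokens for t in subsample)
```

In this sketch the archive itself makes no claim to balance; whether the subsample is representative depends only on how well the selection criteria match the language domain under investigation.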


Access

Due to copyright and licence restrictions, the DeReKo archive may be neither copied nor offered for download. It can be queried and analyzed free of charge via the system ''COSMAS II''; end users are required to register by name and to agree to use the corpus data exclusively for non-commercial, academic purposes. ''COSMAS II'' enables users to compile from DeReKo a ''virtual corpus'' suited to their specific research questions.


See also

* Text corpus
* Corpus linguistics
* American National Corpus (ANC)
* Bank of English (BoE)
* British National Corpus (BNC)
* Corpus of Contemporary American English (COCA)
* Oxford English Corpus (OEC)


References

* Kupietz, M., Belica, C., Keibel, H. & Witt, A. (2010). The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the 7th conference on International Language Resources and Evaluation (LREC 2010) (pp. 1848–1854). Valletta, Malta: European Language Resources Association (ELRA).
* Kupietz, M. & Keibel, H. (2009). The Mannheim German Reference Corpus (DeReKo) as a basis for empirical linguistic research. In: Working Papers in Corpus-based Linguistics and Language Education, No. 3 (pp. 53–59). Tokyo: Tokyo University of Foreign Studies (TUFS).


External links


* DeReKo website (German)
* COSMAS II, a free DeReKo interface (German website)