Wiktionary is a multilingual, web-based project to create a free
content dictionary of all words in all languages. It is
collaboratively edited via a wiki, and its name is a portmanteau of
the words wiki and dictionary. It is available in 171 languages and
in Simple English. Like its sister project Wikipedia, Wiktionary is
run by the Wikimedia Foundation, and is written collaboratively by
volunteers, dubbed "Wiktionarians". Its wiki software, MediaWiki,
allows almost anyone with access to the website to create and edit
entries.
Because Wiktionary is not limited by print space considerations, most
of Wiktionary's language editions provide definitions and translations
of words from many languages, and some editions offer additional
information typically found in thesauri and lexicons. The English
Wiktionary includes a Wikisaurus (thesaurus) of synonyms of various
words.
Wiktionary data are frequently used in various natural language
processing tasks.
* 1 History and development
* 1.1 Logos
* 2 Accuracy
* 3 Critical reception
* 4 Wiktionary data in natural language processing
* 5 See also
* 6 Notes
* 7 References
* 8 External links
HISTORY AND DEVELOPMENT
Wiktionary was brought online on December 12, 2002, following a
proposal by Daniel Alston and an idea by Larry Sanger, co-founder of
Wikipedia. On March 28, 2004, the first non-English Wiktionaries were
initiated in French and Polish. Wiktionaries in numerous other
languages have since been started.
Wiktionary was hosted on a temporary domain name
(wiktionary.wikipedia.org) until May 1, 2004, when it switched to the
current domain name. As of November 2016, Wiktionary features over
25.9 million entries across its editions. The largest of the language
editions is the English Wiktionary, with over 5 million entries,
followed by the Malagasy Wiktionary with over 3.9 million
bot-generated entries and the French Wiktionary with over 3 million.
Forty-one Wiktionary language editions now contain over 100,000
entries each.
The use of bots to generate large numbers of articles is visible as
"growth spurts" in the graph of article counts at the eight largest
Wiktionary editions. (Data as of December …)
Most of the entries and many of the definitions at the project's
largest language editions were created by bots that found creative
ways to generate entries or (rarely) automatically imported thousands
of entries from previously published dictionaries. Seven of the 18
bots registered at the English Wiktionary created 163,000 of the
entries there.
Another of these bots, "ThirdPersBot," was responsible for the
addition of a number of third-person conjugations that would not have
received their own entries in standard dictionaries; for instance, it
defined "smoulders" as the "third-person singular simple present form
of smoulder." Of the 648,970 definitions the English Wiktionary
provides for 501,171 English words, 217,850 are "form of" definitions
of this kind. This means its coverage of English is slightly smaller
than that of major monolingual print dictionaries. The Oxford English
Dictionary, for instance, has 615,000 headwords, while
Merriam-Webster's Third New International Dictionary of the English
Language, Unabridged has 475,000 entries (with many additional
embedded headwords). Detailed statistics exist to show how many
entries of various kinds exist.
The English Wiktionary does not rely on bots to the extent that some
other editions do. The French and Vietnamese Wiktionaries, for
example, imported large sections of the Free Vietnamese Dictionary
Project (FVDP), which provides free content bilingual dictionaries to
and from Vietnamese. These imported entries make up virtually all of
the Vietnamese edition's contents. Almost all non-Malagasy-language
entries of the Malagasy Wiktionary were copied by bot from other
Wiktionaries. Like the English edition, the French Wiktionary has
imported the approximately 20,000 entries from the Unihan database of
Chinese, Japanese, and Korean characters. The French Wiktionary grew
rapidly in 2006 thanks in large part to bots copying many entries from
old, freely licensed dictionaries, such as the eighth edition of the
Dictionnaire de l'Académie française (1935, around 35,000 words),
and using bots to add words from other Wiktionary editions with French
translations. The Russian edition grew by nearly 80,000 entries as
"LXbot" added boilerplate entries (with headings, but without
definitions) for words in English and German.
In 2017, the English part of en.wiktionary had over 500,000 gloss
definitions and over 900,000 definitions (including different forms).
LOGOS
Wiktionary has historically lacked a uniform logo across its numerous
language editions. Some editions use logos that depict a dictionary
entry about the term "Wiktionary", based on the previous English
Wiktionary logo, which was designed by Brion Vibber, a MediaWiki
developer. Because a purely textual logo must vary considerably from
language to language, a four-phase contest to adopt a uniform logo was
held at the Wikimedia Meta-Wiki from September to October 2006. Some
communities adopted the winning entry by "Smurrayinchester", a 3×3
grid of wooden tiles, each bearing a character from a different
writing system. However, the poll did not see as much participation
from the Wiktionary community as some community members had hoped, and
a number of the larger wikis ultimately kept their textual logos.
In April 2009, the issue was resurrected with a new contest. This
time, a depiction by "AAEngelman" of an open hardbound dictionary won
a head-to-head vote against the 2006 logo, but the process to refine
and adopt the new logo then stalled. In the following years, some
wikis replaced their textual logos with one of the two newer logos. In
2012, 55 wikis that had been using the English Wiktionary logo
received localized versions of the 2006 design by "Smurrayinchester".
In July 2016, the English Wiktionary adopted a variant of this logo.
As of 4 July 2016, 135 wikis, representing 61% of Wiktionary's
entries, use a logo based on the 2006 design by "Smurrayinchester", 33
wikis (36%) use a textual logo, and three wikis (3%) use the 2009
design by "AAEngelman".
ACCURACY
To ensure accuracy, the English Wiktionary has a policy requiring
that terms be attested. Terms in major languages such as English and
Chinese must be verified by:
* clearly widespread use, or
* use in permanently recorded media, conveying meaning, in at least
three independent instances spanning at least a year.
For smaller languages such as Creek and extinct languages such as
Latin, one use in a permanently recorded medium or one mention in a
reference work is sufficient verification.
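The attestation rule above can be read as a small decision procedure. The following Python sketch is one possible encoding, not Wiktionary's own tooling: the function name, its parameters, and the approximation of "a year" as 365 days are assumptions made for illustration, and the caller is assumed to have already checked that each use is independent and conveys meaning.

```python
from datetime import date

# Assumption: "spanning at least a year" is approximated as 365 days.
ONE_YEAR_DAYS = 365


def is_attested(use_dates, *, major_language, widespread_use=False, mentions=0):
    """Return True if a term meets the attestation rule described above.

    use_dates: dates of independent uses in permanently recorded media
               (independence is assumed to be checked by the caller).
    mentions:  number of mentions in reference works; only relevant for
               smaller or extinct languages.
    """
    if major_language:
        if widespread_use:
            return True
        if len(use_dates) < 3:
            return False
        # At least three uses, with the earliest and latest a year apart.
        return (max(use_dates) - min(use_dates)).days >= ONE_YEAR_DAYS
    # Smaller or extinct language: one use or one mention is enough.
    return len(use_dates) >= 1 or mentions >= 1


uses = [date(2001, 1, 1), date(2001, 6, 1), date(2002, 3, 1)]
print(is_attested(uses, major_language=True))
```

Three uses that all fall within a few months would fail the check for a major language, while a single mention in a reference work would pass for a language such as Creek.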
CRITICAL RECEPTION
Critical reception of Wiktionary has been mixed. In 2006 Jill Lepore
wrote in the article "Noah's Ark" for The New Yorker:
There's no show of hands at Wiktionary. There's not even an editorial
staff. "Be your own lexicographer!", might be Wiktionary's motto. Who
needs experts? Why pay good money for a dictionary written by
lexicographers when we could cobble one together ourselves?
Wiktionary isn't so much republican or democratic as Maoist. And it's
only as good as the copyright-expired books from which it pilfers.
Keir Graff's review for Booklist was less critical:
Is there a place for Wiktionary? Undoubtedly. The industry and
enthusiasm of its many creators are proof that there's a market. And
it's wonderful to have another strong source to use when searching the
odd terms that pop up in today's fast-changing world and the online
environment. But as with so many Web sources (including this column),
it's best used by sophisticated users in conjunction with more
References in other publications are fleeting and part of larger
discussions of Wikipedia, not progressing beyond a definition,
although David Brooks in The Nashua Telegraph described it as "wild
and woolly". One of the impediments to independent coverage of
Wiktionary is the continuing confusion that it is merely an extension
of Wikipedia. In 2005, PC Magazine rated Wiktionary as one of the
Internet's "Top 101 Web Sites", although little information was given
about the site.
A measurement of the correctness of the inflections for a subset of
the Polish words in the English Wiktionary showed that this
grammatical data is very stable: only 131 out of 4,748 Polish words
have had their inflection data corrected.
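To put the two counts in proportion, the short Python calculation below derives the share of corrected words. It uses only the figures quoted above; the variable names are chosen here for illustration.

```python
corrected = 131   # Polish words whose inflection data were corrected
total = 4748      # Polish words examined

correction_rate = corrected / total
print(f"{correction_rate:.2%} corrected, {1 - correction_rate:.2%} unchanged")
# prints: 2.76% corrected, 97.24% unchanged
```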
WIKTIONARY DATA IN NATURAL LANGUAGE PROCESSING
Wiktionary has semi-structured data. Wiktionary lexicographic data
can be converted to a machine-readable format in order to be used in
natural language processing tasks.
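As a rough illustration of such a conversion, the Python sketch below turns a simplified entry into a nested dictionary. It assumes the heading layout of the English Wiktionary (a level-2 heading per language, part-of-speech headings below it, definitions on lines starting with "# "). The sample text, the POS_NAMES set and the parse_entry function are made up for this example, and real entries contain templates and nested sections that this sketch ignores.

```python
import re

# Matches wiki headings such as "==English==" or "===Verb===".
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")

# Assumption: a small, illustrative set of part-of-speech heading names.
POS_NAMES = {"Noun", "Verb", "Adjective", "Adverb"}

# Made-up, simplified sample entry; not copied from a real page.
SAMPLE = """\
==English==
===Verb===
# To burn slowly without flame.
# To show suppressed emotion.
===Noun===
# A smouldering fire.
==French==
===Noun===
# An example definition.
"""


def parse_entry(wikitext):
    """Return {language: {part_of_speech: [definitions]}}."""
    entry = {}
    language = pos = None
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if level == 2:
                # A level-2 heading starts a new language section.
                language, pos = title, None
                entry.setdefault(language, {})
            elif language is not None and title in POS_NAMES:
                pos = title
                entry[language].setdefault(pos, [])
            else:
                # Any other heading (e.g. Etymology) ends the POS block.
                pos = None
            continue
        # Only plain "# " lines are definitions; "#:" and "#*" lines
        # (examples and quotations) are skipped by this sketch.
        if line.startswith("# ") and language and pos:
            entry[language][pos].append(line[2:].strip())
    return entry


parsed = parse_entry(SAMPLE)
print(parsed["English"]["Verb"])
```

A nested dictionary is used here only because it is easy to inspect; the parsers described in this section store the extracted data in their own formats.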
Wiktionary data mining is a complex task. There are the following
difficulties: (1) the constant and frequent changes to data and
schemata, (2) the heterogeneity in Wiktionary language edition
schemata and (3) the human-centric nature of a wiki.
There are several parsers for different Wiktionary language editions:
* DBpedia Wiktionary: a subproject of DBpedia; the data are
extracted from the English, French, German and Russian Wiktionaries.
The data include language, part of speech, definitions, semantic
relations and translations. A declarative description of the page
schema, regular expressions and a finite state transducer are used
in order to extract information.
* JWKTL (Java Wiktionary Library): provides access to English
Wiktionary and German Wiktionary dumps via a Java Wiktionary API.
The data include language, part of speech, definitions, quotations,
semantic relations, etymologies and translations. JWKTL is available
for non-commercial use.
* wikokit: the parser of English Wiktionary and Russian Wiktionary.
The parsed data include language, part of speech, definitions,
quotations, semantic relations and translations. This is
multi-licensed open-source software.
* Etymological entries have been parsed in the Etymological WordNet.
Various natural language processing tasks have been solved with the
help of Wiktionary data:
* Rule-based machine translation between Dutch and Afrikaans; data
from the English and Dutch Wiktionaries were used with the Apertium
machine translation platform.
* Construction of a machine-readable dictionary by the parser NULEX,
which integrates open linguistic resources: English Wiktionary,
WordNet, and VerbNet. The parser NULEX scrapes English Wiktionary
for tense information (verbs), plural form and part of speech (nouns).
* Speech recognition and synthesis, where Wiktionary was used to
automatically create pronunciation dictionaries. Word-pronunciation
pairs were retrieved from 6 Wiktionary language editions (Czech,
English, French, Spanish, Polish, and German). Pronunciations are in
terms of the International Phonetic Alphabet. The ASR system based
on English Wiktionary has the highest word error rate, where each
third phoneme has to be changed.
* Ontology engineering and semantic network construction.
* Ontology matching.
* Text simplification. Medero & Ostendorf assessed vocabulary
difficulty (reading level detection) with the help of Wiktionary.
Properties of words extracted from Wiktionary entries (definition
length and POS, sense, and translation counts) were investigated.
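The pronunciation-dictionary application listed above can be sketched in Python. This is a minimal illustration and not the method of the cited study: it assumes that pronunciations are stored in templates of the form {{IPA|en|/.../}}, and the sample pages and function names are invented for the example. Following the note on IPA extraction, only the first notation found on a page is kept.

```python
import re

# Assumption: pronunciations appear in templates like {{IPA|en|/wɜːd/}}.
IPA_TEMPLATE = re.compile(r"\{\{IPA\|([^{}]*)\}\}")


def looks_like_ipa(arg):
    """True for arguments written as /phonemic/ or [phonetic] notation."""
    return len(arg) >= 2 and (
        (arg.startswith("/") and arg.endswith("/"))
        or (arg.startswith("[") and arg.endswith("]"))
    )


def first_pronunciation(wikitext):
    """Return the first IPA notation on a page, or None if there is none."""
    for match in IPA_TEMPLATE.finditer(wikitext):
        for arg in match.group(1).split("|"):
            arg = arg.strip()
            if looks_like_ipa(arg):
                return arg
    return None


def build_pronunciation_dictionary(pages):
    """Map each word to its first pronunciation, skipping pages without one."""
    dictionary = {}
    for word, wikitext in pages.items():
        pronunciation = first_pronunciation(wikitext)
        if pronunciation is not None:
            dictionary[word] = pronunciation
    return dictionary


# Invented sample pages for illustration only.
pages = {
    "word": "===Pronunciation===\n* {{IPA|en|/wɜːd/|/wɝd/}}\n",
    "mot": "===Prononciation===\n* {{IPA|fr|/mo/}}\n",
    "xyz": "A page without any pronunciation template.",
}
print(build_pronunciation_dictionary(pages))
```

Keeping only the first notation is simple but discards pronunciation variants, which is one reason a dictionary built this way may contain entries that a speech system later has to correct.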
NOTES
* Mailing list archive discussion announcing the opening of the
Wiktionary project. Retrieved May 3, 2011.
* Mailing list archive discussion from Larry Sanger giving the idea
on Wiktionary. Retrieved May 3, 2011.
* Wiktionary's current URL is www.wiktionary.org.
* Wiktionary total article counts are here. Detailed statistics
by word type are available here.
* The user list at the English Wiktionary identifies accounts
that have been given "bot status".
* Hồ Ngọc Đức, Free Vietnamese Dictionary Project. Details
at the Vietnamese Wiktionary.
* "Wiktionary Logo", English Wiktionary.
* "Wiktionary/logo", Meta-Wiki, Wikimedia Foundation.
* "Wiktionary/logo/refresh/voting", Meta-Wiki, Wikimedia Foundation.
* 56 Wiktionaries got a localised logo.
* m:Wiktionary/logo#Logo use statistics.
* The full article is not available on-line.
* David Brooks, "Online, interactive encyclopedia not just for
geeks anymore, because everyone seems to need it now, more than ever!"
The Nashua Telegraph (August 4, 2004).
* In this citation, the author refers to Wiktionary as part of
the Wikipedia site: Adapted from an article by Naomi DeTullio (2006).
"Wikis for Librarians" (PDF). NETLS News #142. Northeast Texas Library
System. p. 15. Archived from the original (PDF newsletter) on
2007-06-05. Retrieved April 21, 2007.
* E.g. compare the entry structure and formatting rules in
English Wiktionary and Russian Wiktionary.
* Quotations are extracted only from Russian Wiktionary.
* If there are several IPA notations on a Wiktionary page,
either for different languages or for pronunciation variants, then the
first pronunciation was extracted.
* http://conceptnet5.media.mit.edu
* The source code and the results of POS-tagging are available at
* "Wiktionary.org Site Info". Alexa Internet. Retrieved
* https://www.wiktionary.org/
* TheDaveBot, TheCheatBot, Websterbot, PastBot, NanshuBot
* Detailed statistics as of 1 July 2013
* LXbot. Archived May 24, 2008, at the Wayback Machine.
* Wiktionary statistics
* phab:T139255
* "Wiktionary:Criteria for inclusion". Wiktionary. Retrieved 13
* Lepore 2006.
* PC Mag 2005.
* Kurmas 2010.
* Meyer & Gurevych 2012, p. 140.
* Zesch, Müller & Gurevych 2008, p. 4, Figure 1.
* Meyer & Gurevych 2010, p. 40.
* Krizhanovsky, Transformation 2010, p. 1.
* Hellmann & Auer 2013, p. 302, p. 16 in PDF.
* Hellmann, Brekle & Auer 2012, p. 3, Table 1.
* Hellmann, Brekle & Auer 2012, pp. 8–9.
* Hellmann, Brekle & Auer 2012, p. 10.
* Hellmann, Brekle & Auer 2012, p. 11.
* JWKTL
* Zesch, Müller & Gurevych 2008.
* wikokit
* Krizhanovsky, Transformation 2010.
* Smirnov 2012.
* Krizhanovsky, Comparison 2010.
* Etymological WordNet
* Krizhanovsky 2012, p. 14.
* Otte & Tyers 2011.
* McFate & Forbus 2011.
* Schlippe, Ochs & Schultz 2012.
* Schlippe, Ochs & Schultz 2012, p. 4802.
* Schlippe, Ochs & Schultz 2012, p. 4804.
* Meyer & Gurevych 2012.
* Lin & Krizhanovsky 2011.
* Medero & Ostendorf 2009.
* Li, Graça & Taskar 2012.
* Chesley, Paula; Vincent, Bruce; Xu, Li; Srihari, Rohini K. (2006).
"Using verbs and adjectives to automatically classify blog sentiment"
(PDF). Training. 580: 233–235. Retrieved May 9, 2013.
REFERENCES
* Hellmann, Sebastian; Brekle, Jonas; Auer, Sören (2012).
"Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a
Linguistic Data Cloud" (PDF). Proc. Joint Int. Semantic Technology
Conference (JIST). Nara, Japan.
* Hellmann, S.; Auer, S. (2013). "Towards Web-Scale Collaborative
Knowledge Extraction" (PDF). In Gurevych, Iryna; Kim, Jungi. The
People's Web Meets NLP. Theory and Applications of Natural Language
Processing. Springer-Verlag. pp. 287–313. ISBN 978-3-642-35084-9.
* Krizhanovsky, Andrew (2010). "Transformation of Wiktionary entry
structure into tables and relations in a relational database schema".
arXiv:1011.1368.
* Krizhanovsky, Andrew (2010). "The comparison of Wiktionary
thesauri transformed into the machine-readable format". arXiv
* Krizhanovsky, Andrew (2012). "A quantitative analysis of the
English lexicon in Wiktionaries and WordNet" (PDF). International
Journal of Intelligent Information Technologies (IJIIT). 8 (4):
13–22. doi :10.4018/jiit.2012100102 . Retrieved May 9, 2013.
* Kurmas, Zachary (July 2010). Zawilinski: a library for studying
grammar in Wiktionary. Proceedings of the 6th International Symposium
on Wikis and Open Collaboration. Gdansk, Poland. Retrieved 2011-07-29.
* Li, Shen; Graça, Joao V.; Taskar, Ben (2012). "Wiki-ly supervised
part-of-speech tagging" (PDF). Proceedings of the 2012 Joint
Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning. Jeju Island, Korea:
Association for Computational Linguistics. pp. 1389–1398.
* Lepore, Jill (November 6, 2006). "Noah's Ark" (Abstract). The New
Yorker. Retrieved April 21, 2007.
* Lin, Feiyu; Krizhanovsky, Andrew (2011). "Multilingual ontology
matching based on Wiktionary data accessible via SPARQL endpoint".
Proc. of the 13th Russian Conference on Digital Libraries RCDL'2011.
Voronezh, Russia. pp. 19–26. arXiv:1109.0732.
* McFate, Clifton J.; Forbus, Kenneth D. (2011). "NULEX: an
open-license broad coverage lexicon" (PDF). The 49th Annual Meeting of
the Association for Computational Linguistics: Human Language
Technologies, Proceedings of the Conference. Portland, Oregon, USA:
The Association for Computational Linguistics. pp. 363–367. ISBN
* Medero, Julie; Ostendorf, Mari (2009). "Analysis of vocabulary
difficulty using wiktionary" (PDF). Proc. SLaTE Workshop.
* Meyer, C. M.; Gurevych, I. (2010). "Worth its Weight in Gold or
Yet Another Resource - A Comparative Study of Wiktionary,
OpenThesaurus and GermaNet" (PDF). Proc. 11th International Conference
on Intelligent Text Processing and Computational Linguistics, Iasi,
Romania. pp. 38–49.
* Meyer, C. M.; Gurevych, I. (2012). "OntoWiktionary:
Constructing an Ontology from the Collaborative Online Dictionary
Wiktionary" (PDF). In Pazienza, M. T.; Stellato, A. Semi-Automatic
Ontology Development: Processes and Resources. IGI Global. pp.
131–161. ISBN 978-1-4666-0188-8.
* Otte, Pim; Tyers, F. M. (2011). "Rapid rule-based machine
translation between Dutch and Afrikaans" (PDF). In Forcada, Mikel L.;
Depraetere, Heidi; Vandeghinste, Vincent. 16th Annual Conference of
the European Association of Machine Translation, EAMT11. Leuven,
Belgium. pp. 153–160.
* Schlippe, Tim; Ochs, Sebastian; Schultz, Tanja (2012).
"Grapheme-to-phoneme model generation for Indo-European languages"
(PDF). Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan.
* Smirnov A., Levashova T., Karpov A., Kipyatkova I., Ronzhin A.,
Krizhanovsky A., Krizhanovsky N.. Analysis of the quotation corpus of
the Russian Wiktionary. Research in Computing Science. 2012
* Zesch, Torsten; Müller, Christof; Gurevych, Iryna (2008).
"Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary"
(PDF). Proceedings of the Conference on Language Resources and
Evaluation (LREC). Marrakech, Morocco.
* "Wiktionary". Top 101 Web Sites. PC Magazine. April 6, 2005.
Retrieved December 16, 2005.
* List of all Wiktionary