
The Moby Project is a collection of public-domain lexical resources created by Grady Ward. The resources were dedicated to the public domain, and are now mirrored at Project Gutenberg. It contains the largest free phonetic database, with 177,267 words and corresponding pronunciations.


Hyphenator

The Moby Hyphenator II contains hyphenations of 187,175 words and phrases (including 9,752 entries where no hyphenations are given, such as ''through'' and ''avoir''). The character encoding appears to be MacRoman, and hyphenation is indicated by a bullet (character value 165 decimal, or A5 hexadecimal). Some entries, however, combine actual hyphens with character 165. There is little to no documentation of the hyphenation choices made.
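
A minimal sketch of reading one Hyphenator entry under the conventions described above. It assumes the file really is Mac OS Roman (the encoding is only described as what it "appears to be"); the function name and the sample bytes are invented for illustration and are not taken from the Moby files.

```python
BULLET = "\u2022"  # Mac OS Roman byte 0xA5 (165 decimal) decodes to this character


def split_hyphenation(raw_entry: bytes) -> list[str]:
    """Decode one entry and split it at the hyphenation marks."""
    text = raw_entry.decode("mac_roman").rstrip("\r\n")
    return text.split(BULLET)


# Hypothetical entry: the bytes of "ex", "am", "ple" joined by byte 0xA5.
sample = b"ex\xa5am\xa5ple"
print(split_hyphenation(sample))
```

An entry with no hyphenation marks comes back as a single-element list, and any actual hyphens in an entry stay inside the segments, since only character 165 is treated as a break point.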


Languages

Moby Language II contains wordlists of five languages: French, German, Italian, Japanese, and Spanish. Some of the lists are contaminated; for example, the Japanese list contains English words such as ''abnormal'' as well as non-words. The lists are also sorted inconsistently: the French list is a straight alphabetical listing, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lower-cased words in alphabetical order. The list of Italian words, however, contains no capitalized words whatsoever. The lists do not use accented characters, so "etre" is how a user would look up the French word ''être'' ("to be").
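
Because the lists omit accented characters, a lookup has to strip accents from the query first. The sketch below shows one way to do that with Python's standard `unicodedata` module. The function name is hypothetical, and how the lists treat letters that are not a base letter plus accent (for example German ''ß'') is not documented here, so this covers only the simple case.

```python
import unicodedata


def lookup_key(word: str) -> str:
    """Remove combining accents so a word matches the unaccented lists."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(lookup_key("être"))
```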


Part-of-Speech

Moby Part-of-Speech contains 233,356 words fully described by part(s) of speech, listed in priority order. The format of the file is ''word\parts-of-speech''.
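
A minimal sketch of splitting one line of the ''word\parts-of-speech'' format at the backslash. The function name is hypothetical, and the sample value `XY` is a placeholder rather than a real part-of-speech designation from the file; the part-of-speech field is returned unparsed because its internal structure is not described here.

```python
def parse_pos_line(line: str) -> tuple[str, str]:
    """Split a 'word\\parts-of-speech' line into (word, parts_of_speech)."""
    word, sep, parts_of_speech = line.rstrip("\r\n").partition("\\")
    if not sep:
        raise ValueError(f"no backslash delimiter in line: {line!r}")
    return word, parts_of_speech


# "XY" is a placeholder, not a real code from the Moby file.
print(parse_pos_line("example\\XY"))
```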


Pronunciator

The Moby Pronunciator II contains 177,267 entries with corresponding pronunciations. Most of the entries describe a single word, but approximately 79,000 contain hyphenated or multiple-word phrases, names, or lexemes. The Project Gutenberg distribution also contains a copy of cmudict v0.3.

The file contains lines of the format ''word part-of-speech pronunciation''. Each line is ended with the ASCII carriage return character (CR, '\r', 0x0D, 13 in decimal). The ''word'' field can include apostrophes (e.g. ''isn't''), hyphens (e.g. ''able-bodied''), and multiple words separated by underscores. Non-English words are generally rendered, as stated in the documentation, without accents or other diacritical marks. However, in 36 entries some non-ASCII accented characters remain, represented using Mac OS Roman encoding.

The part-of-speech field is used to disambiguate 770 of the words which have differing pronunciations depending on their part of speech. For example, the verb and the adjective spelled ''close'' have different pronunciations. Each part of speech is assigned a code. Following this is the pronunciation, which uses several special symbols; the rest of the symbols represent IPA characters.

The pronunciations are generally consistent with a General American dialect of English that exhibits the father-bother merger, the hurry-furry merger and the lot-cloth split, but does not exhibit the cot-caught merger or the wine-whine merger. Each phoneme is represented by a sequence of one or more characters. Some of the sequences are delimited with a slash character "/", and one sequence is delimited by ''two'' slash characters at either end.

To this collection are added a number of extra sequences representing phonemes found in several other languages. These are used to encode the non-English words, phrases and names that are included in the database. The extent to which some of these extra sequences may exist due to encoding errors is not clear.
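
The line-ending and encoding conventions described for the Pronunciator file can be sketched as follows. This is an illustration under stated assumptions, not a full parser: it decodes the data as Mac OS Roman, splits it at carriage returns, and converts underscores in a word field back to spaces. The function names and sample bytes are hypothetical, and the sketch deliberately does not split a line into its fields, since the exact field delimiters are not given here.

```python
def read_entries(raw: bytes) -> list[str]:
    """Decode the file contents and split them into CR-terminated lines."""
    text = raw.decode("mac_roman")
    # Tolerate a stray line feed in case a copy was converted to CRLF.
    lines = (line.strip("\n") for line in text.split("\r"))
    return [line for line in lines if line]


def display_word(word_field: str) -> str:
    """Turn the underscores of a multi-word entry back into spaces."""
    return word_field.replace("_", " ")


# Placeholder data, not real entries from the Moby file.
sample = b"first_entry placeholder\rsecond placeholder\r"
print(read_entries(sample))
print(display_word("first_entry"))
```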


Shakespeare

Moby Shakespeare contains the complete unabridged works of Shakespeare. This specific resource is not available from Project Gutenberg, but a 1993 version is available on the web.


Thesaurus

The Moby Thesaurus II contains 30,260 root words, with 2,520,264 synonyms and related terms, an average of 83.3 per root word. Each line consists of a list of comma-separated values, with the first term being the root word, and all following words being related terms. Grady Ward placed this thesaurus in the public domain in 1996. It is also available as a Debian package, although the package has been discontinued starting with Bullseye.
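
The line layout of the thesaurus (root word first, related terms after it) and the quoted average can be checked with a short sketch. The function name and the sample line are illustrative assumptions, not content copied from the thesaurus file.

```python
def parse_thesaurus_line(line: str) -> tuple[str, list[str]]:
    """Split a comma-separated line into (root word, related terms)."""
    root, *related = line.rstrip("\r\n").split(",")
    return root, related


# Hypothetical line, used only to show the shape of the result.
root, related = parse_thesaurus_line("start,begin,commence,initiate")
print(root, related)

# Average number of related terms per root word, from the stated totals.
average = 2_520_264 / 30_260
print(round(average, 1))
```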


Words

Moby Words II is the largest wordlist in the world. The distribution consists of 16 files.




External links


* Moby Project homepage at the University of Sheffield (copy made by the Wayback Machine of the page as it was on 30 September 2017; "Last modified: October 24, 2000")
* working download site
* Project Gutenberg downloads
* Wiktionary:Appendix:Moby Thesaurus II