Common Voice is a
crowdsourcing
Crowdsourcing involves a large group of dispersed participants contributing or producing goods or services—including ideas, votes, micro-tasks, and finances—for payment or as volunteers. Contemporary crowdsourcing often involves digit ...
project started by
Mozilla
Mozilla is a free software community founded in 1998 by members of Netscape. The Mozilla community uses, develops, publishes and supports Mozilla products, thereby promoting free software and open standards. The community is supported institution ...
to create a free and open
speech corpus
A speech corpus (or spoken corpus) is a database of speech audio files and text Transcription (linguistics), transcriptions.
In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with ...
. The project is supported by
volunteer
Volunteering is an elective and freely chosen act of an individual or group giving their time and labor, often for community service. Many volunteers are specifically trained in the areas they work, such as medicine, education, or emergency ...
s who record sample sentences with a
microphone
A microphone, colloquially called a mic (), or mike, is a transducer that converts sound into an electrical signal. Microphones are used in many applications such as telephones, hearing aids, public address systems for concert halls and publi ...
and review recordings of other users. The transcribed sentences are collected in a voice database available under the
public domain
The public domain (PD) consists of all the creative work to which no Exclusive exclusive intellectual property rights apply. Those rights may have expired, been forfeited, expressly Waiver, waived, or may be inapplicable. Because no one holds ...
license
CC0. This license ensures that
developers can use the database for
voice-to-text and
text-to-voice applications without restrictions or costs.
Aims
Common Voice aims to provide diverse voice samples. According to Mozilla's
Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.
Voice database
The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences.
In February 2019, the first batch of languages was released for use. This included 18 languages:
English,
French,
German
German(s) may refer to:
* Germany, the country of the Germans and German things
**Germania (Roman era)
* Germans, citizens of Germany, people of German ancestry, or native speakers of the German language
** For citizenship in Germany, see also Ge ...
and
Mandarin Chinese
Mandarin ( ; zh, s=, t=, p=Guānhuà, l=Mandarin (bureaucrat), officials' speech) is the largest branch of the Sinitic languages. Mandarin varieties are spoken by 70 percent of all Chinese speakers over a large geographical area that stretch ...
, but also less prevalent languages as
Welsh and
Kabyle. In total, this included almost 1,400 hours of recorded voice data from more than 42,000 contributors.
As of July 2020 the database has amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which has been verified by volunteers.
In May 2021, following the work to add
Kinyarwanda
Kinyarwanda, Rwandan or Rwanda, officially known as Ikinyarwanda, is a Bantu language and the national language of Rwanda. It is a dialect of the Rwanda-Rundi language that is also spoken in adjacent parts of the Democratic Republic of the ...
, they received a grant to add
Kiswahili
Swahili, also known as as it is referred to in the Swahili language, is a Bantu language originally spoken by the Swahili people, who are found primarily in Tanzania, Kenya, and Mozambique (along the East African coast and adjacent littoral i ...
.
At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project that aims to make machines understand the
Bangla language
Bengali, also known by its endonym Bangla (, , ), is an Indo-Aryan language belonging to the Indo-Iranian branch of the Indo-European language family. It is native to the Bengal region (Bangladesh, India's West Bengal and Tripura) of South ...
. 2000 hours of voice was collected with aim for higher than 10,000 hours.
In September 2022, it was announced that the
Twi language
Twi (; ) is the common name of the Akan literary language of Asante and Akuapem.
Effectively, it is a synonym for 'Akan' that is not used by the Fante people. It is not a linguistic grouping, but more of a common name used by inland Akans a ...
of Ghana was the 100th language to be added to the Mozilla Common Voice database.
, Mozilla Common Voice officially collects voice data for the following languages:
*
Abkhaz
*
Arabic
Arabic (, , or , ) is a Central Semitic languages, Central Semitic language of the Afroasiatic languages, Afroasiatic language family spoken primarily in the Arab world. The International Organization for Standardization (ISO) assigns lang ...
*
Armenian
Armenian may refer to:
* Something of, from, or related to Armenia, a country in the South Caucasus region of Eurasia
* Armenians, the national people of Armenia, or people of Armenian descent
** Armenian diaspora, Armenian communities around the ...
*
Assamese
*
Asturian
*
Bashkir
*
Basaa
*
Basque
Basque may refer to:
* Basques, an ethnic group of Spain and France
* Basque language, their language
Places
* Basque Country (greater region), the homeland of the Basque people with parts in both Spain and France
* Basque Country (autonomous co ...
*
Belarusian
*
Bengali
Bengali or Bengalee, or Bengalese may refer to:
*something of, from, or related to Bengal, a large region in South Asia
* Bengalis, an ethnic and linguistic group of the region
* Bengali language, the language they speak
** Bengali alphabet, the w ...
*
Breton
Breton most often refers to:
*anything associated with Brittany, and generally
**Breton people
**Breton language, a Southwestern Brittonic Celtic language of the Indo-European language family, spoken in Brittany
** Breton (horse), a breed
**Gale ...
*
Bulgarian
*
Catalan
* Chinese (
Cantonese
Cantonese is the traditional prestige variety of Yue Chinese, a Sinitic language belonging to the Sino-Tibetan language family. It originated in the city of Guangzhou (formerly known as Canton) and its surrounding Pearl River Delta. While th ...
and
Mandarin
Mandarin or The Mandarin may refer to:
Language
* Mandarin Chinese, branch of Chinese originally spoken in northern parts of the country
** Standard Chinese or Modern Standard Mandarin, the official language of China
** Taiwanese Mandarin, Stand ...
varieties)
*
Chuvash
*
Czech
Czech may refer to:
* Anything from or related to the Czech Republic, a country in Europe
** Czech language
** Czechs, the people of the area
** Czech culture
** Czech cuisine
* One of three mythical brothers, Lech, Czech, and Rus
*Czech (surnam ...
*
Danish
*
Dhivehi
*
Dutch
*
English
*
Esperanto
Esperanto (, ) is the world's most widely spoken Constructed language, constructed international auxiliary language. Created by L. L. Zamenhof in 1887 to be 'the International Language' (), it is intended to be a universal second language for ...
*
Erzya
*
Finnish
*
French
*
Frisian
*
Galician
*
Georgian
*
German
German(s) may refer to:
* Germany, the country of the Germans and German things
**Germania (Roman era)
* Germans, citizens of Germany, people of German ancestry, or native speakers of the German language
** For citizenship in Germany, see also Ge ...
*
Greek
Greek may refer to:
Anything of, from, or related to Greece, a country in Southern Europe:
*Greeks, an ethnic group
*Greek language, a branch of the Indo-European language family
**Proto-Greek language, the assumed last common ancestor of all kno ...
*
Guaraní Guarani, Guaraní or Guarany may refer to
Ethnography
* Guaraní people, an indigenous people from South America's interior (Brazil, Argentina, Paraguay and Bolivia)
* Guarani language, or Paraguayan Guarani, an official language of Paraguay
* G ...
*
Hausa
Hausa may refer to:
* Hausa people, an ethnic group of West Africa
* Hausa language, spoken in West Africa
* Hausa Kingdoms, a historical collection of Hausa city-states
* Hausa (horse) or Dongola horse, an African breed of riding horse
See also
...
*
Hakha Chin
*
Hindi
Modern Standard Hindi (, ), commonly referred to as Hindi, is the Standard language, standardised variety of the Hindustani language written in the Devanagari script. It is an official language of India, official language of the Government ...
*
Hungarian
*
Indonesian
*
Interlingua
Interlingua (, ) is an international auxiliary language (IAL) developed between 1937 and 1951 by the American International Auxiliary Language Association (IALA). It is a constructed language of the "naturalistic" variety, whose vocabulary, ...
*
Irish
*
Italian
Italian(s) may refer to:
* Anything of, from, or related to the people of Italy over the centuries
** Italians, a Romance ethnic group related to or simply a citizen of the Italian Republic or Italian Kingdom
** Italian language, a Romance languag ...
*
Japanese
Japanese may refer to:
* Something from or related to Japan, an island country in East Asia
* Japanese language, spoken mainly in Japan
* Japanese people, the ethnic group that identifies with Japan through ancestry or culture
** Japanese diaspor ...
*
Kabyle
*
Kazakh
*
Kinyarwanda
Kinyarwanda, Rwandan or Rwanda, officially known as Ikinyarwanda, is a Bantu language and the national language of Rwanda. It is a dialect of the Rwanda-Rundi language that is also spoken in adjacent parts of the Democratic Republic of the ...
*
Korean
Korean may refer to:
People and culture
* Koreans, people from the Korean peninsula or of Korean descent
* Korean culture
* Korean language
**Korean alphabet, known as Hangul or Korean
**Korean dialects
**See also: North–South differences in t ...
* Kurdish (
Central and
Kurmanji
Kurmanji (, ), also termed Northern Kurdish, is the northernmost of the Kurdish languages, spoken predominantly in southeast Turkey, northwest and northeast Iran, northern Iraq, northern Syria and the Caucasus and Khorasan regions. It is the ...
varieties)
*
Kyrgyz
*
Latvian
*
Luganda
Ganda or Luganda ( ; ) is a Bantu language spoken in the African Great Lakes region. It is one of the major languages in Uganda and is spoken by more than 5.56 million Ganda people, Baganda and other people principally in central Uganda, includ ...
*
Macedonian
*
Malayalam
Malayalam (; , ) is a Dravidian languages, Dravidian language spoken in the Indian state of Kerala and the union territories of Lakshadweep and Puducherry (union territory), Puducherry (Mahé district) by the Malayali people. It is one of ...
*
Maltese
Maltese may refer to:
* Someone or something of, from, or related to Malta
* Maltese alphabet
* Maltese cuisine
* Maltese culture
* Maltese language, the Semitic language spoken by Maltese people
* Maltese people, people from Malta or of Maltese ...
*
Marathi
Marathi may refer to:
*Marathi people, an Indo-Aryan ethnolinguistic group of Maharashtra, India
**Marathi people (Uttar Pradesh), the Marathi people in the Indian state of Uttar Pradesh
*Marathi language, the Indo-Aryan language spoken by the Mar ...
* Mari (
Meadow
A meadow ( ) is an open habitat or field, vegetated by grasses, herbs, and other non- woody plants. Trees or shrubs may sparsely populate meadows, as long as they maintain an open character. Meadows can occur naturally under favourable con ...
and
Hill
A hill is a landform that extends above the surrounding terrain. It often has a distinct summit, and is usually applied to peaks which are above elevation compared to the relative landmass, though not as prominent as Mountain, mountains. Hills ...
varieties)
*
Moksha
''Moksha'' (; , '), also called ''vimoksha'', ''vimukti'', and ''mukti'', is a term in Jainism, Buddhism, Hinduism, and Sikhism for various forms of emancipation, liberation, '' nirvana'', or release. In its soteriological and eschatologic ...
*
Mongolian
*
Nepali
* Norwegian (
Nynorsk
Nynorsk (; ) is one of the two official written standards of the Norwegian language, the other being Bokmål. From 12 May 1885, it became the state-sanctioned version of Ivar Aasen's standard Norwegian language (''Landsmål''), parallel to the Da ...
)
*
Odia
*
Pashto
Pashto ( , ; , ) is an eastern Iranian language in the Indo-European language family, natively spoken in northwestern Pakistan and southern and eastern Afghanistan. It has official status in Afghanistan and the Pakistani province of Khyb ...
*
Persian
Persian may refer to:
* People and things from Iran, historically called ''Persia'' in the English language
** Persians, the majority ethnic group in Iran, not to be conflated with the Iranic peoples
** Persian language, an Iranian language of the ...
*
Polish
*
Portuguese
*
Punjabi
*
Romanian
Romanian may refer to:
*anything of, from, or related to the country and nation of Romania
**Romanians, an ethnic group
**Romanian language, a Romance language
***Romanian dialects, variants of the Romanian language
**Romanian cuisine, traditional ...
* Romansh (
Sursilvan
Sursilvan (; also ''romontsch sursilvan'' ; Sursilvan, Vallader, Surmiran, Sutsilvan, and Rumantsch Grischun: ''sursilvan''; Puter: ''sursilvaun'') is a group of dialects of the Romansh language spoken in the Swiss district of Surselva. It is ...
and
Vallader
Vallader (Vallader, Sursilvan, Puter, Surmiran, and Rumantsch Grischun: ''vallader'' ; Sutsilvan: ') is a variety of the Romansh language spoken in the Lower Engadine valley (''Engiadina Bassa'') of southeast Switzerland, between Martina, Swi ...
varieties)
*
Russian
Russian(s) may refer to:
*Russians (), an ethnic group of the East Slavic peoples, primarily living in Russia and neighboring countries
*A citizen of Russia
*Russian language, the most widely spoken of the Slavic languages
*''The Russians'', a b ...
*
Sakha
*
Santali
*
Saraiki
*
Sardinian
*
Serbian
*
Slovenian
*
Spanish
Spanish might refer to:
* Items from or related to Spain:
**Spaniards are a nation and ethnic group indigenous to Spain
**Spanish language, spoken in Spain and many countries in the Americas
**Spanish cuisine
**Spanish history
**Spanish culture
...
*
Swahili
*
Swedish
*
Taiwanese Hokkien
Taiwanese Hokkien ( , ), or simply Taiwanese, also known as Taigi ( zh, c=臺語, tl=Tâi-gí), Taiwanese Southern Min ( zh, c=臺灣閩南語, tl=Tâi-uân Bân-lâm-gí), Hoklo and Holo, is a variety of the Hokkien language spoken natively ...
*
Tamil
Tamil may refer to:
People, culture and language
* Tamils, an ethno-linguistic group native to India, Sri Lanka, and some other parts of Asia
**Sri Lankan Tamils, Tamil people native to Sri Lanka
** Myanmar or Burmese Tamils, Tamil people of Ind ...
*
Tatar
Tatar may refer to:
Peoples
* Tatars, an umbrella term for different Turkic ethnic groups bearing the name "Tatar"
* Volga Tatars, a people from the Volga-Ural region of western Russia
* Crimean Tatars, a people from the Crimea peninsula by the B ...
*
Thai
*
Tigre
*
Tigrinya
*
Toki Pona
Toki Pona (; , , translated as 'the language of good') is a Philosophical language, philosophical, Artistic language, artistic, constructed language designed for its small vocabulary, simplicity, and ease of acquisition. It was created by Canadia ...
*
Twi
Twi (; ) is the common name of the Akan literary language of Asante and Akuapem.
Effectively, it is a synonym for 'Akan' that is not used by the Fante people. It is not a linguistic grouping, but more of a common name used by inland Akans as ...
*
Turkish
*
Upper Sorbian
Upper Sorbian (), occasionally referred to as Wendish (), is a minority language spoken by Sorbs in the historical province of Upper Lusatia, today part of Saxony, Germany. It is a West Slavic language, along with Lower Sorbian, Czech, Poli ...
*
Ukrainian
*
Urdu
Urdu (; , , ) is an Indo-Aryan languages, Indo-Aryan language spoken chiefly in South Asia. It is the Languages of Pakistan, national language and ''lingua franca'' of Pakistan. In India, it is an Eighth Schedule to the Constitution of Indi ...
*
Uyghur
Uyghur may refer to:
* Uyghurs, a Turkic ethnic group living in Eastern and Central Asia (West China)
** Uyghur language, a Turkic language spoken primarily by the Uyghurs
*** Old Uyghur language, a different Turkic language spoken in the Uyghur K ...
*
Uzbek
*
Vietnamese
Vietnamese may refer to:
* Something of, from, or related to Vietnam, a country in Southeast Asia
* Vietnamese people, or Kinh people, a Southeast Asian ethnic group native to Vietnam
** Overseas Vietnamese, Vietnamese people living outside Vietna ...
*
Votic
*
Welsh
See also
*
Forvo
*
Lingua Libre
*
Crowdsource (app)
References
{{Mozilla
Speech recognition software
Datasets in machine learning
Internet properties established in 2017