Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text applications without restrictions or costs.
Aims
Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.
History
At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project, which aims to make machines understand the Bangla language. 2,000 hours of voice recordings have been collected, with a goal of more than 10,000 hours.
Voice database
The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences.
In February 2019, the first batch of languages was released for use. This included 18 languages, among them English, French, German and Mandarin Chinese, but also less prevalent languages such as Welsh and Kabyle. In total, this included almost 1,400 hours of recorded voice data from more than 42,000 contributors.
As of July 2020, the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers.
In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili.
In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the Mozilla Common Voice database.
As of October 2022, Mozilla Common Voice officially collects voice data for the following languages:
* Abkhaz
* Arabic
* Armenian
* Assamese
* Asturian
* Bashkir
* Basaa
* Basque
* Belarusian
* Bengali
* Breton
* Bulgarian
* Catalan
* Chinese (Cantonese and Mandarin varieties)
* Chuvash
* Czech
* Danish
* Dhivehi
* Dutch
* English
* Esperanto
* Erzya
* Finnish
* French
* Frisian
* Galician
* Georgian
* German
* Greek
* Guaraní
* Hausa
* Hakha Chin
* Hindi
* Hungarian
* Indonesian
* Interlingua
* Irish
* Italian
* Japanese
* Kabyle
* Kazakh
* Kinyarwanda
* Korean
* Kurdish (Central and Kurmanji varieties)
* Kyrgyz
* Latvian
* Luganda
* Macedonian
* Malayalam
* Maltese
* Marathi
* Mari (Meadow and Hill varieties)
* Moksha
* Mongolian
* Nepali
* Norwegian (Nynorsk)
* Odia
* Persian
* Polish
* Portuguese
* Punjabi
* Romanian
* Romansh (Sursilvan and Vallader varieties)
* Russian
* Sakha
* Santali
* Saraiki
* Sardinian
* Serbian
* Slovenian
* Spanish
* Swahili
* Swedish
* Taiwanese Hokkien
* Tamil
* Tatar
* Thai
* Tigre
* Tigrinya
* Toki Pona
* Twi
* Turkish
* Upper Sorbian
* Ukrainian
* Urdu
* Uyghur
* Uzbek
* Vietnamese
* Votic
* Welsh
See also
* Forvo
* Lingua Libre
* Crowdsource (app)