Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings made by other users. The transcribed sentences are collected in a voice database released under CC0, a public domain dedication. This ensures that developers can use the database for voice-to-text applications without restrictions or costs.
Aims
Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.
History
At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project, which aims to make machines understand the Bangla language. 2,000 hours of voice recordings were collected, with the aim of exceeding 10,000 hours.
Voice database
The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences.
In February 2019, the first batch of languages was released for use. It comprised 18 languages, including English, French, German and Mandarin Chinese, as well as less widely spoken languages such as Welsh and Kabyle. In total, this release included almost 1,400 hours of recorded voice data from more than 42,000 contributors.
As of July 2020, the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers.
In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili.
In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the Mozilla Common Voice database.
As of October 2022, Mozilla Common Voice officially collects voice data for the following languages:
* Abkhaz
* Arabic
* Armenian
* Assamese
* Asturian
* Bashkir
* Basaa
* Basque
* Belarusian
* Bengali
* Breton
* Bulgarian
* Catalan
* Chinese (Cantonese and Mandarin varieties)
* Chuvash
* Czech
* Danish
* Dhivehi
* Dutch
* English
* Esperanto
* Erzya
* Finnish
* French
* Frisian
* Galician
* Georgian
* German
* Greek
* Guaraní
* Hausa
* Hakha Chin
* Hindi
* Hungarian
* Indonesian
* Interlingua
* Irish
* Italian
* Japanese
* Kabyle
* Kazakh
* Kinyarwanda
* Korean
* Kurdish (Central and Kurmanji varieties)
* Kyrgyz
* Latvian
* Luganda
* Macedonian
* Malayalam
* Maltese
* Marathi
* Mari (Meadow and Hill varieties)
* Moksha
* Mongolian
* Nepali
* Norwegian (Nynorsk)
* Odia
* Persian
* Polish
* Portuguese
* Punjabi
* Romanian
* Romansh (Sursilvan and Vallader varieties)
* Russian
* Sakha
* Santali
* Saraiki
* Sardinian
* Serbian
* Slovenian
* Spanish
* Swahili
* Swedish
* Taiwanese Hokkien
* Tamil
* Tatar
* Thai
* Tigre
* Tigrinya
* Toki Pona
* Twi
* Turkish
* Upper Sorbian
* Ukrainian
* Uyghur
* Uzbek
* Vietnamese
* Votic
* Welsh
See also
* Forvo
* Lingua Libre
* Crowdsource (app)