Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings made by other users. The transcribed sentences are collected in a voice database released under the CC0 public domain dedication. This ensures that developers can use the database for voice-to-text applications without restrictions or costs.


Aims

Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.


History

At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project, which aims to make machines understand the Bangla language. The project collected 2,000 hours of voice recordings, with the aim of exceeding 10,000 hours.


Voice database

The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences. In February 2019, the first batch of languages was released for use. It comprised 18 languages, including English, French, German and Mandarin Chinese, but also less prevalent languages such as Welsh and Kabyle. In total, this release included almost 1,400 hours of recorded voice data from more than 42,000 contributors. As of July 2020, the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers. In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili. In September 2022, it was announced that the Twi language
of Ghana was the 100th language to be added to the Mozilla Common Voice database. As of October 2022, Mozilla Common Voice officially collects voice data for the following languages:

* Abkhaz
* Arabic
* Armenian
* Assamese
* Asturian
* Bashkir
* Basaa
* Basque
* Belarusian
* Bengali
* Breton
* Bulgarian
* Catalan
* Chinese (Cantonese and Mandarin varieties)
* Chuvash
* Czech
* Danish
* Dhivehi
* Dutch
* English
* Esperanto
* Erzya
* Finnish
* French
* Frisian
* Galician
* Georgian
* German
* Greek
* Guaraní
* Hausa
* Hakha Chin
* Hindi
* Hungarian
* Indonesian
* Interlingua
* Irish
* Italian
* Japanese
* Kabyle
* Kazakh
* Kinyarwanda
* Korean
* Kurdish (Central and Kurmanji varieties)
* Kyrgyz
* Latvian
* Luganda
* Macedonian
* Malayalam
* Maltese
* Marathi
* Mari (Meadow and Hill varieties)
* Moksha
* Mongolian
* Nepali
* Norwegian (Nynorsk)
* Odia
* Persian
* Polish
* Portuguese
* Punjabi
* Romanian
* Romansh (Sursilvan and Vallader varieties)
* Russian
* Sakha
* Santali
* Saraiki
* Sardinian
* Serbian
* Slovenian
* Spanish
* Swahili
* Swedish
* Taiwanese Hokkien
* Tamil
* Tatar
* Thai
* Tigre
* Tigrinya
* Toki Pona
* Twi
* Turkish
* Upper Sorbian
* Ukrainian
* Urdu
* Uyghur
* Uzbek
* Vietnamese
* Votic
* Welsh


See also

* Forvo
* Lingua Libre
* Crowdsource (app)

