As of the early 2000s, several speech recognition (SR) software packages exist for Linux. Some of them are free and open-source software and others are proprietary software. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.

Linux native speech recognition

History

In the late 1990s, a Linux version of ViaVoice, created by IBM, was made available to users for no charge. In 2002, the free software development kit (SDK) was removed by the developer.

Development status

In the early 2000s, there was a push to get a high-quality Linux native speech recognition engine developed. As a result, several projects dedicated to creating Linux speech recognition programs were begun, such as Mycroft, which is similar to Microsoft Cortana, but open-source.

Speech sample crowdsourcing

It is essential to compile a speech corpus to produce acoustic models for speech recognition projects. VoxForge is a free speech corpus and acoustic model repository that was built to collect transcribed speech to be used in speech recognition projects. VoxForge accepts crowdsourced speech samples and corrections of recognized speech sequences. It is licensed under a GNU General Public License (GPL).
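As a rough illustration of what a speech corpus contains, the sketch below pairs audio files with their transcriptions and derives a word list from them. The `Utterance` class, the `vocabulary` function, and the file names are hypothetical and are not taken from VoxForge or any other project.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file paired with its transcription."""
    audio_path: str   # hypothetical file name, not a real VoxForge path
    transcript: str


def vocabulary(corpus):
    """Count how often each word occurs across all transcriptions."""
    counts = Counter()
    for utt in corpus:
        counts.update(utt.transcript.lower().split())
    return counts


# A toy corpus; a real one holds many hours of transcribed speech.
corpus = [
    Utterance("sample-0001.wav", "open the terminal"),
    Utterance("sample-0002.wav", "close the window"),
]
vocab = vocabulary(corpus)
print(sorted(vocab))
print(vocab["the"])
```

A word count of this kind is only the starting point; training an acoustic model also requires the audio itself and is outside the scope of this sketch.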

Speech recognition concept

The first step is to begin recording an audio stream on a computer. The user has two main processing options:

* ''Discrete speech recognition'' (DSR) – processes information entirely on a local machine. This refers to self-contained systems in which all aspects of SR are performed within the user's computer. As of 2018, this was becoming critical for protecting intellectual property (IP) and avoiding unwanted surveillance.
* ''Remote'' or ''server-based'' SR – transmits an audio speech file to a remote server to convert the file into a text string file. Due to recent cloud storage schemes and data mining, this method more easily allows surveillance, theft of information, and insertion of malware.

Remote recognition was formerly used by smartphones because they lacked sufficient performance, working memory, or storage to process speech recognition within the phone. These limits have largely been overcome, although server-based SR on mobile devices remains universal.
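The difference between the two processing options can be sketched in a few lines. This is only a schematic: `recognize_locally`, `recognize_remotely`, `fake_decode`, and `fake_send` are hypothetical stand-ins, and no real recognizer or network call is involved. The point is which path lets the audio leave the machine.

```python
from typing import Callable, List


def recognize_locally(audio: bytes, decode: Callable[[bytes], str]) -> str:
    """Local processing: the audio is decoded in-process and never sent anywhere."""
    return decode(audio)


def recognize_remotely(audio: bytes, send: Callable[[bytes], str]) -> str:
    """Server-based processing: the audio is handed to a transport that ships it to a server."""
    return send(audio)


# Stand-ins for a real decoder and a real network client (assumptions).
uploaded: List[bytes] = []


def fake_decode(audio: bytes) -> str:
    return "hello world"


def fake_send(audio: bytes) -> str:
    uploaded.append(audio)   # record what would have left the machine
    return "hello world"


audio = b"\x00\x01\x02\x03"
local_text = recognize_locally(audio, fake_decode)
uploads_after_local = len(uploaded)
remote_text = recognize_remotely(audio, fake_send)
uploads_after_remote = len(uploaded)
print(uploads_after_local, uploads_after_remote)
```

Both paths return the same text here; only the remote path adds an entry to `uploaded`, which is the property that raises the privacy concerns described above.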

Speech recognition in browser

Discrete speech recognition can be performed within a web browser and works well with supported browsers. Remote SR does not require installing software on a desktop computer or mobile device as it is mainly a server-based system with the inherent security issues noted above.

* ''Remote'': The dictation service records an audio track of the user via a web browser.
* ''DSR'': Some solutions work on a client only, without sending data to servers.

Free speech recognition engines

The following is a list of projects dedicated to implementing speech recognition in Linux, and major native solutions. These are not end-user applications. They are programming libraries that may be used to develop end-user applications.

* CMU Sphinx is a general term to describe a group of speech recognition systems developed at Carnegie Mellon University.
* HTK was the best-known and most widely used speech recognition software before Kaldi.
* Julius is a high-performance, two-pass ''large vocabulary continuous speech recognition'' (LVCSR) decoder software for speech-related researchers and developers.
* Kaldi is a toolkit for speech recognition provided under the Apache licence.
* Mozilla DeepSpeech is an open-source speech-to-text engine based on Baidu's Deep Speech research paper.
* VoxForge is a free speech corpus and acoustic model repository for open-source speech recognition engines.

Proprietary speech recognition engines

* Janus Recognition Toolkit (JRTk) is a closed-source speech recognition toolkit, mainly targeted at Linux, developed by the Interactive Systems Laboratories at Carnegie Mellon University and Karlsruhe Institute of Technology. Commercial and research licenses are available.

Voice control and keyboard shortcuts

Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for sending operational commands to a computer or appliance. Voice control typically requires a much smaller vocabulary and thus is much easier to implement. Simple software combined with keyboard shortcuts has the earliest potential for practically accurate voice control in Linux.
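A minimal sketch of why a small vocabulary makes voice control easier: recognized text only has to be matched against a handful of known phrases, each bound to a keyboard shortcut. The phrases, the shortcuts, and the `shortcut_for` function are illustrative assumptions; the sketch starts from already recognized text and does not send any keystrokes.

```python
import difflib

# Hypothetical command vocabulary; phrases and shortcuts are illustrative only.
COMMANDS = {
    "open terminal": "ctrl+alt+t",
    "close window": "alt+f4",
    "switch window": "alt+tab",
    "copy": "ctrl+c",
    "paste": "ctrl+v",
}


def shortcut_for(heard, cutoff=0.8):
    """Return the shortcut for the closest known phrase, or None if nothing is close enough."""
    # Normalize case and whitespace before comparing.
    phrase = " ".join(heard.lower().split())
    matches = difflib.get_close_matches(phrase, list(COMMANDS), n=1, cutoff=cutoff)
    return COMMANDS[matches[0]] if matches else None


print(shortcut_for("open terminal"))
print(shortcut_for("open terminel"))    # slightly misrecognized phrase
print(shortcut_for("play some music"))  # not in the vocabulary
```

Fuzzy matching with `difflib.get_close_matches` tolerates small recognition errors, while the `cutoff` keeps unrelated speech from triggering a command. Passing the resulting shortcut to the desktop would need a separate input-automation tool, which is not shown here.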

Running Windows speech recognition software with Linux

Via compatibility layer

It is possible to use programs such as Dragon NaturallySpeaking in Linux by using Wine, though some problems may arise, depending on which version is used.

Via virtualized Windows

It is also possible to use Windows speech recognition software under Linux. Using no-cost virtualization software, it is possible to run Windows and NaturallySpeaking under Linux. VMware Server and VirtualBox support copy and paste to and from a virtual machine, making dictated text easily transferable between the virtual machine and the host.

External links

Accessibility, SpeechRecognition – Ubuntu Help