A spoken dialog system (SDS) is a computer system able to converse with a human by voice. It has two essential components that do not exist in a written text dialog system: a speech recognizer and a text-to-speech module (written text dialog systems usually use other input systems provided by an OS). It can be further distinguished from command and control speech systems, which can respond to requests but do not attempt to maintain continuity over time.
Components
* An automatic speech recognizer (ASR) decodes speech into text. Domain-specific recognizers can be configured for language designed for a given application. A "cloud" recognizer will be suitable for domains that do not depend on very specific vocabularies.
* Natural language understanding transforms a recognition result into a concept structure that can drive system behavior. Some approaches combine recognition and understanding in a single processing step, but these are thought to be less flexible since interpretation has to be coded into the grammar.
* The dialog manager controls turn-by-turn behavior. A simple dialog system may ask the user questions and then act on the responses. Such directed dialog systems use a tree-like structure for control; frame-based (or form-based) systems allow for some user initiative and accommodate different styles of interaction. More sophisticated dialog managers incorporate mechanisms for dealing with misunderstandings and clarification.
* The domain reasoner, or more simply the back-end, makes use of a knowledge base to retrieve information and helps formulate system responses. In simple systems, this may be a database that is queried using information collected through the dialog. The domain reasoner, together with the dialog manager, maintains the context of interaction and allows the system to reflect some human conversational abilities (for example, using anaphora).
* Response generation is similar to text-based natural language generation, but takes into account the needs of spoken communication. This might include using simpler grammatical constructions, managing the amount of information in any one output utterance, and introducing prosodic markers to help the human participant absorb information more easily. A complete system design will also introduce elements of lexical entrainment, to encourage the human user to favor certain ways of speaking, which in turn can improve recognition performance.
* Text-to-speech synthesis (TTS) realizes an intended utterance as speech. Depending on the application, TTS may be based on concatenation of pre-recorded material produced by voice professionals. In more complex applications TTS will use more flexible techniques that accommodate large vocabularies and that allow the developer control over the character ("personality") of the system.
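The way these components hand results to one another can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not how any particular system is built: every name here (`recognize`, `understand`, `DialogManager`, `lookup`, `generate`, `synthesize`, `turn`, `FORECASTS`) is hypothetical, the "recognizer" receives a transcript rather than audio, and the "synthesizer" returns tagged text rather than a waveform.

```python
from dataclasses import dataclass, field

# Hypothetical toy knowledge base: city -> forecast.
FORECASTS = {"boston": "sunny", "turin": "rainy"}


def recognize(audio: str) -> str:
    """ASR stand-in: the 'audio' here is already a transcript."""
    return audio.lower().strip()


def understand(text: str) -> dict:
    """NLU stand-in: keyword spotting that yields a concept structure."""
    concepts = {}
    if "weather" in text:
        concepts["intent"] = "get_weather"
    for city in FORECASTS:
        if city in text:
            concepts["city"] = city
    return concepts


def lookup(city: str) -> str:
    """Domain reasoner stand-in: query the toy knowledge base."""
    return FORECASTS[city]


@dataclass
class DialogManager:
    """Keeps context across turns and picks the next system action."""
    context: dict = field(default_factory=dict)

    def next_action(self, concepts: dict) -> dict:
        self.context.update(concepts)  # earlier turns stay available
        if self.context.get("intent") != "get_weather":
            return {"act": "ask_intent"}
        if "city" not in self.context:
            return {"act": "ask_city"}
        city = self.context["city"]
        return {"act": "inform", "city": city, "forecast": lookup(city)}


def generate(action: dict) -> str:
    """Response generation stand-in: short templates suited to speech."""
    if action["act"] == "ask_intent":
        return "How can I help?"
    if action["act"] == "ask_city":
        return "Which city?"
    return f"The weather in {action['city'].title()} is {action['forecast']}."


def synthesize(text: str) -> str:
    """TTS stand-in: tag the text instead of producing audio."""
    return f"<speech>{text}</speech>"


def turn(dm: DialogManager, audio: str) -> str:
    """Run one user turn through the whole pipeline."""
    return synthesize(generate(dm.next_action(understand(recognize(audio)))))


dm = DialogManager()
first = turn(dm, "What is the weather?")   # system asks for the missing city
second = turn(dm, "Boston")                # intent is remembered from turn one
```

Because the dialog manager retains the intent from the first turn, the bare reply "Boston" in the second turn is enough to produce a forecast. A command and control system without that retained context would have nothing to attach the city name to.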
Varieties of systems
Spoken dialog systems vary in their complexity. Directed dialog systems are very simple and require that the developer create a graph (typically a tree) that manages the task but may not correspond to the needs of the user. Information access systems, typically based on forms, allow users some flexibility (for example in the order in which retrieval constraints are specified, or in the use of optional constraints) but are limited in their capabilities. Problem-solving dialog systems may allow human users to engage in a number of different activities that may include information access, plan construction and possible execution of the latter.
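The difference between a directed dialog and a form-based one can be illustrated with a small Python sketch. The prompts, slot names, and helper functions below (`MENU`, `directed_prompts`, `REQUIRED`, `OPTIONAL`, `fill`, `next_prompt`) are hypothetical and chosen only for illustration; a deployed system would sit on top of recognition and understanding components rather than take clean strings.

```python
# Directed dialog: a developer-authored tree; each answer selects a branch.
MENU = {
    "prompt": "Schedules or tickets?",
    "next": {
        "schedules": {"prompt": "Which station?", "next": {}},
        "tickets": {"prompt": "One way or return?", "next": {}},
    },
}


def directed_prompts(tree, answers):
    """Return the prompts issued while following the answers down the tree."""
    node, prompts = tree, [tree["prompt"]]
    for answer in answers:
        node = node["next"].get(answer)
        if node is None:  # the developer did not anticipate this answer
            prompts.append("Sorry, I did not understand.")
            break
        prompts.append(node["prompt"])
    return prompts


# Form-based dialog: slots may arrive in any order, some are optional.
REQUIRED = ("origin", "destination")
OPTIONAL = ("time",)


def fill(form, **constraints):
    """Store whichever known slots the user supplied in this turn."""
    for slot, value in constraints.items():
        if slot not in REQUIRED + OPTIONAL:
            raise KeyError(slot)
        form[slot] = value
    return form


def next_prompt(form):
    """Ask for the first missing required slot, or None when ready to query."""
    for slot in REQUIRED:
        if slot not in form:
            return f"What is the {slot}?"
    return None


on_path = directed_prompts(MENU, ["tickets"])
off_path = directed_prompts(MENU, ["weather"])

form = fill({}, destination="Turin", time="09:00")  # user volunteers two slots
ask = next_prompt(form)
done = next_prompt(fill(form, origin="Milan"))
```

In the directed case an unanticipated answer falls off the tree, whereas the form accepts the destination and an optional time up front and only asks for what is still missing.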
Some examples of systems include:
* Information access: Weather, train schedules, stock quotes, directory assistance.
* Transactional: Credit card and bank enquiries; ticket purchases.
* Maintenance: Technical support including documentation access and diagnostic testing.
* Tutoring: For education, such as physics or math, and language learning.
* Entertainment and chatting
History
Pioneers in dialogue systems include companies such as AT&T (with its speech recognizer system in the 1970s) and CSELT laboratories, which led some European research projects during the 1980s (e.g. SUNDIAL) after the end of the DARPA project in the US.
References
The field of spoken dialog systems is quite large and includes research (featured at scientific conferences such as SIGdial and Interspeech) and a large industrial sector (with its own meetings such as SpeechTek and AVIOS).
The following might provide good technical introductions:
* Michael F. McTear, Spoken Dialogue Technology
* Gabriel Skantze, 2007: chapter 2, Spoken dialogue systems
* Pirani, Giancarlo, ed. Advanced algorithms and architectures for speech understanding. Vol. 1. Springer Science & Business Media, 2013. ISBN 978-3-540-53402-0