Neurocomputational speech processing is the computer simulation of speech production and speech perception by referring to the natural neuronal processes of speech production and speech perception, as they occur in the human nervous system (central nervous system and peripheral nervous system). This topic is based on neuroscience and computational neuroscience.


Overview

Neurocomputational models of speech processing are complex. They comprise at least a cognitive part, a motor part and a sensory part. The cognitive or linguistic part of a neurocomputational model of speech processing comprises the neural activation or generation of a phonemic representation on the side of speech production (e.g. the neurocomputational and extended version of the Levelt model developed by Ardi Roelofs: WEAVER++), as well as the neural activation or generation of an intention or meaning on the side of speech perception or speech comprehension. The motor part of a neurocomputational model of speech processing starts with a phonemic representation of a speech item, activates a motor plan and ends with the articulation of that particular speech item (see also: articulatory phonetics). The sensory part of a neurocomputational model of speech processing starts with the acoustic signal of a speech item (the acoustic speech signal), generates an auditory representation for that signal and activates a phonemic representation for that speech item.


Neurocomputational speech processing topics

Neurocomputational speech processing is speech processing by artificial neural networks. Neural maps, mappings and pathways, as described below, are model structures, i.e. important structures within artificial neural networks.


Neural maps

An artificial neural network can be separated into three types of neural maps, also called "layers":

# input maps (in the case of speech processing: the primary auditory map within the auditory cortex and the primary somatosensory map within the somatosensory cortex),
# output maps (the primary motor map within the primary motor cortex), and
# higher-level cortical maps (also called "hidden layers").

The term "neural map" is favoured here over the term "neural layer", because a cortical neural map should be modeled as a 2D map of interconnected neurons (e.g. like a self-organizing map; see also Fig. 1). Thus, each "model neuron" or "artificial neuron" within this 2D map is physiologically represented by a cortical column, since the cerebral cortex anatomically exhibits a layered structure.
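The following is a minimal sketch of such a 2D neural map modeled as a self-organizing map. The map dimensions, the Gaussian neighbourhood, and the random inputs are illustrative assumptions, not the implementation of any particular published model:

```python
# Minimal self-organizing map (SOM) sketch: a 2D "neural map" of model
# neurons, each holding a weight vector in the input space.
import numpy as np

rng = np.random.default_rng(0)

map_height, map_width, input_dim = 10, 10, 3   # e.g. a 10x10 cortical map
weights = rng.random((map_height, map_width, input_dim))

def best_matching_unit(x):
    """Return the (row, col) of the neuron whose weights best match input x."""
    distances = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(distances), distances.shape)

def train_step(x, lr=0.1, radius=2.0):
    """Move the winner and its 2D neighbours towards the input pattern."""
    bmu = best_matching_unit(x)
    rows, cols = np.indices((map_height, map_width))
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    neighbourhood = np.exp(-grid_dist2 / (2 * radius ** 2))  # Gaussian kernel
    weights[...] += lr * neighbourhood[..., None] * (x - weights)

for _ in range(1000):            # self-organization over random inputs
    train_step(rng.random(input_dim))
```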


Neural representations (neural states)

A neural representation within an artificial neural network is a temporarily activated (neural) state within a specific neural map. Each neural state is represented by a specific neural activation pattern. This activation pattern changes during speech processing (e.g. from syllable to syllable).

In the ACT model (see below), it is assumed that an auditory state can be represented by a "neural spectrogram" (see Fig. 2) within an auditory state map. This auditory state map is assumed to be located in the auditory association cortex (see cerebral cortex). A somatosensory state can be divided into a tactile and a proprioceptive state and can be represented by a specific neural activation pattern within the somatosensory state map. This state map is assumed to be located in the somatosensory association cortex (see cerebral cortex, somatosensory system, somatosensory cortex). A motor plan state can be assumed for representing a motor plan, i.e. the planning of speech articulation for a specific syllable or for a longer speech item (e.g. word, short phrase). This state map is assumed to be located in the premotor cortex, while the instantaneous (or lower-level) activation of each speech articulator occurs within the primary motor cortex (see motor cortex).

The neural representations occurring in the sensory and motor maps (as introduced above) are distributed representations (Hinton et al. 1968): each neuron within the sensory or motor map is more or less activated, leading to a specific activation pattern. The neural representation for speech units occurring in the speech sound map (see below: DIVA model) is a punctual or local representation: each speech item or speech unit is represented here by a specific neuron (model cell, see below).
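A short sketch can make the contrast between the two representation types concrete. The map size, the neuron index chosen for /pa/, and the random activations are illustrative assumptions:

```python
# Distributed vs. punctual (local) neural representations, as described above.
import numpy as np

n_neurons = 100                                 # neurons in a map
rng = np.random.default_rng(1)

# Distributed representation (sensory/motor state maps): every neuron is
# more or less active; the whole pattern carries the information.
auditory_state = rng.random(n_neurons)          # graded activation pattern

# Punctual (local) representation (speech sound map): one neuron per
# speech unit; activating the unit means activating that single neuron.
speech_sound_map = np.zeros(n_neurons)
speech_sound_map[42] = 1.0                      # e.g. the unit /pa/ -> neuron 42
```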


Neural mappings (synaptic projections)

A neural mapping connects two cortical neural maps. Neural mappings (in contrast to neural pathways) store training information by adjusting their neural link weights (see artificial neuron, artificial neural networks). Neural mappings are capable of generating or activating a distributed representation (see above) of a sensory or motor state within a sensory or motor map from a punctual or local activation within the other map (see, for example, the synaptic projection from the speech sound map to the motor map, to the auditory target region map, or to the somatosensory target region map in the DIVA model, explained below; or the neural mapping from the phonetic map to the auditory state map and motor plan state map in the ACT model, explained below and in Fig. 3). Neural mappings between two neural maps are compact or dense: each neuron of one neural map is interconnected with (nearly) every neuron of the other neural map (many-to-many connection, see artificial neural networks). Because of this density criterion for neural mappings, neural maps which are interconnected by a neural mapping are not far apart from each other.
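As a rough illustration, a dense neural mapping can be modeled as a weight matrix, so that a punctual activation in one map reads out a distributed pattern in the other. The map sizes and the (untrained) random weights below are assumptions of the sketch:

```python
# A neural mapping as a dense (many-to-many) weight matrix: a punctual
# activation in one map generates a distributed pattern in the other.
import numpy as np

rng = np.random.default_rng(2)
n_sound_units, n_motor_neurons = 100, 200

# Every sound-map neuron connects to (nearly) every motor-map neuron.
mapping = rng.random((n_motor_neurons, n_sound_units))  # trainable link weights

sound_activation = np.zeros(n_sound_units)
sound_activation[42] = 1.0                  # punctual activation of one unit

motor_pattern = mapping @ sound_activation  # distributed motor activation
# Selecting one unit effectively reads out column 42 of the weight matrix.
```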


Neural pathways

In contrast to neural mappings, neural pathways can connect neural maps which are far apart from each other (e.g. in different cortical lobes, see cerebral cortex). From the functional or modeling viewpoint, neural pathways mainly forward information without processing it. A neural pathway, in comparison to a neural mapping, needs far fewer neural connections. A neural pathway can be modelled by using a one-to-one connection of the neurons of both neural maps (see topographic mapping and somatotopic arrangement). Example: in the case of two neural maps, each comprising 1,000 model neurons, a neural mapping needs up to 1,000,000 neural connections (many-to-many connection), while only 1,000 connections are needed in the case of a neural pathway connection. Furthermore, the link weights of the connections within a neural mapping are adjusted during training, while the neural connections in the case of a neural pathway do not need to be trained (each connection is maximally excitatory).
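The connection-count argument and the untrained one-to-one forwarding can be illustrated as follows (the map size is taken from the example above; the rest is an illustrative assumption):

```python
# Connection counts for two maps of 1,000 neurons each, and a pathway
# modeled as an untrained one-to-one (identity) connection.
import numpy as np

n = 1000
mapping_connections = n * n      # many-to-many: 1,000,000 trainable weights
pathway_connections = n          # one-to-one:       1,000 fixed links
print(mapping_connections, pathway_connections)

# A neural pathway simply forwards the activation pattern unchanged
# (each connection maximally excitatory, no training):
state = np.random.default_rng(3).random(n)
forwarded = state.copy()         # equivalent to multiplying by the identity
assert np.allclose(forwarded, state)
```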


DIVA model

The leading approach in neurocomputational modeling of speech production is the DIVA model, developed by Frank H. Guenther and his group at Boston University. The model accounts for a wide range of phonetic and neuroimaging data but, like every neurocomputational model, remains speculative to some extent.


Structure of the model

The organization or structure of the DIVA model is shown in Fig. 4.


Speech sound map: the phonemic representation as a starting point

The speech sound map, assumed to be located in the inferior and posterior portion of Broca's area (left frontal operculum), represents (phonologically specified) language-specific speech units (sounds, syllables, words, short phrases). Each speech unit (mainly syllables; e.g. the syllable and word "palm" /pam/, the syllables /pa/, /ta/, /ka/, ...) is represented by a specific model cell within the speech sound map (i.e. punctual neural representations, see above). Each model cell (see artificial neuron) corresponds to a small population of neurons which are located close together and which fire together.


Feedforward control: activating motor representations

Each neuron (model cell, artificial neuron) within the speech sound map can be activated and subsequently activates a forward motor command towards the motor map, called the articulatory velocity and position map. The activated neural representation at the level of that motor map determines the articulation of a speech unit, i.e. it controls all articulators (lips, tongue, velum, glottis) during the time interval for producing that speech unit. Forward control also involves subcortical structures like the cerebellum, which is not modelled in detail here. A speech ''unit'' represents a set of speech ''items'' which can be assigned to the same phonemic category. Thus, each speech unit is represented by one specific neuron within the speech sound map, while the realization of a speech unit may exhibit some articulatory and acoustic variability. This phonetic variability is the motivation for defining sensory target ''regions'' in the DIVA model (see Guenther et al. 1998).
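A toy sketch of this feedforward step might store one learned motor command (a time series of articulator positions) per speech unit; the dictionary, the unit labels, and the dimensions are hypothetical, not DIVA's actual data structures:

```python
# Feedforward control: a punctual speech-sound-map activation selects a
# stored forward motor command (here, articulator trajectories over time).
import numpy as np

n_articulators, n_timesteps = 4, 50   # lips, tongue, velum, glottis

rng = np.random.default_rng(4)
feedforward_commands = {              # one learned command per speech unit
    "/pa/": rng.random((n_timesteps, n_articulators)),
    "/ta/": rng.random((n_timesteps, n_articulators)),
}

def produce(unit):
    """Activate one speech-sound-map cell and emit its motor command."""
    return feedforward_commands[unit]   # drives the articulatory model

motor_plan = produce("/pa/")            # shape: (timesteps, articulators)
```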


Articulatory model: generating somatosensory and auditory feedback information

The activation pattern within the motor map determines the movement pattern of all model articulators (lips, tongue, velum, glottis) for a speech item. In order not to overload the model, no detailed modeling of the neuromuscular system is done. The Maeda articulatory speech synthesizer is used to generate the articulator movements, which allows the generation of a time-varying vocal tract shape and of the acoustic speech signal for each particular speech item. In terms of control engineering, the articulatory model can be called the plant (i.e. the system which is controlled by the brain); it represents a part of the embodiment of the neuronal speech processing system. The articulatory model generates sensory output, which is the basis for generating feedback information for the DIVA model (see below: feedback control).


Feedback control: sensory target regions, state maps, and error maps

On the one hand, the articulatory model generates sensory information, i.e. an auditory state for each speech unit, which is neurally represented within the auditory state map (distributed representation), and a somatosensory state for each speech unit, which is neurally represented within the somatosensory state map (also a distributed representation). The auditory state map is assumed to be located in the superior temporal cortex, while the somatosensory state map is assumed to be located in the inferior parietal cortex.

On the other hand, the speech sound map, if activated for a specific speech unit (single neuron activation; punctual activation), activates sensory information by synaptic projections between the speech sound map and the auditory target region map and between the speech sound map and the somatosensory target region map. Auditory and somatosensory target regions are assumed to be located in higher-order auditory and higher-order somatosensory cortical regions respectively. These target-region sensory activation patterns, which exist for each speech unit, are learned during speech acquisition (by imitation training; see below: learning).

Consequently, two types of sensory information are available if a speech unit is activated at the level of the speech sound map: (i) learned sensory target regions (i.e. the ''intended'' sensory state for a speech unit) and (ii) sensory state activation patterns resulting from a possibly imperfect execution (articulation) of a specific speech unit (i.e. the ''current'' sensory state, reflecting the current production and articulation of that particular speech unit). Both types of sensory information are projected to sensory error maps, i.e. to an auditory error map, which is assumed to be located in the superior temporal cortex (like the auditory state map), and to a somatosensory error map, which is assumed to be located in the inferior parietal cortex (like the somatosensory state map) (see Fig. 4).

If the current sensory state deviates from the intended sensory state, both error maps generate feedback commands which are projected towards the motor map and which are capable of correcting the motor activation pattern and subsequently the articulation of a speech unit under production. Thus, in total, the activation pattern of the motor map is influenced not only by a specific feedforward command learned for a speech unit (and generated by the synaptic projection from the speech sound map) but also by a feedback command generated at the level of the sensory error maps (see Fig. 4).
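One way to sketch such an error map is to treat the target as an interval per sensory dimension, so that the error is zero inside the target region and points back towards it outside. The gain, the dimensions (e.g. formant frequencies), and the interval representation are assumptions of this sketch, not DIVA's actual formulation:

```python
# Feedback loop sketch: compare the current sensory state with a learned
# target *region* and emit a corrective command only outside that region.
import numpy as np

rng = np.random.default_rng(5)
n_dims = 3                                  # e.g. three formant frequencies

target_low = np.array([0.4, 0.2, 0.6])      # learned auditory target region
target_high = np.array([0.6, 0.4, 0.8])     # (intended sensory state)

current_state = rng.random(n_dims)          # current sensory state (from plant)

# Error is zero inside the target region, otherwise the distance to its edge.
error = np.where(current_state < target_low, current_state - target_low,
                 np.where(current_state > target_high,
                          current_state - target_high, 0.0))

feedback_gain = 0.5
feedback_command = -feedback_gain * error   # projected towards the motor map
```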


Learning (modeling speech acquisition)

While the ''structure'' of a neuroscientific model of speech processing (given in Fig. 4 for the DIVA model) is mainly determined by evolutionary processes, the (language-specific) ''knowledge'' as well as the (language-specific) ''speaking skills'' are learned and trained during speech acquisition. In the case of the DIVA model, it is assumed that the newborn does not have an already structured (language-specific) speech sound map available; i.e. no neuron within the speech sound map is related to any speech unit. Rather, the organization of the speech sound map as well as the tuning of the projections to the motor map and to the sensory target region maps are learned or trained during speech acquisition. Two important phases of early speech acquisition are modeled in the DIVA approach: learning by babbling and by imitation.


Babbling

During babbling, the synaptic projections between the sensory error maps and the motor map are tuned. This training is done by generating a set of semi-random feedforward commands, i.e. the DIVA model "babbles". Each of these babbling commands leads to the production of an "articulatory item", also labeled as a "pre-linguistic (i.e. non-language-specific) speech item" (i.e. the articulatory model generates an articulatory movement pattern on the basis of the babbling motor command). Subsequently, an acoustic signal is generated. On the basis of the articulatory and acoustic signals, a specific auditory and somatosensory state pattern is activated at the level of the sensory state maps (see Fig. 4) for each (pre-linguistic) speech item. At this point the DIVA model has available the sensory and associated motor activation patterns for different speech items, which enables it to tune the synaptic projections between the sensory error maps and the motor map. Thus, during babbling, the DIVA model learns feedback commands (i.e. how to produce a proper (feedback) motor command for a specific sensory input).
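The babbling phase can be sketched with a stand-in linear plant: random motor commands yield sensory states, and the paired data is used to fit an inverse (sensory-to-motor) model. The linear plant and the least-squares fit are illustrative simplifications, not DIVA's actual learning rule:

```python
# Babbling sketch: semi-random motor commands go through the plant, and the
# resulting (motor, sensory) pairs are used to fit an inverse model.
import numpy as np

rng = np.random.default_rng(6)
n_motor, n_sensory, n_babbles = 4, 3, 500

true_plant = rng.random((n_sensory, n_motor))        # unknown to the learner

motor_commands = rng.random((n_babbles, n_motor))    # semi-random babbling
sensory_states = motor_commands @ true_plant.T       # what the model "hears/feels"

# Fit the inverse mapping (sensory -> motor) by least squares, standing in
# for the tuning of the error-map-to-motor-map projections.
inverse_model, *_ = np.linalg.lstsq(sensory_states, motor_commands, rcond=None)

# After babbling: given a desired sensory change, propose a motor command.
desired_sensory = rng.random(n_sensory)
proposed_motor = desired_sensory @ inverse_model
```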


Imitation

During imitation, the DIVA model organizes its speech sound map and tunes the synaptic projections between the speech sound map and the motor map (i.e. tuning of forward motor commands) as well as the synaptic projections between the speech sound map and the sensory target regions (see Fig. 4). Imitation training is done by exposing the model to a set of acoustic speech signals representing realizations of language-specific speech units (e.g. isolated speech sounds, syllables, words, short phrases).

The tuning of the synaptic projections between the speech sound map and the auditory target region map is accomplished by assigning one neuron of the speech sound map to the phonemic representation of a speech item and by associating it with the auditory representation of that speech item, which is activated at the auditory target region map. Auditory ''regions'' (i.e. specifications of the auditory variability of a speech unit) occur because one specific speech unit (i.e. one specific phonemic representation) can be realized by several (slightly) different acoustic (auditory) realizations (for the difference between speech ''item'' and speech ''unit'' see above: feedforward control).

The tuning of the synaptic projections between the speech sound map and the motor map (i.e. tuning of forward motor commands) is accomplished with the aid of feedback commands, since the projections between the sensory error maps and the motor map were already tuned during babbling training (see above). Thus the DIVA model tries to "imitate" an auditory speech item by attempting to find a proper feedforward motor command. Subsequently, the model compares the resulting sensory output (the ''current'' sensory state following the articulation of that attempt) with the already learned auditory target region (the ''intended'' sensory state) for that speech item. Then the model updates the current feedforward motor command by the current feedback motor command generated from the auditory error map of the auditory feedback system. This process may be repeated several times (several attempts), and the DIVA model produces the speech item with a decreasing auditory difference between current and intended auditory state from attempt to attempt.

During imitation the DIVA model is also capable of tuning the synaptic projections from the speech sound map to the somatosensory target region map, since each new imitation attempt produces a new articulation of the speech item and thus a somatosensory state pattern which is associated with the phonemic representation of that speech item.
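The imitation loop can be sketched in the same simplified setting as the babbling sketch above: each attempt produces a current auditory state, the error against the learned target yields a feedback command, and the feedback command is folded into the feedforward command. The linear plant and its pseudo-inverse are, again, illustrative assumptions:

```python
# Imitation sketch: iteratively improve a feedforward command using the
# feedback command until the auditory difference shrinks.
import numpy as np

rng = np.random.default_rng(7)
n_motor, n_sensory = 4, 3
plant = rng.random((n_sensory, n_motor))    # stand-in articulatory-acoustic model
inverse = np.linalg.pinv(plant)             # assumed learned during babbling

target = rng.random(n_sensory)              # learned auditory target
ff_command = rng.random(n_motor)            # initial feedforward attempt

for attempt in range(10):
    current = plant @ ff_command            # current auditory state
    error = current - target                # auditory error map
    feedback_command = -(inverse @ error)   # corrective motor command
    ff_command = ff_command + 0.5 * feedback_command  # update feedforward
    print(attempt, np.linalg.norm(error))   # difference decreases per attempt
```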


Perturbation experiments


Real-time perturbation of F1: the influence of auditory feedback

While auditory feedback is most important during speech acquisition, it may be activated less once the model has learned a proper feedforward motor command for each speech unit. But it has been shown that auditory feedback needs to be strongly coactivated in the case of auditory perturbation (e.g. shifting a formant frequency; Tourville et al. 2005). This is comparable to the strong influence of visual feedback on reaching movements during visual perturbation (e.g. shifting the location of objects by viewing through a prism).


Unexpected blocking of the jaw: the influence of somatosensory feedback

In a way comparable to auditory feedback, somatosensory feedback can also be strongly coactivated during speech production, e.g. in the case of unexpected blocking of the jaw (Tourville et al. 2005).


ACT model

A further approach in neurocomputational modeling of speech processing is the ACT model, developed by Bernd J. Kröger and his group at RWTH Aachen University, Germany (Kröger et al. 2014, Kröger et al. 2009, Kröger et al. 2011). The ACT model is in large part in accord with the DIVA model. The ACT model focuses on the "action repository" (i.e. a repository for sensorimotor speaking skills, comparable to the mental syllabary; see Levelt and Wheeldon 1994), which is not spelled out in detail in the DIVA model. Moreover, the ACT model explicitly introduces a level of motor plans, i.e. a high-level motor description for the production of speech items (see motor goals, motor cortex). The ACT model, like any neurocomputational model, remains speculative to some extent.


Structure

The organization or structure of the ACT model is given in Fig. 5.

For speech production, the ACT model starts with the activation of a phonemic representation of a speech item (phonemic map). In the case of a ''frequent syllable'', a co-activation occurs at the level of the phonetic map, leading to a further co-activation of the intended sensory state at the level of the sensory state maps and to a co-activation of a motor plan state at the level of the motor plan map. In the case of an ''infrequent syllable'', an attempt at a motor plan is generated by the motor planning module for that speech item by activating motor plans for phonetically similar speech items via the phonetic map (see Kröger et al. 2011). The motor plan, or vocal tract action score, comprises temporally overlapping vocal tract actions, which are programmed and subsequently executed by the motor programming, execution, and control module. This module receives real-time somatosensory feedback information for controlling the correct execution of the (intended) motor plan. Motor programming leads to an activation pattern at the level of the primary motor map and subsequently activates neuromuscular processing. Motoneuron activation patterns generate muscle forces and subsequently movement patterns of all model articulators (lips, tongue, velum, glottis). The Birkholz 3D articulatory synthesizer is used to generate the acoustic speech signal.

Articulatory and acoustic feedback signals are used for generating somatosensory and auditory feedback information via the sensory preprocessing modules, which is forwarded towards the auditory and somatosensory maps. At the level of the sensory-phonetic processing modules, auditory and somatosensory information is stored in short-term memory, and the external sensory signals (ES, Fig. 5, which are activated via the sensory feedback loop) can be compared with the already trained sensory signals (TS, Fig. 5, which are activated via the phonetic map). Auditory and somatosensory error signals can be generated if external and intended (trained) sensory signals are noticeably different (cf. DIVA model).

The light green area in Fig. 5 indicates those neural maps and processing modules which process a syllable as a whole unit (specific processing time window of around 100 ms and more). This processing comprises the phonetic map and the directly connected sensory state maps within the sensory-phonetic processing modules and the directly connected motor plan state map, while the primary motor map as well as the (primary) auditory and (primary) somatosensory maps process smaller time windows (around 10 ms in the ACT model).

The hypothetical cortical locations of the neural maps within the ACT model are shown in Fig. 6. The hypothetical locations of the primary motor and primary sensory maps are given in magenta, the hypothetical locations of the motor plan state map and the sensory state maps (within the sensory-phonetic processing modules, comparable to the error maps in DIVA) are given in orange, and the hypothetical locations of the mirrored phonetic map are given in red. Double arrows indicate neuronal mappings. Neural mappings connect neural maps which are not far apart from each other (see above). The two mirrored locations of the phonetic map are connected via a neural pathway (see above), leading to a (simple) one-to-one mirroring of the current activation pattern between both realizations of the phonetic map. This neural pathway between the two locations of the phonetic map is assumed to be a part of the arcuate fasciculus (AF, see Fig. 5 and Fig. 6).

For speech perception, the model starts with an external acoustic signal (e.g. produced by an external speaker). This signal is preprocessed, passes the auditory map, and leads to an activation pattern for each syllable or word at the level of the auditory-phonetic processing module (ES: external signal, see Fig. 5). The ventral path of speech perception (see Hickok and Poeppel 2007) would directly activate a lexical item, but it is not implemented in ACT. Rather, in ACT the activation of a phonemic state occurs via the phonemic map and thus may lead to a coactivation of motor representations for that speech item (i.e. the dorsal pathway of speech perception; ibid.).
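The two production routes described above can be caricatured as a lookup with a similarity-based fallback. The stored plans, the toy similarity measure, and the adaptation step below are purely illustrative, not ACT's actual motor planning module:

```python
# Frequent syllables: retrieve a stored motor plan from the repository.
# Infrequent syllables: assemble a plan from phonetically similar items.

stored_plans = {            # action repository: frequent syllables only
    "/ba/": "plan_ba",
    "/da/": "plan_da",
    "/ka/": "plan_ka",
}

def similarity(syll_a, syll_b):
    """Toy phonetic similarity: shared characters (a real model would
    compare phonetic features)."""
    return len(set(syll_a) & set(syll_b))

def get_motor_plan(syllable):
    if syllable in stored_plans:                       # frequent syllable
        return stored_plans[syllable]
    # infrequent syllable: adapt the plan of the most similar stored item
    nearest = max(stored_plans, key=lambda s: similarity(s, syllable))
    return f"assembled from {stored_plans[nearest]}"

print(get_motor_plan("/ba/"))   # stored plan
print(get_motor_plan("/bi/"))   # assembled via a phonetically similar item
```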


Action repository

The phonetic map together with the motor plan state map, the sensory state maps (occurring within the sensory-phonetic processing modules), and the phonemic (state) map form the action repository. The phonetic map is implemented in ACT as a self-organizing neural map, and different speech items are represented by different neurons within this map (punctual or local representation, see above: neural representations). The phonetic map exhibits three major characteristics:

* More than one phonetic realization may occur within the phonetic map for one phonemic state (see the phonemic link weights in Fig. 7: e.g. the syllable /de:m/ is represented by three neurons within the phonetic map).
* Phonetotopy: the phonetic map exhibits an ordering of speech items with respect to different phonetic features (see the phonemic link weights in Fig. 7; three examples: (i) the syllables /p@/, /t@/, and /k@/ occur in an upward ordering at the left side of the phonetic map; (ii) syllable-initial plosives occur in the upper left part of the phonetic map, while syllable-initial fricatives occur in the lower right half; (iii) CV syllables and CVC syllables occur in different areas of the phonetic map).
* The phonetic map is hypermodal or multimodal: the activation of a phonetic item at the level of the phonetic map coactivates (i) a phonemic state (see the phonemic link weights in Fig. 7), (ii) a motor plan state (see the motor plan link weights in Fig. 7), (iii) an auditory state (see the auditory link weights in Fig. 7), and (iv) a somatosensory state (not shown in Fig. 7). All these states are learned or trained during speech acquisition by tuning the synaptic link weights between each neuron within the phonetic map, representing a particular phonetic state, and all neurons within the associated motor plan and sensory state maps (see also Fig. 3).

The phonetic map implements the action-perception link within the ACT model (see also Fig. 5 and Fig. 6: the dual neural representation of the phonetic map in the frontal lobe and at the intersection of the temporal lobe and the parietal lobe).
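The hypermodal property can be sketched as follows: each phonetic-map neuron carries learned link weights to a phonemic, a motor plan, an auditory, and a somatosensory state, so a punctual activation reads all of them out. The dimensions and the one-shot association below are illustrative assumptions:

```python
# Hypermodal phonetic map sketch: one neuron, four linked modal states.
import numpy as np

rng = np.random.default_rng(8)
n_phonetic = 25                         # neurons in the phonetic map

link_weights = {
    "phonemic": np.zeros((n_phonetic, 10)),
    "motor_plan": np.zeros((n_phonetic, 8)),
    "auditory": np.zeros((n_phonetic, 12)),
    "somatosensory": np.zeros((n_phonetic, 6)),
}

def associate(neuron, states):
    """Store the co-occurring modal states on one phonetic-map neuron."""
    for modality, state in states.items():
        link_weights[modality][neuron] = state

def coactivate(neuron):
    """Punctual activation of one neuron reads out all linked states."""
    return {m: w[neuron] for m, w in link_weights.items()}

# Training: associate neuron 7 with the states of one realization of /de:m/.
associate(7, {m: rng.random(w.shape[1]) for m, w in link_weights.items()})
states = coactivate(7)   # phonemic, motor plan, auditory, somatosensory
```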


Motor plans

A motor plan is a high-level motor description for the production and articulation of a speech item (see motor goals, motor skills, articulatory phonetics, articulatory phonology). In the neurocomputational model ACT, a motor plan is quantified as a vocal tract action score. Vocal tract action scores quantitatively determine the number of vocal tract actions (also called articulatory gestures) which need to be activated in order to produce a speech item, their degree of realization and duration, and the temporal organization of all vocal tract actions building up the speech item (for a detailed description of vocal tract action scores see e.g. Kröger & Birkholz 2007). The detailed realization of each vocal tract action (articulatory gesture) depends on the temporal organization of all vocal tract actions building up the speech item and especially on their temporal overlap. Thus the detailed realization of each vocal tract action within a speech item is specified below the motor plan level in ACT (see Kröger et al. 2011).
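A vocal tract action score can be sketched as a plain data structure in which each action carries its articulator, its degree of realization, and its timing, so that temporal overlap is derivable. The field names and values are illustrative and do not follow Kröger & Birkholz's actual format:

```python
# A toy vocal tract action score: a list of timed articulatory gestures.
from dataclasses import dataclass

@dataclass
class VocalTractAction:
    articulator: str      # e.g. "lips", "tongue body", "velum", "glottis"
    degree: float         # degree of realization (e.g. constriction degree)
    onset_ms: float       # start time within the speech item
    duration_ms: float

# Hypothetical score for the syllable /pa/: a labial closure overlapping a
# glottal opening gesture and a vowel-related tongue gesture.
score_pa = [
    VocalTractAction("lips", 1.0, onset_ms=0, duration_ms=120),
    VocalTractAction("glottis", 0.8, onset_ms=0, duration_ms=100),
    VocalTractAction("tongue body", 0.4, onset_ms=60, duration_ms=250),
]

def overlapping(a, b):
    """True if two vocal tract actions overlap in time."""
    return (a.onset_ms < b.onset_ms + b.duration_ms
            and b.onset_ms < a.onset_ms + a.duration_ms)

print(overlapping(score_pa[0], score_pa[2]))   # True: closure overlaps vowel
```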


Integrating sensorimotor and cognitive aspects: the coupling of action repository and mental lexicon

A severe problem of phonetic or sensorimotor models of speech processing (like DIVA or ACT) is that the development of the phonemic map during speech acquisition is not modeled. A possible solution to this problem could be a direct coupling of the action repository and the mental lexicon without explicitly introducing a phonemic map at the beginning of speech acquisition (even at the beginning of imitation training; see Kröger et al. 2011, PALADYN Journal of Behavioral Robotics).


Experiments: speech acquisition

A very important issue for all neuroscientific or neurocomputational approaches is to separate structure and knowledge. While the structure of the model (i.e. of the human neuronal network which is needed for processing speech) is mainly determined by evolutionary processes, the knowledge is gathered mainly during speech acquisition by processes of learning. Different learning experiments were carried out with the model ACT in order to learn (i) a five-vowel system /i, e, a, o, u/ (see Kröger et al. 2009), (ii) a small consonant system (voiced plosives /b, d, g/) in combination with all five vowels acquired earlier, as CV syllables (ibid.), (iii) a small model language comprising the five-vowel system, voiced and unvoiced plosives /b, d, g, p, t, k/, nasals /m, n/ and the lateral /l/, and three syllable types (V, CV, and CCV) (see Kröger et al. 2011), and (iv) the 200 most frequent syllables of Standard German for a 6-year-old child (see Kröger et al. 2011). In all cases, an ordering of phonetic items with respect to different phonetic features can be observed.


Experiments: speech perception

Despite the fact that the ACT model in its earlier versions was designed as a pure speech production model (including speech acquisition), the model is capable of exhibiting important basic phenomena of speech perception, i.e. categorical perception and the McGurk effect. In the case of categorical perception, the model is able to exhibit that categorical perception is stronger in the case of plosives than in the case of vowels (see Kröger et al. 2009). Furthermore, the ACT model was able to exhibit the McGurk effect if a specific mechanism of inhibition of neurons at the level of the phonetic map was implemented (see Kröger and Kannampuzha 2008).


See also

* Speech production
* Speech perception
* Computational neuroscience
* Articulatory synthesis
* Auditory feedback


References

Kröger BJ, Kannampuzha J (2008). A neurofunctional model of speech production including aspects of auditory and audio-visual speech perception. ''Proceedings of the International Conference on Audio-Visual Speech Processing 2008'' (Moreton Island, Queensland, Australia), pp. 83–88.


Further reading


Iaroslav Blagouchine and Eric Moreau. ''Control of a Speech Robot via an Optimum Neural-Network-Based Internal Model with Constraints.'' IEEE Transactions on Robotics, vol. 26, no. 1, pp. 142–159, February 2010.