Deep Learning Speech Synthesis

Deep learning speech synthesis uses deep neural networks (DNNs) to produce artificial speech from text (text-to-speech) or from a spectrum (vocoder). The deep neural networks are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text. Some DNN-based speech synthesizers are approaching the naturalness of the human voice.


Formulation

Given an input text or some sequence of linguistic units Y, the target speech X can be derived by X = \arg\max P(X \mid Y, \theta), where \theta is the model parameter. Typically, the input text is first passed to an acoustic feature generator, and the acoustic features are then passed to the neural vocoder. For the acoustic feature generator, the loss function is typically an L1 or L2 loss. These loss functions impose a constraint that the output acoustic feature distributions must be Gaussian or Laplacian. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function is designed to penalise this range more heavily: \text{loss} = \alpha \, \text{loss}_\text{human} + (1 - \alpha) \, \text{loss}_\text{other}, where \text{loss}_\text{human} is the loss over the human voice band and \alpha is a scalar, typically around 0.5.
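As a minimal sketch of such a band-weighted loss (the function name, the use of NumPy, and the exact bin bookkeeping are illustrative assumptions, not a published implementation):

    import numpy as np

    def band_weighted_l1(pred_spec, target_spec, freqs, alpha=0.5,
                         band=(300.0, 4000.0)):
        """L1 spectral loss that penalises the human voice band more.

        pred_spec, target_spec: magnitude spectrograms, shape (n_freq, n_frames).
        freqs: centre frequency in Hz of each of the n_freq bins.
        """
        err = np.abs(pred_spec - target_spec)            # element-wise L1 error
        in_band = (freqs >= band[0]) & (freqs <= band[1])
        loss_human = err[in_band].mean()                 # bins in 300-4000 Hz
        loss_other = err[~in_band].mean()                # remaining bins
        return alpha * loss_human + (1 - alpha) * loss_other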
The acoustic feature is typically a spectrogram or a spectrogram on the mel scale. These features capture the time-frequency relation of the speech signal, and they are sufficient for generating intelligible outputs. The mel-frequency cepstrum feature used in the speech recognition task is not suitable for speech synthesis because it discards too much information.
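For example, a mel-scale spectrogram of the kind used as an acoustic feature can be computed with a library such as librosa; the parameter values below (80 mel bands, a 1024-sample FFT, a 256-sample hop) are common choices rather than requirements:

    import librosa
    import numpy as np

    # Load a waveform; 22.05 kHz is a common sampling rate in TTS corpora.
    y, sr = librosa.load("speech.wav", sr=22050)

    # Mel-scaled spectrogram capturing the time-frequency structure of speech.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression
    print(log_mel.shape)  # (80, n_frames)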


Brief history

In September 2016, DeepMind proposed WaveNet, a deep generative model of raw audio waveforms, demonstrating that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features like spectrograms or mel-spectrograms. Although WaveNet was initially considered to be too computationally expensive and slow to be used in consumer products at the time, a year after its release DeepMind unveiled a modified version known as Parallel WaveNet, a production model 1,000 times faster than the original. In early 2017, Mila proposed char2wav, a model to produce raw waveforms in an end-to-end manner. In the same year, Google and Facebook proposed Tacotron and VoiceLoop, respectively, to generate acoustic features directly from the input text; months later, Google proposed Tacotron2, which combined the WaveNet vocoder with the revised Tacotron architecture to perform end-to-end speech synthesis. Tacotron2 can generate high-quality speech approaching the human voice. Since then, end-to-end methods have become the most active research topic, as many researchers around the world have begun to recognize the power of end-to-end speech synthesizers.


Semi-supervised learning

Currently, self-supervised learning has gained much attention through better use of unlabelled data. Research has shown that, with the aid of self-supervised loss, the need for paired data decreases.
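As a generic illustration of a self-supervised loss on unlabelled audio (this masked-frame reconstruction objective is a common pattern, not the method of any specific paper; model is a hypothetical stand-in):

    import numpy as np

    def masked_reconstruction_loss(model, mel, mask_prob=0.15, rng=None):
        """Generic self-supervised loss: hide random mel frames, predict them.

        model: any callable mapping a (n_mels, n_frames) array to an array of
               the same shape; mel: a log-mel spectrogram of unlabelled audio.
        """
        rng = rng or np.random.default_rng()
        mask = rng.random(mel.shape[1]) < mask_prob   # choose frames to hide
        corrupted = mel.copy()
        corrupted[:, mask] = 0.0                      # zero out the chosen frames
        predicted = model(corrupted)
        # L1 reconstruction error, measured only on the masked frames.
        return np.abs(predicted[:, mask] - mel[:, mask]).mean()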


Zero-shot speaker adaptation

Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristics. In June 2018, Google proposed using pre-trained speaker verification models as speaker encoders to extract speaker embeddings. The speaker encoders then become part of the neural text-to-speech model, allowing it to determine the style and characteristics of the output speech. This procedure has shown the community that it is possible to use a single model to generate speech in multiple styles.
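A minimal sketch of this conditioning scheme, where speaker_encoder and tts_model are hypothetical stand-ins for a pre-trained speaker verification network and a neural text-to-speech model:

    def synthesize(text, reference_audio, speaker_encoder, tts_model):
        """Zero-shot speaker adaptation: condition TTS on a speaker embedding.

        speaker_encoder: pre-trained speaker verification model mapping audio
                         to a fixed-size embedding (e.g. a 256-dim vector).
        tts_model:       text-to-speech model that accepts that embedding.
        Both are stand-ins for real trained networks.
        """
        # The embedding captures the target speaker's style and characteristics;
        # the encoder's weights stay frozen while the TTS model is trained.
        embedding = speaker_encoder(reference_audio)
        return tts_model(text, speaker_embedding=embedding)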


Neural vocoder

In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent speech quality. WaveNet factorises the joint probability of a waveform \mathbf{x} = \{x_1, \dots, x_T\} as a product of conditional probabilities p_\theta(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}), where \theta is the model parameter, including many dilated convolution layers. Each audio sample x_t is thus conditioned on the samples at all previous timesteps. However, the auto-regressive nature of WaveNet makes the inference process dramatically slow. To solve this problem, Parallel WaveNet was proposed. Parallel WaveNet is an inverse autoregressive flow-based model trained by knowledge distillation with a pre-trained teacher WaveNet model. Since such inverse autoregressive flow-based models are non-auto-regressive when performing inference, the inference speed is faster than real-time.
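A toy sketch of the sample-by-sample loop implied by this factorisation helps show why autoregressive inference is slow; predict_distribution is a hypothetical stand-in for the trained network:

    import numpy as np

    def autoregressive_generate(predict_distribution, n_samples, receptive_field):
        """Toy WaveNet-style sampling: each x_t is drawn from
        p(x_t | x_1, ..., x_{t-1}). predict_distribution maps a context
        window to probabilities over, say, 256 mu-law quantisation levels.
        """
        x = np.zeros(n_samples, dtype=np.int64)
        for t in range(n_samples):                       # one network call per sample
            context = x[max(0, t - receptive_field):t]   # all previous samples
            probs = predict_distribution(context)        # categorical distribution
            x[t] = np.random.choice(len(probs), p=probs)
        return x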
Meanwhile, Nvidia proposed the flow-based WaveGlow model, which can also generate speech faster than real-time. However, despite the high inference speed, Parallel WaveNet has the limitation of needing a pre-trained WaveNet model, and WaveGlow takes many weeks to converge with limited computing resources. This issue has been solved by Parallel WaveGAN, which learns to produce speech through a multi-resolution spectral loss and GAN learning strategies.
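A sketch of a multi-resolution spectral loss in the spirit of Parallel WaveGAN; the three (n_fft, hop_length) configurations and the combination of spectral convergence with log-magnitude L1 follow common practice rather than a specific codebase:

    import numpy as np
    import librosa

    def stft_loss(pred, target, n_fft, hop_length):
        """Spectral convergence + log STFT magnitude loss at one resolution."""
        p = np.abs(librosa.stft(pred, n_fft=n_fft, hop_length=hop_length))
        t = np.abs(librosa.stft(target, n_fft=n_fft, hop_length=hop_length))
        sc = np.linalg.norm(t - p) / (np.linalg.norm(t) + 1e-8)   # spectral convergence
        mag = np.abs(np.log(t + 1e-8) - np.log(p + 1e-8)).mean()  # log-magnitude L1
        return sc + mag

    def multi_resolution_stft_loss(pred, target):
        """Average the STFT loss over several analysis resolutions."""
        configs = [(512, 128), (1024, 256), (2048, 512)]  # (n_fft, hop_length)
        return sum(stft_loss(pred, target, n, h) for n, h in configs) / len(configs)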

