In signal processing, feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that is typically applied in a speaker-adaptive way: fMLLR maps acoustic features to speaker-adapted features by multiplying them with a transformation matrix. In some literature, fMLLR is also known as Constrained Maximum Likelihood Linear Regression (cMLLR).

Overview

fMLLR transformations are trained in a maximum likelihood sense on adaptation data. Such transformations could be estimated in many ways, but fMLLR considers only maximum likelihood (ML) estimation: the transformation is trained on a particular set of adaptation data so that it maximizes the likelihood of that data given the current model set.

This technique is a widely used approach for speaker adaptation in HMM-based speech recognition. Later research also shows that fMLLR is an excellent acoustic feature for DNN/HMM hybrid speech recognition models.

The advantages of fMLLR include the following:
* The adaptation process can be performed in a pre-processing phase and is independent of the ASR training and decoding process.
* The adapted features can be fed to deep neural networks (DNN) in place of the traditionally used mel-spectrogram in end-to-end speech recognition models.
* The speaker adaptation performed by fMLLR leads to a significant performance boost for ASR models, so it outperforms other transforms or features such as MFCCs (Mel-frequency cepstral coefficients) and FBANK (filter bank) coefficients.
* fMLLR features can be computed efficiently with speech toolkits such as Kaldi.

The major disadvantage of fMLLR:
* When the amount of adaptation data is limited, the transformation matrices tend to overfit the given data.
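The pre-processing nature of the adaptation can be illustrated with a minimal sketch. It assumes that one transform per speaker has already been estimated and is stored as a D x (D+1) matrix W = [A b] acting on the feature with a 1 appended. The function names, the dictionaries, and the toy values are hypothetical and are not part of any toolkit.

```python
import numpy as np

def apply_fmllr(feats, W):
    """Apply an affine transform W = [A b] to a (T, D) feature matrix."""
    T = feats.shape[0]
    # Append a constant 1 to every frame so that W @ [x; 1] = A x + b.
    extended = np.hstack([feats, np.ones((T, 1))])
    return extended @ W.T

def adapt_corpus(utt_feats, utt2spk, spk_transforms):
    """Map every utterance through the transform of its speaker."""
    return {utt: apply_fmllr(x, spk_transforms[utt2spk[utt]])
            for utt, x in utt_feats.items()}

# Toy data: two speakers, 3-dimensional features (all values are made up).
rng = np.random.default_rng(0)
D = 3
utt_feats = {"utt1": rng.normal(size=(5, D)), "utt2": rng.normal(size=(4, D))}
utt2spk = {"utt1": "spkA", "utt2": "spkB"}
spk_transforms = {
    "spkA": np.hstack([np.eye(D), np.zeros((D, 1))]),       # identity: x -> x
    "spkB": np.hstack([2.0 * np.eye(D), np.ones((D, 1))]),  # x -> 2x + 1
}
adapted = adapt_corpus(utt_feats, utt2spk, spk_transforms)
```

Because the adapted features are ordinary feature matrices, they can be written to disk once and then used by any acoustic model without changes to its training or decoding code.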

Computing fMLLR transform

The fMLLR feature transform can be computed with the open source speech toolkit Kaldi. The Kaldi script uses the standard estimation scheme described in Appendix B of the original paper, in particular Appendix B.1, "Direct method over rows".

In the Kaldi formulation, fMLLR is an affine feature transform of the form $x \rightarrow A x + b$, which can be written as $x \rightarrow W \hat{x}$, where

$$\hat{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}$$

is the acoustic feature $x$ with a 1 appended. Note that this differs from some of the literature, where the 1 comes first:

$$\hat{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}$$

The sufficient statistics stored are:

$$K = \sum_{t,j,m} \gamma_{j,m}(t) \, \Sigma_{jm}^{-1} \mu_{jm} \, {x(t)^{+}}^{T}$$

where $\Sigma_{jm}^{-1}$ is the inverse covariance matrix, $\mu_{jm}$ is the mean and $\gamma_{j,m}(t)$ is the occupation probability of Gaussian $m$ of state $j$ at time $t$, and $x(t)^{+}$ is the feature $x(t)$ with a 1 appended, as in $\hat{x}$. And for $0 \leq i < D$, where $D$ is the feature dimension:

$$G^{(i)} = \sum_{t,j,m} \gamma_{j,m}(t) \left( \frac{1}{\sigma^{2}_{j,m}(i)} \right) x(t)^{+} \, {x(t)^{+}}^{T}$$

where $\sigma^{2}_{j,m}(i)$ is the $i$-th diagonal element of the covariance matrix.
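The accumulation of the two statistics can be sketched in NumPy as follows. This is an illustration under simplifying assumptions, not Kaldi's implementation: the state and mixture indices are flattened into a single Gaussian index, the covariances are diagonal, and the posteriors are taken as given. The function name and the toy data are hypothetical.

```python
import numpy as np

def accumulate_fmllr_stats(feats, posteriors, means, variances):
    """Accumulate K (D x (D+1)) and G (D x (D+1) x (D+1)).

    feats:      (T, D) frames x(t)
    posteriors: (T, M) occupation probabilities gamma_m(t)
    means:      (M, D) Gaussian means
    variances:  (M, D) diagonal variances sigma^2_m(i)
    """
    T, D = feats.shape
    x_ext = np.hstack([feats, np.ones((T, 1))])   # x(t)^+ with the 1 last
    inv_var = 1.0 / variances                      # diagonal of Sigma^{-1}
    # Row t holds sum_m gamma_m(t) * Sigma_m^{-1} mu_m.
    weighted_means = posteriors @ (inv_var * means)
    K = weighted_means.T @ x_ext
    # Entry (t, i) holds sum_m gamma_m(t) / sigma^2_m(i).
    weights = posteriors @ inv_var
    # G[i] = sum_t weights[t, i] * x(t)^+ x(t)^{+T}
    G = np.einsum("ti,tj,tk->ijk", weights, x_ext, x_ext)
    return K, G

# Toy adaptation data (all values are made up).
rng = np.random.default_rng(1)
T, D, M = 6, 2, 3
feats = rng.normal(size=(T, D))
posteriors = rng.random((T, M))
posteriors /= posteriors.sum(axis=1, keepdims=True)
means = rng.normal(size=(M, D))
variances = rng.uniform(0.5, 2.0, size=(M, D))
K, G = accumulate_fmllr_stats(feats, posteriors, means, variances)
```

With these shapes, K has one row per feature dimension, and each G[i] is a symmetric (D+1) x (D+1) matrix that is used when row i of W is updated.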

For a thorough review that explains fMLLR and the commonly used estimation techniques, see the original paper, "Maximum likelihood linear transformations for HMM-based speech recognition". Note that the Kaldi script that computes the fMLLR feature transform differs from the paper by using a column of the inverse in place of the cofactor row. In other words, the determinant factor is ignored, since it does not affect the resulting transform and can cause numerical underflow or overflow.
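The relation between the cofactor row and the column of the inverse follows from the identity cof(A) = det(A) (A^{-1})^T. The short NumPy check below illustrates it on an arbitrary 3 x 3 matrix; the helper function and the matrix are made up for this example and do not come from Kaldi.

```python
import numpy as np

def cofactor_row(A, i):
    """Row i of the cofactor matrix of A, computed from signed minors."""
    D = A.shape[0]
    row = np.empty(D)
    for j in range(D):
        # Delete row i and column j, then take the signed determinant.
        minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
        row[j] = (-1) ** (i + j) * np.linalg.det(minor)
    return row

A = np.array([[2.0, 1.0, 0.0],
              [0.5, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
i = 1
inverse_column = np.linalg.inv(A)[:, i]
# The two vectors differ only by the scalar det(A).
ratio = cofactor_row(A, i) / inverse_column
```

Every entry of `ratio` equals det(A), so the column of the inverse carries the same direction as the cofactor row while avoiding the determinant, whose magnitude can become very large or very small for high-dimensional features.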

Comparing with other features or transforms

Experimental results show that using fMLLR features in speech recognition yields consistent improvements over other acoustic features on various commonly used benchmark datasets (TIMIT, LibriSpeech, etc.). In particular, fMLLR features outperform MFCC and FBANK coefficients, mainly because of the speaker adaptation that fMLLR performs.

Phoneme error rates (PER, %) have been reported for the test set of TIMIT with various neural architectures. As expected, fMLLR features outperform MFCC and FBANK coefficients regardless of the model architecture. The MLP (multi-layer perceptron) serves as a simple baseline, whereas RNN, LSTM, and GRU are all well-known recurrent models. The Li-GRU architecture is based on a single gate and thus saves 33% of the computations of a standard GRU model; it also effectively addresses the vanishing gradient problem of recurrent models. The best performance is obtained with the Li-GRU model on fMLLR features.
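For readers unfamiliar with the metric, the phoneme error rate is commonly computed as the total edit distance between reference and hypothesis phoneme sequences, divided by the total number of reference phonemes. The sketch below assumes that definition; the function names and the toy sequences are made up and do not reproduce any reported result.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """PER in percent over a list of reference/hypothesis pairs."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total

# Toy example: one substitution and one deletion over 8 reference phonemes.
refs = [["sil", "k", "ae", "t", "sil"], ["d", "ao", "g"]]
hyps = [["sil", "k", "ah", "t", "sil"], ["d", "ao"]]
per = phoneme_error_rate(refs, hyps)
```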

Extract fMLLR features with Kaldi

fMLLR features can be extracted as in the s5 recipe of Kaldi. Kaldi scripts can extract fMLLR features on different datasets; below are basic example steps for extracting fMLLR features from the open source speech corpus LibriSpeech. Note that the instructions below are for the subsets train-clean-100, train-clean-360, dev-clean, and test-clean, but they can easily be extended to the other sets dev-other, test-other, and train-other-500.

# These instructions are based on the code provided in a GitHub repository that contains Kaldi recipes for the LibriSpeech corpus. To run the fMLLR feature extraction process, replace the files under $KALDI_ROOT/egs/librispeech/s5/ with the files in the repository.
# Install Kaldi.
# Install Kaldiio.
# If running on a single machine, change the following lines in $KALDI_ROOT/egs/librispeech/s5/cmd.sh to replace queue.pl with run.pl:
    export train_cmd="run.pl --mem 2G"
    export decode_cmd="run.pl --mem 4G"
    export mkgraph_cmd="run.pl --mem 8G"
# Change the data path in run.sh to your LibriSpeech data path; the directory LibriSpeech/ should be under that path. For example:
    data=/media/user/SSD # example path
# Install flac with:
    sudo apt-get install flac
# Run the Kaldi recipe run.sh for LibriSpeech at least until Stage 13 (included). For simplicity, you can use the modified run.sh.
# Copy the exp/tri4b/trans.* files into exp/tri4b/decode_tgsmall_train_clean_*/ with the following command:
    mkdir exp/tri4b/decode_tgsmall_train_clean_100 && cp exp/tri4b/trans.* exp/tri4b/decode_tgsmall_train_clean_100/
# Compute the fMLLR features by running the following script (the script is also available for download):
    #!/bin/bash

    . ./cmd.sh   ## You'll want to change cmd.sh to something that will work on your system.
    . ./path.sh  ## Source the tools/utils (import the queue.pl)

    gmmdir=exp/tri4b

    for chunk in dev_clean test_clean train_clean_100 train_clean_360 ; do
        dir=fmllr/$chunk
        steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
            --transform-dir $gmmdir/decode_tgsmall_$chunk \
            $dir data/$chunk $gmmdir $dir/log $dir/data || exit 1

        compute-cmvn-stats --spk2utt=ark:data/$chunk/spk2utt scp:fmllr/$chunk/feats.scp ark:$dir/data/cmvn_speaker.ark
    done
# Compute alignments using:
    # alignments on dev_clean and test_clean
    steps/align_fmllr.sh --nj 10 data/dev_clean data/lang exp/tri4b exp/tri4b_ali_dev_clean
    steps/align_fmllr.sh --nj 10 data/test_clean data/lang exp/tri4b exp/tri4b_ali_test_clean
    steps/align_fmllr.sh --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100
    steps/align_fmllr.sh --nj 30 data/train_clean_360 data/lang exp/tri4b exp/tri4b_ali_clean_360
# Apply CMVN and dump the fMLLR features to new .ark files (the script is also available for download):
    #!/bin/bash

    data=/user/kaldi/egs/librispeech/s5  ## You'll want to change this path to something that will work on your system.

    rm -rf $data/fmllr_cmvn/
    mkdir $data/fmllr_cmvn/

    for part in dev_clean test_clean train_clean_100 train_clean_360; do
        mkdir $data/fmllr_cmvn/$part/
        apply-cmvn --utt2spk=ark:$data/fmllr/$part/utt2spk ark:$data/fmllr/$part/data/cmvn_speaker.ark scp:$data/fmllr/$part/feats.scp ark:- | add-deltas --delta-order=0 ark:- ark:$data/fmllr_cmvn/$part/fmllr_cmvn.ark
    done

    du -sh $data/fmllr_cmvn/*
    echo "Done!"
# Use a Python script to convert the Kaldi-generated .ark features to .npy for your own dataloader. An example Python script is provided:
    python ark2libri.py
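The normalization performed in the CMVN step can be sketched in NumPy. The sketch assumes that apply-cmvn, called without --norm-vars=true as in the script above, subtracts the per-speaker mean only, and that add-deltas with --delta-order=0 passes the features through without appending derivatives. It works on in-memory arrays rather than Kaldi archives; the function name, the utterance keys, and the output directory are hypothetical, and this is not the ark2libri.py script.

```python
import os
import tempfile

import numpy as np

def speaker_mean_normalize(utt_feats, utt2spk):
    """Subtract from every frame the mean vector of its speaker."""
    sums, counts = {}, {}
    # Pool all frames of each speaker to get one mean vector per speaker.
    for utt, x in utt_feats.items():
        spk = utt2spk[utt]
        sums[spk] = sums.get(spk, 0.0) + x.sum(axis=0)
        counts[spk] = counts.get(spk, 0) + x.shape[0]
    means = {spk: sums[spk] / counts[spk] for spk in sums}
    return {utt: x - means[utt2spk[utt]] for utt, x in utt_feats.items()}

# Toy features: two utterances of one speaker, one of another (made-up values).
rng = np.random.default_rng(2)
utt_feats = {
    "spkA-utt1": rng.normal(loc=3.0, size=(7, 4)),
    "spkA-utt2": rng.normal(loc=3.0, size=(5, 4)),
    "spkB-utt1": rng.normal(loc=-1.0, size=(6, 4)),
}
utt2spk = {"spkA-utt1": "spkA", "spkA-utt2": "spkA", "spkB-utt1": "spkB"}
normalized = speaker_mean_normalize(utt_feats, utt2spk)

# Store one .npy file per utterance, as a dataloader might expect.
out_dir = tempfile.mkdtemp()
for utt, x in normalized.items():
    np.save(os.path.join(out_dir, utt + ".npy"), x)
```

Saving one file per utterance is only one possible layout; a single archive keyed by utterance ID would serve equally well for small datasets.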
