## Speech Model Tests

Sample test data is provided for the speech-related models in TensorFlow Lite
so that users working with speech models can verify and test them.

### Models, Inputs, and Outputs

[ASR AM model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model.tflite)

[ASR AM quantized model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_int8.tflite)

[ASR AM test inputs](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_in.csv)

[ASR AM test outputs](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_out.csv)

[ASR AM int8 test outputs](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_int8_out.csv)

The models below are not maintained.

[Speech hotword model (SVDF rank=1)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank1_2017_11_14.tflite)

[Speech hotword model (SVDF rank=2)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank2_2017_11_14.tflite)

[Speaker-id model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_speakerid_model_2017_11_14.tflite)

[TTS model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_tts_model_2017_11_14.tflite)

### Test Bench

[Model tests](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/models/speech_test.cc)

To run the tests, download the ASR AM models and their test input and output
files to the `models/testdata` directory.
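
As a convenience, the download step could be scripted roughly as follows. This is only a sketch: the base URL and file names come from the links listed above, while the helper names (`download_plan`, `fetch_all`) and the assumption that `models/testdata` is relative to the current working directory are illustrative and not part of TensorFlow Lite.

```python
import os

# Base URL and file names are copied from the links in this README.
BASE_URL = "https://storage.googleapis.com/download.tensorflow.org/models/tflite/"
ASR_AM_FILES = [
    "speech_asr_am_model.tflite",
    "speech_asr_am_model_int8.tflite",
    "speech_asr_am_model_in.csv",
    "speech_asr_am_model_out.csv",
    "speech_asr_am_model_int8_out.csv",
]


def download_plan(dest_dir="models/testdata"):
    """Return (url, local_path) pairs for every ASR AM test file."""
    return [(BASE_URL + name, os.path.join(dest_dir, name)) for name in ASR_AM_FILES]


def fetch_all(dest_dir="models/testdata"):
    """Download each file that is not already present (needs network access)."""
    from urllib.request import urlretrieve

    # Assumption: dest_dir is resolved against the current working directory.
    os.makedirs(dest_dir, exist_ok=True)
    for url, path in download_plan(dest_dir):
        if not os.path.exists(path):
            urlretrieve(url, path)
```

Calling `fetch_all()` from the directory that contains `models/` would populate `models/testdata`; `download_plan()` can be inspected first to check the URLs without touching the network.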


## Speech Model Architectures

For the hotword, speaker-id, and automatic speech recognition sample models, the
architecture assumes that the models receive their input from a speech
pre-processing module. This module receives the audio signal and applies
typical signal processing algorithms, such as FFT and spectral subtraction, to
produce features for the encoder neural network. Its final output is a log-mel
filterbank (the log of the triangular mel filters applied to the power
spectra). The text-to-speech model assumes that its inputs are linguistic
features describing characteristics of phonemes, syllables, words, phrases, and
sentences. Its outputs are acoustic features, including mel-cepstral
coefficients, log fundamental frequency, and band aperiodicity. The
pre-processing modules for these models are not provided in the open source
version of TensorFlow Lite.
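
Since the pre-processing module is not provided, the following is only a generic sketch of the last step described above: applying triangular mel filters to a power spectrum and taking the log. It is not the front end used to train these models. The mel-scale formula is the common `2595 * log10(1 + f / 700)` variant, and the function names, the 16 kHz sample rate, and the 257-bin spectrum in the usage note are all assumptions. FFT and spectral subtraction are left out.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (a common formula variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Build triangular filters over num_bins power-spectrum bins (0..Nyquist)."""
    nyquist = sample_rate / 2.0
    # num_filters + 2 points equally spaced on the mel scale give every
    # triangle a left edge, a centre, and a right edge.
    top_mel = hz_to_mel(nyquist)
    mel_points = [top_mel * i / (num_filters + 1) for i in range(num_filters + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    bin_hz = [nyquist * k / (num_bins - 1) for k in range(num_bins)]
    filters = []
    for m in range(1, num_filters + 1):
        left, centre, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= centre:
                row.append((f - left) / (centre - left))    # rising slope
            elif centre < f < right:
                row.append((right - f) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def log_mel(power_spectrum, filters, floor=1e-10):
    """Apply each triangular filter to the power spectrum and take the log."""
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in filters]
    # The floor avoids log(0) for silent frames; its value is an assumption.
    return [math.log(max(e, floor)) for e in energies]
```

For example, `log_mel(spectrum, mel_filterbank(40, 257, 16000))` turns one 257-bin power spectrum into 40 features, which matches the input size quoted for the hotword model below. Whether the original front end used these exact settings is not stated in this README.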

The following sections describe the architecture of the sample models at a high
level:

### Hotword Model

The hotword model is the neural network model we use for keyphrase/hotword
spotting (i.e. "Ok Google" detection). It is the entry point for voice
interaction, for example in the Google search app on Android devices or on
Google Home. The block diagram of the speech hotword model is shown in the
figure below. It has an input size of 40 (float), an output size of 7 (float),
one SVDF layer, and four fully connected layers, with the corresponding
parameters as shown in the figure.

![hotword_model](hotword.svg "Hotword model")

### Speaker-id Model

The speaker-id model is the neural network model we use for speaker
verification. It runs after the hotword triggers. The block diagram of the
speech speaker-id model is shown in the figure below. It has an input size of
80 (float), an output size of 64 (float), three LSTM layers, and one fully
connected layer, with the corresponding parameters as shown in the figure.

![speakerid_model](speakerid.svg "Speaker-id model")

### Text-to-speech (TTS) Model

The text-to-speech model is the neural network model used to generate speech
from text. Its block diagram is shown in the figure below. It has an input size
of 334 (float), an output size of 196 (float), two fully connected layers,
three LSTM layers, and one recurrent layer, with the corresponding parameters
as shown in the figure.

![tts_model](tts.svg "TTS model")

### Automatic Speech Recognizer (ASR) Acoustic Model (AM)

The acoustic model for automatic speech recognition is the neural network model
for matching phonemes to the input audio features. It generates posterior
probabilities of phonemes from speech frontend features (log-mel filterbanks).
It has an input size of 320 (float), an output size of 42 (float), five LSTM
layers, and one fully connected layer with a Softmax activation function, with
the corresponding parameters as shown in the figure.

![asr_am_model](asr_am.svg "ASR AM model")
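
To illustrate what the Softmax output stage does, here is a small standalone sketch that turns 42 raw scores into posterior probabilities and picks the most likely class. The logits are made up for illustration, and nothing here says which output index corresponds to which phoneme, since this README does not specify that mapping.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


NUM_PHONEME_OUTPUTS = 42  # output size stated in the paragraph above

# Made-up scores for a single frame (illustrative only).
logits = [0.0] * NUM_PHONEME_OUTPUTS
logits[7] = 3.0

posteriors = softmax(logits)
# Index of the class with the highest posterior probability.
best = max(range(NUM_PHONEME_OUTPUTS), key=posteriors.__getitem__)
```

The posteriors are non-negative and sum to one, which is what lets a downstream decoder treat them as per-frame phoneme probabilities.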

### Automatic Speech Recognizer (ASR) Language Model (LM)

The language model for automatic speech recognition is the neural network model
for predicting the probability of a word given the previous words in a
sentence. It generates posterior probabilities of the next word based on a
sequence of words. The words are encoded as indices in a fixed-size dictionary.
The model has two inputs, both of size one (integer): the current word index
and the next word index. It has one output of size one (float): the log
probability. It consists of three embedding layers and three LSTM layers,
followed by a multiplication, a fully connected layer, and an addition.
The corresponding parameters are shown in the figure.

![asr_lm_model](asr_lm.svg "ASR LM model")
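
The input/output contract described above (current word index and next word index in, one log probability out) suggests how a sentence could be scored: encode the words as indices, then add up the log probabilities of consecutive pairs. The sketch below shows only that bookkeeping. The vocabulary, the `<unk>` token, and `uniform_scorer` are hypothetical stand-ins; the real model is recurrent, so its score also depends on LSTM state carried across calls, which this stand-in ignores.

```python
import math


def encode(words, vocab, unk="<unk>"):
    """Map words to integer indices in a fixed-size dictionary."""
    return [vocab.get(w, vocab[unk]) for w in words]


def sentence_log_prob(indices, score_pair):
    """Sum the log probability of each next word over consecutive pairs."""
    return sum(score_pair(cur, nxt) for cur, nxt in zip(indices, indices[1:]))


def uniform_scorer(vocab_size):
    """Hypothetical stand-in for the model: every next word is equally likely."""
    return lambda current, nxt: -math.log(vocab_size)


# Toy vocabulary (an assumption, not the dictionary used by the sample model).
vocab = {"<unk>": 0, "ok": 1, "google": 2, "play": 3}
ids = encode(["ok", "google", "play", "jazz"], vocab)  # "jazz" maps to <unk>
total = sentence_log_prob(ids, uniform_scorer(len(vocab)))
```

In a real setup, `score_pair` would wrap an interpreter invocation on the LM model instead of returning a constant.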

### Endpointer Model

The endpointer model is the neural network model for predicting the end of
speech in an utterance. More precisely, it generates posterior probabilities of
various events that allow detection of speech start and end events.
It has an input size of 40 (float), consisting of speech frontend features
(log-mel filterbanks), and an output size of four, corresponding to:
speech, intermediate non-speech, initial non-speech, and final non-speech.
The model consists of a convolutional layer, followed by a fully connected
layer, two LSTM layers, and two additional fully connected layers.
The corresponding parameters are shown in the figure.

![endpointer_model](endpointer.svg "Endpointer model")
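
One way the four posteriors could be consumed is sketched below: label each frame by its highest posterior, and declare end of speech once the final non-speech posterior stays above a threshold for a few consecutive frames. This decision rule, the threshold, and the run length are assumptions for illustration, as is the output ordering, which is taken from the order in which the classes are listed above.

```python
# Assumed output order: the order in which the classes are listed above.
LABELS = ("speech", "intermediate non-speech", "initial non-speech", "final non-speech")
FINAL_NON_SPEECH = 3


def classify_frame(posteriors):
    """Return the label with the highest posterior for one frame."""
    if len(posteriors) != len(LABELS):
        raise ValueError("expected %d posteriors" % len(LABELS))
    best = max(range(len(LABELS)), key=lambda i: posteriors[i])
    return LABELS[best]


def find_end_of_speech(frames, threshold=0.5, min_frames=3):
    """Return the start index of the first run of min_frames consecutive
    frames whose final non-speech posterior exceeds threshold, or None."""
    run = 0
    for i, posteriors in enumerate(frames):
        if posteriors[FINAL_NON_SPEECH] > threshold:
            run += 1
            if run >= min_frames:
                return i - min_frames + 1
        else:
            run = 0
    return None
```

Requiring several consecutive frames is a simple guard against a single noisy frame ending the utterance early; a production endpointer would likely use more elaborate smoothing.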