## Speech Model Tests

Sample test data has been provided for speech-related models in TensorFlow Lite
to help users working with speech models to verify and test their models.

### Models, Inputs, and Outputs

[ASR AM model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model.tflite)

[ASR AM quantized model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_int8.tflite)

[ASR AM test inputs](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_in.csv)

[ASR AM test outputs](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_out.csv)

[ASR AM int8 test outputs](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_asr_am_model_int8_out.csv)

The models below are not maintained.

[Speech hotword model (SVDF rank=1)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank1_2017_11_14.tflite)

[Speech hotword model (SVDF rank=2)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank2_2017_11_14.tflite)

[Speaker-id model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_speakerid_model_2017_11_14.tflite)

[TTS model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_tts_model_2017_11_14.tflite)

### Test Bench

[Model tests](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/models/speech_test.cc)

Download the ASR AM test models, inputs, and output files to the
`models/testdata` directory to run the tests.

## Speech Model Architectures

For the hotword, speaker-id, and automatic speech recognition sample models, the
architecture assumes that the models receive their input from a speech
pre-processing module.
The speech pre-processing module receives the audio
signal and produces features for the encoder neural network. It uses
typical signal processing algorithms, such as FFT and spectral subtraction, and
ultimately produces a log-mel filterbank (the log of the triangular mel filters
applied to the power spectra). The text-to-speech model assumes that the inputs
are linguistic features describing characteristics of phonemes, syllables,
words, phrases, and sentences. The outputs are acoustic features including
mel-cepstral coefficients, log fundamental frequency, and band aperiodicity.
The pre-processing modules for these models are not provided in the open source
version of TensorFlow Lite.

The following sections describe the architecture of the sample models at a high
level:

### Hotword Model

The hotword model is the neural network model we use for keyphrase/hotword
spotting (i.e. "okgoogle" detection). It is the entry point for voice
interaction (e.g. the Google search app on Android devices or Google Home).
The speech hotword model block diagram is shown in the figure below. It has an
input size of 40 (float), an output size of 7 (float), one SVDF layer, and four
fully connected layers with the corresponding parameters as shown in the figure.

![Hotword model](hotword.svg)

### Speaker-id Model

The speaker-id model is the neural network model we use for speaker
verification. It runs after the hotword triggers. The speech speaker-id model
block diagram is shown in the figure below. It has an input size of 80 (float),
an output size of 64 (float), three LSTM layers, and one fully connected layer
with the corresponding parameters as shown in the figure.

![Speaker-id model](speakerid.svg)

### Text-to-speech (TTS) Model

The text-to-speech model is the neural network model used to generate speech
from text. The speech text-to-speech model's block diagram is shown
in the figure below.
It has an input size of 334 (float), an output size of 196
(float), two fully connected layers, three LSTM layers, and one recurrent layer
with the corresponding parameters as shown in the figure.

![TTS model](tts.svg)

### Automatic Speech Recognizer (ASR) Acoustic Model (AM)

The acoustic model for automatic speech recognition is the neural network model
for matching phonemes to the input audio features. It generates posterior
probabilities of phonemes from speech frontend features (log-mel filterbanks).
It has an input size of 320 (float), an output size of 42 (float), five LSTM
layers, and one fully connected layer with a Softmax activation function, with
the corresponding parameters as shown in the figure.

![ASR AM model](asr_am.svg)

### Automatic Speech Recognizer (ASR) Language Model (LM)

The language model for automatic speech recognition is the neural network model
for predicting the probability of a word given previous words in a sentence.
It generates posterior probabilities of the next word based on a sequence of
words. The words are encoded as indices in a fixed-size dictionary.
The model has two inputs, each of size one (integer): the current word index and
the next word index. It has one output of size one (float): the log probability.
It consists of three embedding layers, three LSTM layers, followed by a
multiplication, a fully connected layer, and an addition.
The corresponding parameters are shown in the figure.

![ASR LM model](asr_lm.svg)

### Endpointer Model

The endpointer model is the neural network model for predicting end of speech
in an utterance. More precisely, it generates posterior probabilities of various
events that allow detection of speech start and end events.
It has an input size of 40 (float), which are speech frontend features
(log-mel filterbanks), and an output size of four corresponding to:
speech, intermediate non-speech, initial non-speech, and final non-speech.
The model consists of a convolutional layer, followed by a fully connected
layer, two LSTM layers, and two additional fully connected layers.
The corresponding parameters are shown in the figure.

![Endpointer model](endpointer.svg)
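
The input and output sizes quoted in the sections above, together with the endpointer's four output classes, can be collected into a small lookup table for sanity-checking feature frames before they are fed to a model. The following is a minimal sketch and not part of TensorFlow Lite: the names `MODEL_IO_SIZES`, `ENDPOINTER_LABELS`, `check_frame`, and `endpointer_label` are hypothetical, and the mapping from output index to class assumes the outputs follow the order in which the classes are listed above. The ASR LM is left out because it takes two integer word indices rather than a feature frame.

```python
from typing import Dict, List, Sequence, Tuple

# (input size, output size) per model, as stated in this README.
MODEL_IO_SIZES: Dict[str, Tuple[int, int]] = {
    "hotword": (40, 7),
    "speakerid": (80, 64),
    "tts": (334, 196),
    "asr_am": (320, 42),
    "endpointer": (40, 4),
}

# Assumption: output indices follow the order in which the README lists
# the endpointer classes. Verify against the actual model before relying on it.
ENDPOINTER_LABELS: List[str] = [
    "speech",
    "intermediate non-speech",
    "initial non-speech",
    "final non-speech",
]


def check_frame(model: str, frame: Sequence[float]) -> None:
    """Raise ValueError if a feature frame does not match the model's input size."""
    expected, _ = MODEL_IO_SIZES[model]
    if len(frame) != expected:
        raise ValueError(
            f"{model}: expected {expected} input values, got {len(frame)}"
        )


def endpointer_label(posteriors: Sequence[float]) -> str:
    """Return the label with the highest posterior (a plain argmax)."""
    _, expected = MODEL_IO_SIZES["endpointer"]
    if len(posteriors) != expected:
        raise ValueError(f"expected {expected} posteriors, got {len(posteriors)}")
    best = max(range(expected), key=lambda i: posteriors[i])
    return ENDPOINTER_LABELS[best]


if __name__ == "__main__":
    check_frame("hotword", [0.0] * 40)  # passes silently: 40 values expected
    print(endpointer_label([0.1, 0.2, 0.6, 0.1]))  # prints "initial non-speech"
```

A real endpointer would typically smooth or threshold the posteriors over several frames rather than take a per-frame argmax; the argmax here only illustrates how the four outputs are interpreted.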