# TensorFlow Lite Delegates

## Introduction

**Delegates** enable hardware acceleration of TensorFlow Lite models by
leveraging on-device accelerators such as the GPU and
[Digital Signal Processor (DSP)](https://en.wikipedia.org/wiki/Digital_signal_processor).

By default, TensorFlow Lite utilizes CPU kernels that are optimized for the
[ARM Neon](https://developer.arm.com/documentation/dht0002/a/Introducing-NEON/NEON-architecture-overview/NEON-instructions)
instruction set. However, the CPU is a multi-purpose processor that isn't
necessarily optimized for the heavy arithmetic typically found in Machine
Learning models (for example, the matrix math involved in convolution and dense
layers).

On the other hand, most modern mobile phones contain chips that are better at
handling these heavy operations. Utilizing them for neural network operations
provides huge benefits in terms of latency and power efficiency. For example,
GPUs can provide up to a
[5x speedup](https://blog.tensorflow.org/2020/08/faster-mobile-gpu-inference-with-opencl.html)
in latency, while the
[Qualcomm® Hexagon DSP](https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor)
has been shown to reduce power consumption by up to 75% in our experiments.

Each of these accelerators has an associated API that enables custom
computations, such as [OpenCL](https://www.khronos.org/opencl/) or
[OpenGL ES](https://www.khronos.org/opengles/) for mobile GPUs and the
[Qualcomm® Hexagon SDK](https://developer.qualcomm.com/software/hexagon-dsp-sdk)
for the DSP. Typically, you would have to write a lot of custom code to run a
neural network through these interfaces. Things get even more complicated when
you consider that each accelerator has its pros & cons and cannot execute every
operation in a neural network. TensorFlow Lite's Delegate API solves this
problem by acting as a bridge between the TFLite runtime and these lower-level
APIs.

![runtime with delegates](images/delegate_runtime.png)

## Choosing a Delegate

TensorFlow Lite supports multiple delegates, each of which is optimized for
certain platform(s) and particular types of models. Usually, there will be
multiple delegates applicable to your use-case, depending on two major criteria:
the *Platform* (Android or iOS?) you target, and the *Model-type*
(floating-point or quantized?) that you are trying to accelerate.

### Delegates by Platform

#### Cross-platform (Android & iOS)

*   **GPU delegate** - The GPU delegate can be used on both Android and iOS. It
    is optimized to run 32-bit and 16-bit float based models where a GPU is
    available. It also supports 8-bit quantized models and provides GPU
    performance on par with their float versions. For details on the GPU
    delegate, see [TensorFlow Lite on GPU](gpu_advanced.md). For step-by-step
    tutorials on using the GPU delegate with Android and iOS, see
    [TensorFlow Lite GPU Delegate Tutorial](gpu.md).

#### Android

*   **NNAPI delegate for newer Android devices** - The NNAPI delegate can be
    used to accelerate models on Android devices with a GPU, DSP, and/or NPU
    available. It is available on Android 8.1 (API level 27) or higher. For an
    overview of the NNAPI delegate, step-by-step instructions and best
    practices, see [TensorFlow Lite NNAPI delegate](nnapi.md).
*   **Hexagon delegate for older Android devices** - The Hexagon delegate can
    be used to accelerate models on Android devices with a Qualcomm Hexagon
    DSP. It can be used on devices running older versions of Android that do
    not support NNAPI. See
    [TensorFlow Lite Hexagon delegate](hexagon_delegate.md) for more detail. A
    quick way to try either delegate is sketched below.
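
As a minimal sketch, assuming the
[benchmark tool](https://www.tensorflow.org/lite/performance/measurement)
covered later in this document and a model have already been pushed to
`/data/local/tmp` (both paths here are assumptions), either delegate can be
exercised with a single flag:

```
# NNAPI delegate (Android 8.1+): NNAPI selects the accelerator (GPU/DSP/NPU).
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/your_model.tflite \
  --use_nnapi=true

# Hexagon delegate: also requires the Hexagon shared libraries on the device
# (see the Hexagon delegate documentation linked above).
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/your_model.tflite \
  --use_hexagon=true
```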

#### iOS

*   **Core ML delegate for newer iPhones and iPads** - For newer iPhones and
    iPads where the Neural Engine is available, you can use the Core ML
    delegate to accelerate inference for 32-bit or 16-bit floating-point
    models. The Neural Engine is available on Apple mobile devices with the A12
    SoC or higher. For an overview of the Core ML delegate and step-by-step
    instructions, see [TensorFlow Lite Core ML delegate](coreml_delegate.md).

### Delegates by model type

Each accelerator is designed with a certain bit-width of data in mind. If you
provide a floating-point model to a delegate that only supports 8-bit quantized
operations (such as the [Hexagon delegate](hexagon_delegate.md)), it will reject
all its operations and the model will run entirely on the CPU. To avoid such
surprises, the table below provides an overview of delegate support based on
model type:

**Model Type**                                                                                           | **GPU** | **NNAPI** | **Hexagon** | **CoreML**
-------------------------------------------------------------------------------------------------------- | ------- | --------- | ----------- | ----------
Floating-point (32 bit)                                                                                  | Yes     | Yes       | No          | Yes
[Post-training float16 quantization](post_training_float16_quant.ipynb)                                  | Yes     | No        | No          | Yes
[Post-training dynamic range quantization](post_training_quant.ipynb)                                    | Yes     | Yes       | No          | No
[Post-training integer quantization](post_training_integer_quant.ipynb)                                  | Yes     | Yes       | Yes         | No
[Quantization-aware training](http://www.tensorflow.org/model_optimization/guide/quantization/training)  | Yes     | Yes       | Yes         | No

### Validating performance

The information in this section acts as a rough guideline for shortlisting the
delegates that could improve your application. However, it is important to note
that each delegate has a pre-defined set of operations it supports, and may
perform differently depending on the model and device; for example, the
[NNAPI delegate](nnapi.md) may choose to use Google's Edge-TPU on a Pixel phone
while utilizing a DSP on another device. Therefore, it is usually recommended
that you perform some benchmarking to gauge how useful a delegate is for your
needs. This also helps justify the binary size increase associated with
attaching a delegate to the TensorFlow Lite runtime.
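
One simple approach, sketched below under the assumption that the
[benchmark tool](https://www.tensorflow.org/lite/performance/measurement)
described in the next section and your model are at `/data/local/tmp`, is to
benchmark the same model with and without a delegate and compare the reported
latencies:

```
# Baseline: default CPU execution.
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/your_model.tflite

# Candidate: the same model with the GPU delegate enabled.
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/your_model.tflite \
  --use_gpu=true
```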

TensorFlow Lite has extensive performance and accuracy-evaluation tooling that
can empower developers to be confident in using delegates in their application.
These tools are discussed in the next section.

## Tools for Evaluation

### Latency & memory footprint

TensorFlow Lite’s
[benchmark tool](https://www.tensorflow.org/lite/performance/measurement) can be
used with suitable parameters to estimate model performance, including average
inference latency, initialization overhead, memory footprint, etc. This tool
supports multiple flags to help you figure out the best delegate configuration
for your model. For instance, `--gpu_backend=gl` can be specified with
`--use_gpu` to measure GPU execution with OpenGL. The complete list of supported
delegate parameters is defined in the
[detailed documentation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar).

Here’s an example run for a quantized model with GPU via `adb`:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_v1_224_quant.tflite \
  --use_gpu=true
```
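
To measure the OpenGL backend mentioned above, the same run can additionally
specify `--gpu_backend` (a sketch; paths as in the previous example):

```
# Force the OpenGL backend instead of letting the delegate choose.
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_v1_224_quant.tflite \
  --use_gpu=true \
  --gpu_backend=gl
```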

You can download a pre-built version of this tool for Android (64-bit ARM
architecture)
[here](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_benchmark_model.apk)
([more details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/android)).
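
Note that this pre-built version ships as an APK rather than a raw binary, so
it is invoked through an Android activity. A rough sketch, with the activity
name taken from the benchmark app's documentation linked above and the model
path an assumption:

```
# Install the benchmark app, granting the permissions it needs.
adb install -r -d -g android_aarch64_benchmark_model.apk

# Forward the usual benchmark flags through the "args" string extra.
adb shell am start -S \
  -n org.tensorflow.lite.benchmark/.BenchmarkModelActivity \
  --es args '"--graph=/data/local/tmp/mobilenet_v1_224_quant.tflite --use_gpu=true"'

# The results are printed to the device log.
adb logcat | grep -i "inference"
```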

### Accuracy & correctness

Delegates usually perform computations at a different precision than their CPU
counterparts. As a result, there is a (usually minor) accuracy tradeoff
associated with utilizing a delegate for hardware acceleration. Note that this
isn't *always* true; for example, since the GPU uses floating-point precision to
run quantized models, there might be a slight precision improvement (e.g.,
<1% Top-5 improvement in ILSVRC image classification).

TensorFlow Lite has two types of tooling to measure how accurately a delegate
behaves for a given model: *Task-Based* and *Task-Agnostic*. All the tools
described in this section support the
[advanced delegation parameters](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar)
used by the benchmarking tool from the previous section. Note that the
sub-sections below focus on *delegate evaluation* (Does the delegate perform the
same as the CPU?) rather than model evaluation (Is the model itself good for the
task?).

#### Task-Based Evaluation

TensorFlow Lite has tools to evaluate correctness on two image-based tasks:

*   [ILSVRC 2012](http://image-net.org/challenges/LSVRC/2012/) (Image
    Classification) with
    [top-K accuracy](https://en.wikipedia.org/wiki/Evaluation_measures_\(information_retrieval\)#Precision_at_K)

*   [COCO Object Detection (w/ bounding boxes)](https://cocodataset.org/#detection-2020)
    with
    [mean Average Precision (mAP)](https://en.wikipedia.org/wiki/Evaluation_measures_\(information_retrieval\)#Mean_average_precision)

Prebuilt binaries of these tools (Android, 64-bit ARM architecture), along with
documentation, can be found here:

*   [ImageNet Image Classification](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_imagenet_image_classification)
    ([More details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/imagenet_image_classification))
*   [COCO Object Detection](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_coco_object_detection)
    ([More details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/coco_object_detection))

The example below demonstrates
[image classification evaluation](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/imagenet_image_classification)
with NNAPI utilizing Google's Edge-TPU on a Pixel 4:

```
# --num_images=0 runs the evaluation on all images.
adb shell /data/local/tmp/run_eval \
  --model_file=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --ground_truth_images_path=/data/local/tmp/ilsvrc_images \
  --ground_truth_labels=/data/local/tmp/ilsvrc_validation_labels.txt \
  --model_output_labels=/data/local/tmp/model_output_labels.txt \
  --output_file_path=/data/local/tmp/accuracy_output.txt \
  --num_images=0 \
  --use_nnapi=true \
  --nnapi_accelerator_name=google-edgetpu
```

The expected output is a list of Top-K metrics from 1 to 10:

```
Top-1 Accuracy: 0.733333
Top-2 Accuracy: 0.826667
Top-3 Accuracy: 0.856667
Top-4 Accuracy: 0.87
Top-5 Accuracy: 0.89
Top-6 Accuracy: 0.903333
Top-7 Accuracy: 0.906667
Top-8 Accuracy: 0.913333
Top-9 Accuracy: 0.92
Top-10 Accuracy: 0.923333
```

#### Task-Agnostic Evaluation

For tasks where there isn't an established on-device evaluation tool, or if you
are experimenting with custom models, TensorFlow Lite has the
[Inference Diff](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/inference_diff)
tool (Android, 64-bit ARM architecture binary available
[here](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_inference_diff)).

Inference Diff compares TensorFlow Lite execution (in terms of latency &
output-value deviation) in two settings:

*   Single-threaded CPU Inference
*   User-defined Inference - defined by
    [these parameters](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar)

To do so, the tool generates random Gaussian data and passes it through two
TFLite Interpreters - one running single-threaded CPU kernels, and the other
parameterized by the user's arguments.

It measures the latency of both, as well as the absolute difference between the
output tensors from each Interpreter, on a per-element basis.
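
For instance, a run comparing single-threaded CPU execution against the GPU
delegate might look like the following sketch (flag names mirror the
classification example above; paths are assumptions):

```
adb shell /data/local/tmp/run_eval \
  --model_file=/data/local/tmp/your_model.tflite \
  --output_file_path=/data/local/tmp/inference_diff_output.txt \
  --use_gpu=true
```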

For a model with a single output tensor, the output might look like this:

```
Num evaluation runs: 50
Reference run latency: avg=84364.2(us), std_dev=12525(us)
Test run latency: avg=7281.64(us), std_dev=2089(us)
OutputDiff[0]: avg_error=1.96277e-05, std_dev=6.95767e-06
```

What this means is that for the output tensor at index `0`, the elements from
the CPU output differ from the delegate output by an average of `1.96e-05`.

Note that interpreting these numbers requires deeper knowledge of the model and
of what each output tensor signifies. If it's a simple regression that
determines some sort of score or embedding, the difference should be low
(otherwise it's an error with the delegate). However, outputs like the
'detection class' from SSD models are a little harder to interpret. For
example, such an output might show a difference with this tool, but that does
not necessarily mean something is really wrong with the delegate. Consider two
(fake) classes, "TV (ID: 10)" and "Monitor (ID: 20)": if a delegate is slightly
off from the ground truth and predicts Monitor instead of TV, the output diff
for this tensor might be as high as 20 - 10 = 10.