# TensorFlow Lite Delegates

## Introduction

**Delegates** enable hardware acceleration of TensorFlow Lite models by
leveraging on-device accelerators such as the GPU and
[Digital Signal Processor (DSP)](https://en.wikipedia.org/wiki/Digital_signal_processor).

By default, TensorFlow Lite utilizes CPU kernels that are optimized for the
[ARM Neon](https://developer.arm.com/documentation/dht0002/a/Introducing-NEON/NEON-architecture-overview/NEON-instructions)
instruction set. However, the CPU is a multi-purpose processor that isn't
necessarily optimized for the heavy arithmetic typically found in Machine
Learning models (for example, the matrix math involved in convolution and dense
layers).

On the other hand, most modern mobile phones contain chips that are better at
handling these heavy operations. Utilizing them for neural network operations
provides huge benefits in terms of latency and power efficiency. For example,
GPUs can provide up to a
[5x speedup](https://blog.tensorflow.org/2020/08/faster-mobile-gpu-inference-with-opencl.html)
in latency, while the
[Qualcomm® Hexagon DSP](https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor)
has been shown to reduce power consumption by up to 75% in our experiments.

Each of these accelerators has an associated API that enables custom
computations, such as [OpenCL](https://www.khronos.org/opencl/) or
[OpenGL ES](https://www.khronos.org/opengles/) for the mobile GPU and the
[Qualcomm® Hexagon SDK](https://developer.qualcomm.com/software/hexagon-dsp-sdk)
for the DSP. Typically, you would have to write a lot of custom code to run a
neural network through these interfaces. Things get even more complicated when
you consider that each accelerator has its pros & cons and cannot execute every
operation in a neural network. TensorFlow Lite's Delegate API solves this
problem by acting as a bridge between the TFLite runtime and these lower-level
APIs.

![runtime with delegates](images/delegate_runtime.png)
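In application code, using a delegate amounts to attaching it to the
interpreter before running inference; the delegate then claims the operations
it supports and everything else stays on the CPU kernels. As a rough
illustration, here is a minimal Android (Java) sketch that attaches the GPU
delegate. It assumes you have already loaded your `.tflite` model into a
`MappedByteBuffer` and that the model takes a single float input and output;
see the delegate-specific guides linked below for complete instructions.

```
import java.nio.MappedByteBuffer;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

public class GpuDelegateExample {
  /** Runs a single inference with the GPU delegate attached (sketch only). */
  static void runWithGpu(MappedByteBuffer modelBuffer, float[][] input, float[][] output) {
    GpuDelegate gpuDelegate = new GpuDelegate();
    Interpreter.Options options = new Interpreter.Options().addDelegate(gpuDelegate);
    Interpreter interpreter = new Interpreter(modelBuffer, options);

    // Supported ops execute on the GPU; unsupported ops fall back to the CPU kernels.
    interpreter.run(input, output);

    // Release native resources when done.
    interpreter.close();
    gpuDelegate.close();
  }
}
```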
## Choosing a Delegate

TensorFlow Lite supports multiple delegates, each of which is optimized for
certain platform(s) and particular types of models. Usually, there will be
multiple delegates applicable to your use case, depending on two major
criteria: the *Platform* (Android or iOS?) you target, and the *Model-type*
(floating-point or quantized?) that you are trying to accelerate.

### Delegates by Platform

#### Cross-platform (Android & iOS)

* **GPU delegate** - The GPU delegate can be used on both Android and iOS. It
    is optimized to run 32-bit and 16-bit float based models where a GPU is
    available. It also supports 8-bit quantized models and provides GPU
    performance on par with their float versions. For details on the GPU
    delegate, see [TensorFlow Lite on GPU](gpu_advanced.md). For step-by-step
    tutorials on using the GPU delegate with Android and iOS, see
    [TensorFlow Lite GPU Delegate Tutorial](gpu.md).

#### Android

* **NNAPI delegate for newer Android devices** - The NNAPI delegate can be
    used to accelerate models on Android devices with a GPU, DSP, and/or NPU
    available. It is available on Android 8.1 (API level 27) and higher. For an
    overview of the NNAPI delegate, step-by-step instructions and best
    practices, see [TensorFlow Lite NNAPI delegate](nnapi.md).
* **Hexagon delegate for older Android devices** - The Hexagon delegate can be
    used to accelerate models on Android devices with a Qualcomm Hexagon DSP.
    It can be used on devices running older versions of Android that do not
    support NNAPI. See
    [TensorFlow Lite Hexagon delegate](hexagon_delegate.md) for more detail;
    a sketch of picking between NNAPI and Hexagon at runtime follows this list.
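Since NNAPI availability depends on the Android version, one reasonable pattern
is to prefer the NNAPI delegate on Android 8.1 and higher and to try the
Hexagon delegate on older devices, falling back to the CPU if neither applies.
The Java sketch below illustrates that decision; the exact policy is up to your
application, it assumes the Hexagon delegate libraries are packaged with the
app, and in real code you should keep references to the delegate objects so
they can be closed once the interpreter is released.

```
import android.content.Context;
import android.os.Build;
import org.tensorflow.lite.HexagonDelegate;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.nnapi.NnApiDelegate;

public class AndroidDelegateSelector {
  /** Prefers NNAPI on Android 8.1+, tries Hexagon on older devices, else plain CPU. */
  static Interpreter.Options buildOptions(Context context) {
    Interpreter.Options options = new Interpreter.Options();
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O_MR1) {
      // Android 8.1 (API level 27) or higher: NNAPI is available.
      options.addDelegate(new NnApiDelegate());
    } else {
      try {
        // Older devices: try the Hexagon delegate (requires a Qualcomm Hexagon DSP).
        options.addDelegate(new HexagonDelegate(context));
      } catch (UnsupportedOperationException e) {
        // No Hexagon DSP or missing libraries: fall back to the default CPU kernels.
      }
    }
    return options;
  }
}
```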
#### iOS

* **Core ML delegate for newer iPhones and iPads** - For newer iPhones and
    iPads where the Neural Engine is available, you can use the Core ML
    delegate to accelerate inference for 32-bit or 16-bit floating-point
    models. The Neural Engine is available on Apple mobile devices with the
    A12 SoC or later. For an overview of the Core ML delegate and step-by-step
    instructions, see [TensorFlow Lite Core ML delegate](coreml_delegate.md).

### Delegates by model type

Each accelerator is designed with a certain bit-width of data in mind. If you
provide a floating-point model to a delegate that only supports 8-bit quantized
operations (such as the [Hexagon delegate](hexagon_delegate.md)), it will
reject all of its operations and the model will run entirely on the CPU. To
avoid such surprises, the table below provides an overview of delegate support
based on model type:

**Model Type**                                                                                           | **GPU** | **NNAPI** | **Hexagon** | **CoreML**
-------------------------------------------------------------------------------------------------------- | ------- | --------- | ----------- | ----------
Floating-point (32 bit)                                                                                  | Yes     | Yes       | No          | Yes
[Post-training float16 quantization](post_training_float16_quant.ipynb)                                 | Yes     | No        | No          | Yes
[Post-training dynamic range quantization](post_training_quant.ipynb)                                   | Yes     | Yes       | No          | No
[Post-training integer quantization](post_training_integer_quant.ipynb)                                 | Yes     | Yes       | Yes         | No
[Quantization-aware training](http://www.tensorflow.org/model_optimization/guide/quantization/training) | Yes     | Yes       | Yes         | No

### Validating performance

The information in this section acts as a rough guideline for shortlisting the
delegates that could improve your application. However, it is important to note
that each delegate has a pre-defined set of operations it supports, and may
perform differently depending on the model and device; for example, the
[NNAPI delegate](nnapi.md) may choose to use Google's Edge-TPU on a Pixel phone
while utilizing a DSP on another device. Therefore, it is usually recommended
that you perform some benchmarking to gauge how useful a delegate is for your
needs. This also helps justify the binary size increase associated with
attaching a delegate to the TensorFlow Lite runtime.

TensorFlow Lite has extensive performance and accuracy-evaluation tooling that
can empower developers to be confident about using delegates in their
applications. These tools are discussed in the next section.

## Tools for Evaluation

### Latency & memory footprint

TensorFlow Lite’s
[benchmark tool](https://www.tensorflow.org/lite/performance/measurement) can be
used with suitable parameters to estimate model performance, including average
inference latency, initialization overhead, memory footprint, etc. This tool
supports multiple flags to figure out the best delegate configuration for your
model. For instance, `--gpu_backend=gl` can be specified with `--use_gpu` to
measure GPU execution with OpenGL. The complete list of supported delegate
parameters is defined in the
[detailed documentation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar).

Here’s an example run for a quantized model with GPU via `adb`:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_v1_224_quant.tflite \
  --use_gpu=true
```

You can download a pre-built version of this tool for Android (64-bit ARM
architecture)
[here](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_benchmark_model.apk)
([more details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/android)).
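If you just want a quick, in-app sanity check before reaching for the benchmark
tool, you can also time `Interpreter.run()` yourself. The Java sketch below is
such a rough measurement; the warm-up and iteration counts are arbitrary, and
the dedicated benchmark tool above remains the more reliable option for careful
measurements.

```
import java.nio.MappedByteBuffer;
import org.tensorflow.lite.Delegate;
import org.tensorflow.lite.Interpreter;

public class QuickLatencyCheck {
  /** Returns the average latency of Interpreter.run() in microseconds (sketch only). */
  static long averageLatencyMicros(
      MappedByteBuffer modelBuffer, Delegate delegate, float[][] input, float[][] output) {
    Interpreter.Options options = new Interpreter.Options();
    if (delegate != null) {
      options.addDelegate(delegate);
    }
    Interpreter interpreter = new Interpreter(modelBuffer, options);

    // Warm-up runs: delegate initialization and first-run allocations are not representative.
    for (int i = 0; i < 5; i++) {
      interpreter.run(input, output);
    }

    // Timed runs.
    int runs = 50;
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
      interpreter.run(input, output);
    }
    long elapsedMicros = (System.nanoTime() - start) / 1000;

    interpreter.close();
    return elapsedMicros / runs;
  }
}
```

Calling this once with `null` (CPU baseline) and once with, for example, a
`GpuDelegate` gives a first impression of whether a delegate helps on a
particular device.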
### Accuracy & correctness

Delegates usually perform computations at a different precision than their CPU
counterparts. As a result, there is a (usually minor) accuracy tradeoff
associated with utilizing a delegate for hardware acceleration. Note that this
isn't *always* true; for example, since the GPU uses floating-point precision to
run quantized models, there might be a slight precision improvement (for
example, <1% Top-5 improvement in ILSVRC image classification).

TensorFlow Lite has two types of tooling to measure how accurately a delegate
behaves for a given model: *Task-Based* and *Task-Agnostic*. All the tools
described in this section support the
[advanced delegation parameters](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar)
used by the benchmarking tool from the previous section. Note that the
sub-sections below focus on *delegate evaluation* (Does the delegate perform the
same as the CPU?) rather than model evaluation (Is the model itself good for the
task?).

#### Task-Based Evaluation

TensorFlow Lite has tools to evaluate correctness on two image-based tasks:

* [ILSVRC 2012](http://image-net.org/challenges/LSVRC/2012/) (Image
    Classification) with
    [top-K accuracy](https://en.wikipedia.org/wiki/Evaluation_measures_\(information_retrieval\)#Precision_at_K)

* [COCO Object Detection (w/ bounding boxes)](https://cocodataset.org/#detection-2020)
    with
    [mean Average Precision (mAP)](https://en.wikipedia.org/wiki/Evaluation_measures_\(information_retrieval\)#Mean_average_precision)

Prebuilt binaries of these tools (Android, 64-bit ARM architecture), along with
documentation, can be found here:

* [ImageNet Image Classification](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_imagenet_image_classification)
    ([More details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/imagenet_image_classification))
* [COCO Object Detection](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_coco_object_detection)
    ([More details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/coco_object_detection))

The example below demonstrates
[image classification evaluation](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/imagenet_image_classification)
with NNAPI utilizing Google's Edge-TPU on a Pixel 4:

```
adb shell /data/local/tmp/run_eval \
  --model_file=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --ground_truth_images_path=/data/local/tmp/ilsvrc_images \
  --ground_truth_labels=/data/local/tmp/ilsvrc_validation_labels.txt \
  --model_output_labels=/data/local/tmp/model_output_labels.txt \
  --output_file_path=/data/local/tmp/accuracy_output.txt \
  --use_nnapi=true \
  --nnapi_accelerator_name=google-edgetpu \
  --num_images=0  # Run on all images.
```

The expected output is a list of Top-K metrics from 1 to 10:

```
Top-1 Accuracy: 0.733333
Top-2 Accuracy: 0.826667
Top-3 Accuracy: 0.856667
Top-4 Accuracy: 0.87
Top-5 Accuracy: 0.89
Top-6 Accuracy: 0.903333
Top-7 Accuracy: 0.906667
Top-8 Accuracy: 0.913333
Top-9 Accuracy: 0.92
Top-10 Accuracy: 0.923333
```

#### Task-Agnostic Evaluation

For tasks where there isn't an established on-device evaluation tool, or if you
are experimenting with custom models, TensorFlow Lite has the
[Inference Diff](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks/inference_diff)
tool. (An Android, 64-bit ARM architecture binary is available
[here](https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_eval_inference_diff).)

Inference Diff compares TensorFlow Lite execution (in terms of latency &
output-value deviation) in two settings:

* Single-threaded CPU Inference
* User-defined Inference - defined by
    [these parameters](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar)

To do so, the tool generates random Gaussian data and passes it through two
TFLite Interpreters - one running single-threaded CPU kernels, and the other
parameterized by the user's arguments.

It measures the latency of both, as well as the absolute difference between the
output tensors from each Interpreter, on a per-element basis.

For a model with a single output tensor, the output might look like this:

```
Num evaluation runs: 50
Reference run latency: avg=84364.2(us), std_dev=12525(us)
Test run latency: avg=7281.64(us), std_dev=2089(us)
OutputDiff[0]: avg_error=1.96277e-05, std_dev=6.95767e-06
```

What this means is that for the output tensor at index `0`, the elements from
the CPU output differ from the delegate output by an average of `1.96e-05`.

Note that interpreting these numbers requires deeper knowledge of the model and
of what each output tensor signifies. If it's a simple regression that
determines some sort of score or embedding, the difference should be low
(otherwise it's an error with the delegate). However, outputs like the
'detection class' from SSD models are a little harder to interpret. For example,
the tool might show a difference here, but that may not mean something is really
wrong with the delegate: consider two (fake) classes, "TV (ID: 10)" and
"Monitor (ID: 20)". If a delegate is slightly off the golden truth and shows
Monitor instead of TV, the output diff for this tensor might be as high as
20 - 10 = 10.
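When feeding your own inputs is easier than driving the command-line tool, the
same idea can be approximated directly in application code. The Java sketch
below runs one input through a single-threaded CPU interpreter and a delegated
interpreter and reports the mean absolute per-element difference; it assumes a
single float output tensor of shape `[1, N]`, and all names are illustrative.

```
import java.nio.MappedByteBuffer;
import org.tensorflow.lite.Delegate;
import org.tensorflow.lite.Interpreter;

public class QuickOutputDiff {
  /** Mean absolute per-element difference between CPU and delegated outputs (sketch only). */
  static float meanAbsoluteDiff(MappedByteBuffer modelBuffer, Delegate delegate, float[][] input) {
    // Reference interpreter: single-threaded CPU kernels.
    Interpreter cpuInterpreter =
        new Interpreter(modelBuffer, new Interpreter.Options().setNumThreads(1));
    // Test interpreter: same model with the delegate attached.
    Interpreter delegatedInterpreter =
        new Interpreter(modelBuffer, new Interpreter.Options().addDelegate(delegate));

    // Assumes a single output tensor of shape [1, numElements].
    int numElements = cpuInterpreter.getOutputTensor(0).shape()[1];
    float[][] cpuOutput = new float[1][numElements];
    float[][] delegatedOutput = new float[1][numElements];

    cpuInterpreter.run(input, cpuOutput);
    delegatedInterpreter.run(input, delegatedOutput);

    // Accumulate the absolute per-element differences.
    float totalDiff = 0f;
    for (int i = 0; i < numElements; i++) {
      totalDiff += Math.abs(cpuOutput[0][i] - delegatedOutput[0][i]);
    }

    cpuInterpreter.close();
    delegatedInterpreter.close();
    return totalDiff / numElements;
  }
}
```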