# TensorFlow Lite GPU delegate

[TensorFlow Lite](https://www.tensorflow.org/lite) supports several hardware
accelerators. This document describes how to use the GPU backend using the
TensorFlow Lite delegate APIs on Android and iOS.

GPUs are designed to have high throughput for massively parallelizable
workloads. Thus, they are well-suited for deep neural nets, which consist of a
huge number of operators, each working on some input tensor(s) that can be
easily divided into smaller workloads and carried out in parallel, typically
resulting in lower latency. In the best case, inference on the GPU may run fast
enough for real-time applications that were previously not possible.

Unlike CPUs, GPUs compute with 16-bit or 32-bit floating point numbers and do
not require quantization for optimal performance. The delegate does accept 8-bit
quantized models, but the calculation will be performed in floating point
numbers. Refer to the [advanced documentation](gpu_advanced.md) for details.

Another benefit of GPU inference is its power efficiency. GPUs carry out the
computations in a very efficient and optimized manner, so they consume less
power and generate less heat than the same task run on a CPU.

## Demo app tutorials

The easiest way to try out the GPU delegate is to follow the tutorials below,
which go through building our classification demo applications with GPU support.
The GPU code is only binary for now; it will be open-sourced soon. Once you
understand how to get our demos working, you can try this out on your own custom
models.

### Android (with Android Studio)

For a step-by-step tutorial, watch the
[GPU Delegate for Android](https://youtu.be/Xkhgre8r5G0) video.

Note: This requires OpenCL or OpenGL ES (3.1 or higher).

#### Step 1. Clone the TensorFlow source code and open it in Android Studio

```sh
git clone https://github.com/tensorflow/tensorflow
```

#### Step 2. Edit `app/build.gradle` to use the nightly GPU AAR

Add the `tensorflow-lite-gpu` package alongside the existing `tensorflow-lite`
package in the existing `dependencies` block.

```
dependencies {
    ...
    implementation 'org.tensorflow:tensorflow-lite:2.3.0'
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.3.0'
}
```
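
If you want the nightly GPU AAR mentioned in the heading rather than a fixed
release, one possible configuration is sketched below. It assumes the nightly
AARs are published to the OSS Sonatype snapshot repository as version
`0.0.0-nightly-SNAPSHOT`; check the current TensorFlow Lite Android
documentation if this has changed.

```
allprojects {
    repositories {
        // Snapshot repository assumed to host the nightly TensorFlow Lite AARs.
        maven {
            url 'https://oss.sonatype.org/content/repositories/snapshots'
        }
    }
}

dependencies {
    ...
    implementation 'org.tensorflow:tensorflow-lite:0.0.0-nightly-SNAPSHOT'
    implementation 'org.tensorflow:tensorflow-lite-gpu:0.0.0-nightly-SNAPSHOT'
}
```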

#### Step 3. Build and run

Select Run → Run ‘app’. When you run the application, you will see a button for
enabling the GPU. Change from a quantized to a float model and then click GPU to
run on the GPU.

![running android gpu demo and switch to gpu](images/android_gpu_demo.gif)

### iOS (with Xcode)

For a step-by-step tutorial, watch the
[GPU Delegate for iOS](https://youtu.be/a5H4Zwjp49c) video.

Note: This requires Xcode v10.1 or later.

#### Step 1. Get the demo source code and make sure it compiles

Follow our iOS Demo App [tutorial](https://www.tensorflow.org/lite/demo_ios).
This will get you to a point where the unmodified iOS camera demo is working on
your phone.

#### Step 2. Modify the Podfile to use the TensorFlow Lite GPU CocoaPod

Starting with the 2.3.0 release, the GPU delegate is excluded from the pod by
default to reduce the binary size. You can include it by specifying a subspec.
For the `TensorFlowLiteSwift` pod:

```ruby
pod 'TensorFlowLiteSwift/Metal', '~> 0.0.1-nightly',
```

OR

```ruby
pod 'TensorFlowLiteSwift', '~> 0.0.1-nightly', :subspecs => ['Metal']
```

You can do similarly for `TensorFlowLiteObjC` or `TensorFlowLiteC` if you want
to use the Objective-C (from the 2.4.0 release) or C API.
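
For example, a minimal sketch for the Objective-C pod, assuming the same
`Metal` subspec naming applies to it, would be:

```ruby
pod 'TensorFlowLiteObjC/Metal', '~> 0.0.1-nightly'
```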

<div>
  <devsite-expandable>
    <h4 class="showalways">Before 2.3.0 release</h4>
    <h4>Until TensorFlow Lite 2.0.0</h4>
    <p>
      We have built a binary CocoaPod that includes the GPU delegate. To switch
      the project to use it, modify the
      `tensorflow/tensorflow/lite/examples/ios/camera/Podfile` file to use the
      `TensorFlowLiteGpuExperimental` pod instead of `TensorFlowLite`.
    </p>
    <pre class="prettyprint lang-ruby notranslate" translate="no"><code>
    target 'YourProjectName'
      # pod 'TensorFlowLite', '1.12.0'
      pod 'TensorFlowLiteGpuExperimental'
    </code></pre>
    <h4>Until TensorFlow Lite 2.2.0</h4>
    <p>
      From TensorFlow Lite 2.1.0 to 2.2.0, the GPU delegate is included in the
      `TensorFlowLiteC` pod. You can choose between `TensorFlowLiteC` and
      `TensorFlowLiteSwift` depending on the language.
    </p>
  </devsite-expandable>
</div>

#### Step 3. Enable the GPU delegate

To enable the code that will use the GPU delegate, you will need to change
`TFLITE_USE_GPU_DELEGATE` from 0 to 1 in `CameraExampleViewController.h`.

```c
#define TFLITE_USE_GPU_DELEGATE 1
```
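
In the demo, this macro guards the delegate setup code. As a rough sketch of the
pattern (not the exact demo code), using the C API shown later in this document
and assuming `options` is the interpreter options object being configured:

```c
#if TFLITE_USE_GPU_DELEGATE
  // Create the Metal GPU delegate with default options and register it with
  // the interpreter options before the interpreter is created.
  TfLiteDelegate* metal_delegate = TFLGpuDelegateCreate(NULL);
  TfLiteInterpreterOptionsAddDelegate(options, metal_delegate);
#endif
```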

#### Step 4. Build and run the demo app

After following the previous step, you should be able to run the app.

#### Step 5. Release mode

While in Step 4 you ran in debug mode, to get better performance you should
switch to a release build with the appropriate optimal Metal settings. To edit
these settings, go to `Product > Scheme > Edit Scheme...`. Select `Run`. On the
`Info` tab, change `Build Configuration` from `Debug` to `Release` and uncheck
`Debug executable`.

![setting up release](images/iosdebug.png)

Then click the `Options` tab and change `GPU Frame Capture` to `Disabled` and
`Metal API Validation` to `Disabled`.

![setting up metal options](images/iosmetal.png)

Lastly, make sure to select Release-only builds on 64-bit architecture. Under
`Project navigator -> tflite_camera_example -> PROJECT -> tflite_camera_example
-> Build Settings` set `Build Active Architecture Only > Release` to Yes.

![setting up release options](images/iosrelease.png)

## Trying the GPU delegate on your own model

### Android

Note: The TensorFlow Lite Interpreter must be created on the same thread where
it is run. Otherwise, the error `TfLiteGpuDelegate Invoke: GpuDelegate must run
on the same thread where it was initialized.` may occur.

There are two ways to invoke model acceleration, depending on whether you are
using
[Android Studio ML Model Binding](../inference_with_metadata/codegen#acceleration)
or the TensorFlow Lite Interpreter; both are shown below.
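
#### Android Studio ML Model Binding

With ML Model Binding, acceleration is configured through the generated model
class rather than through the `Interpreter`. A minimal Kotlin sketch is shown
below; `MyModel` stands in for whatever class Android Studio generated for your
model, and the exact builder methods may vary across versions of the support
library:

```kotlin
import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.support.model.Model

val compatList = CompatibilityList()

val options = if (compatList.isDelegateSupportedOnThisDevice) {
    // if the device has a supported GPU, run the generated model on it
    Model.Options.Builder().setDevice(Model.Device.GPU).build()
} else {
    // if the GPU is not supported, fall back to 4 CPU threads
    Model.Options.Builder().setNumThreads(4).build()
}

// Initialize the generated model class with the options and run inference as usual.
val myModel = MyModel.newInstance(context, options)
```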

#### TensorFlow Lite Interpreter

Look at the demo to see how to add the delegate. In your application, add the
AAR as above, import the `org.tensorflow.lite.gpu.GpuDelegate` module, and use
the `addDelegate` function to register the GPU delegate to the interpreter:

<div>
  <devsite-selector>
    <section>
      <h3>Kotlin</h3>
      <p><pre class="prettyprint lang-kotlin">
    import org.tensorflow.lite.Interpreter
    import org.tensorflow.lite.gpu.CompatibilityList
    import org.tensorflow.lite.gpu.GpuDelegate

    val compatList = CompatibilityList()

    val options = Interpreter.Options().apply{
        if(compatList.isDelegateSupportedOnThisDevice){
            // if the device has a supported GPU, add the GPU delegate
            val delegateOptions = compatList.bestOptionsForThisDevice
            this.addDelegate(GpuDelegate(delegateOptions))
        } else {
            // if the GPU is not supported, run on 4 threads
            this.setNumThreads(4)
        }
    }

    val interpreter = Interpreter(model, options)

    // Run inference
    writeToInput(input)
    interpreter.run(input, output)
    readFromOutput(output)
      </pre></p>
    </section>
    <section>
      <h3>Java</h3>
      <p><pre class="prettyprint lang-java">
    import org.tensorflow.lite.Interpreter;
    import org.tensorflow.lite.gpu.CompatibilityList;
    import org.tensorflow.lite.gpu.GpuDelegate;

    // Initialize interpreter with GPU delegate
    Interpreter.Options options = new Interpreter.Options();
    CompatibilityList compatList = new CompatibilityList();

    if(compatList.isDelegateSupportedOnThisDevice()){
        // if the device has a supported GPU, add the GPU delegate
        GpuDelegate.Options delegateOptions = compatList.getBestOptionsForThisDevice();
        GpuDelegate gpuDelegate = new GpuDelegate(delegateOptions);
        options.addDelegate(gpuDelegate);
    } else {
        // if the GPU is not supported, run on 4 threads
        options.setNumThreads(4);
    }

    Interpreter interpreter = new Interpreter(model, options);

    // Run inference
    writeToInput(input);
    interpreter.run(input, output);
    readFromOutput(output);
      </pre></p>
    </section>
  </devsite-selector>
</div>
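
The GPU delegate holds native resources, so release both the interpreter and
the delegate once inference is finished. A minimal sketch for the Java example
above:

```java
// Clean up once you are done with the interpreter and the delegate.
interpreter.close();
if (gpuDelegate != null) {
  gpuDelegate.close();
}
```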

### iOS

Note: The GPU delegate can also be used with the C API for Objective-C code.
Prior to the TensorFlow Lite 2.4.0 release, this was the only option.

<div>
  <devsite-selector>
    <section>
      <h3>Swift</h3>
      <p><pre class="prettyprint lang-swift">
    import TensorFlowLite

    // Load model ...

    // Initialize TensorFlow Lite interpreter with the GPU delegate.
    let delegate = MetalDelegate()
    if let interpreter = try? Interpreter(modelPath: modelPath,
                                          delegates: [delegate]) {
      // Run inference ...
    }
      </pre></p>
    </section>
    <section>
      <h3>Objective-C</h3>
      <p><pre class="prettyprint lang-objc">
    // Import module when using CocoaPods with module support
    @import TFLTensorFlowLite;

    // Or import following headers manually
    #import "tensorflow/lite/objc/apis/TFLMetalDelegate.h"
    #import "tensorflow/lite/objc/apis/TFLTensorFlowLite.h"

    // Initialize GPU delegate
    TFLMetalDelegate* metalDelegate = [[TFLMetalDelegate alloc] init];

    // Initialize interpreter with model path and GPU delegate
    TFLInterpreterOptions* options = [[TFLInterpreterOptions alloc] init];
    NSError* error = nil;
    TFLInterpreter* interpreter = [[TFLInterpreter alloc]
                                    initWithModelPath:modelPath
                                              options:options
                                            delegates:@[ metalDelegate ]
                                                error:&amp;error];
    if (error != nil) { /* Error handling... */ }

    if (![interpreter allocateTensorsWithError:&amp;error]) { /* Error handling... */ }
    if (error != nil) { /* Error handling... */ }

    // Run inference ...
      </pre></p>
    </section>
    <section>
      <h3>C (Until 2.3.0)</h3>
      <p><pre class="prettyprint lang-c">
    #include "tensorflow/lite/c/c_api.h"
    #include "tensorflow/lite/delegates/gpu/metal_delegate.h"

    // Initialize model
    TfLiteModel* model = TfLiteModelCreateFromFile(model_path);

    // Initialize interpreter with GPU delegate
    TfLiteInterpreterOptions* options = TfLiteInterpreterOptionsCreate();
    TfLiteDelegate* metal_delegate = TFLGpuDelegateCreate(nil);  // default config
    TfLiteInterpreterOptionsAddDelegate(options, metal_delegate);
    TfLiteInterpreter* interpreter = TfLiteInterpreterCreate(model, options);
    TfLiteInterpreterOptionsDelete(options);

    TfLiteInterpreterAllocateTensors(interpreter);

    NSMutableData *input_data = [NSMutableData dataWithLength:input_size * sizeof(float)];
    NSMutableData *output_data = [NSMutableData dataWithLength:output_size * sizeof(float)];
    TfLiteTensor* input = TfLiteInterpreterGetInputTensor(interpreter, 0);
    const TfLiteTensor* output = TfLiteInterpreterGetOutputTensor(interpreter, 0);

    // Run inference
    TfLiteTensorCopyFromBuffer(input, input_data.bytes, input_data.length);
    TfLiteInterpreterInvoke(interpreter);
    TfLiteTensorCopyToBuffer(output, output_data.mutableBytes, output_data.length);

    // Clean up
    TfLiteInterpreterDelete(interpreter);
    TFLGpuDelegateDelete(metal_delegate);
    TfLiteModelDelete(model);
      </pre></p>
    </section>
  </devsite-selector>
</div>
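
To fill in the `// Run inference ...` placeholder in the Swift example, one
possible sketch is shown below; `inputData` is a `Data` buffer you have prepared
to match the model's input tensor, and error handling is elided:

```swift
// Allocate tensors, copy the input in, run the model, and read the output back.
try interpreter.allocateTensors()
try interpreter.copy(inputData, toInputAt: 0)
try interpreter.invoke()
let outputTensor = try interpreter.output(at: 0)
let outputData = outputTensor.data
```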

## Supported Models and Ops

With the release of the GPU delegate, we included a handful of models that can
be run on the backend:

*   [MobileNet v1 (224x224) image classification](https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobilenet_v1_1.0_224.tflite)
    <br /><i>(image classification model designed for mobile and embedded vision applications)</i>
*   [DeepLab segmentation (257x257)](https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/deeplabv3_257_mv_gpu.tflite)
    <br /><i>(image segmentation model that assigns semantic labels (e.g., dog, cat, car) to every pixel in the input image)</i>
*   [MobileNet SSD object detection](https://ai.googleblog.com/2018/07/accelerated-training-and-inference-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobile_ssd_v2_float_coco.tflite)
    <br /><i>(object detection model that detects multiple objects with bounding boxes)</i>
*   [PoseNet for pose estimation](https://github.com/tensorflow/tfjs-models/tree/master/posenet) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/multi_person_mobilenet_v1_075_float.tflite)
    <br /><i>(vision model that estimates the poses of people in an image or video)</i>

To see a full list of supported ops, please see the
[advanced documentation](gpu_advanced.md).

## Non-supported models and ops

If some of the ops are not supported by the GPU delegate, the framework will
only run a part of the graph on the GPU and the remaining part on the CPU. Due
to the high cost of CPU/GPU synchronization, a split execution mode like this
will often result in slower performance than when the whole network is run on
the CPU alone. In this case, the user will get a warning like:

```none
WARNING: op code #42 cannot be handled by this delegate.
```

We did not provide a callback for this failure, as this is not a true run-time
failure, but something that the developer can observe while trying to get the
network to run on the delegate.

## Tips for optimization

Some operations that are trivial on the CPU may have a high cost on the GPU.
One class of such operations is various forms of reshape operations, including
`BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, and so forth. If those ops
are in the network only for the network architect's logical convenience, it is
worth removing them for performance.

On the GPU, tensor data is sliced into 4-channel slices. Thus, a computation on
a tensor of shape `[B,H,W,5]` will perform about the same as on a tensor of
shape `[B,H,W,8]`, but significantly worse than on `[B,H,W,4]`.

In that sense, if the camera hardware supports image frames in RGBA, feeding
that 4-channel input is significantly faster, since a memory copy (from
3-channel RGB to 4-channel RGBX) can be avoided.

For best performance, do not hesitate to retrain your classifier with a
mobile-optimized network architecture. That is a significant part of
optimization for on-device inference.