# TensorFlow Lite GPU delegate

[TensorFlow Lite](https://www.tensorflow.org/lite) supports several hardware
accelerators. This document describes how to use the GPU backend using the
TensorFlow Lite delegate APIs on Android and iOS.

GPUs are designed to have high throughput for massively parallelizable
workloads. Thus, they are well-suited for deep neural nets, which consist of a
huge number of operators, each working on some input tensor(s) that can be
easily divided into smaller workloads and carried out in parallel, typically
resulting in lower latency. In the best case, inference on the GPU may run
fast enough for real-time applications that were previously not possible.

Unlike CPUs, GPUs compute with 16-bit or 32-bit floating point numbers and do
not require quantization for optimal performance. The delegate does accept
8-bit quantized models, but the calculation will be performed in floating
point numbers. Refer to the [advanced documentation](gpu_advanced.md) for
details.

Another benefit of GPU inference is its power efficiency. GPUs carry out
computations in a very efficient and optimized manner, consuming less power
and generating less heat than when the same task is run on a CPU.

## Demo app tutorials

The easiest way to try out the GPU delegate is to follow the tutorials below,
which walk through building our classification demo applications with GPU
support. The GPU code is only available in binary form for now; it will be
open-sourced soon. Once you understand how to get our demos working, you can
try this out on your own custom models.

### Android (with Android Studio)

For a step-by-step tutorial, watch the
[GPU Delegate for Android](https://youtu.be/Xkhgre8r5G0) video.

Note: This requires OpenCL or OpenGL ES (3.1 or higher).

#### Step 1. Clone the TensorFlow source code and open it in Android Studio

```sh
git clone https://github.com/tensorflow/tensorflow
```

#### Step 2. Edit `app/build.gradle` to use the nightly GPU AAR

Add the `tensorflow-lite-gpu` package alongside the existing `tensorflow-lite`
package in the existing `dependencies` block.

```
dependencies {
    ...
    implementation 'org.tensorflow:tensorflow-lite:2.3.0'
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.3.0'
}
```

#### Step 3. Build and run

Run → Run ‘app’. When you run the application you will see a button for
enabling the GPU. Switch from the quantized model to a float model and then
tap GPU to run on the GPU.

![running android gpu demo and switch to gpu](images/android_gpu_demo.gif)

### iOS (with Xcode)

For a step-by-step tutorial, watch the
[GPU Delegate for iOS](https://youtu.be/a5H4Zwjp49c) video.

Note: This requires Xcode v10.1 or later.

#### Step 1. Get the demo source code and make sure it compiles

Follow our iOS Demo App [tutorial](https://www.tensorflow.org/lite/demo_ios).
This will get you to a point where the unmodified iOS camera demo is working
on your phone.

#### Step 2. Modify the Podfile to use the TensorFlow Lite GPU CocoaPod

Starting with the 2.3.0 release, the GPU delegate is excluded from the pod by
default to reduce the binary size. You can include it by specifying a subspec.
For the `TensorFlowLiteSwift` pod:

```ruby
pod 'TensorFlowLiteSwift/Metal', '~> 0.0.1-nightly',
```

OR

```ruby
pod 'TensorFlowLiteSwift', '~> 0.0.1-nightly', :subspecs => ['Metal']
```

You can do the same for `TensorFlowLiteObjC` or `TensorFlowLiteC` if you want
to use the Objective-C (from the 2.4.0 release) or C API.

<div>
  <devsite-expandable>
    <h4 class="showalways">Before 2.3.0 release</h4>
    <h4>Until TensorFlow Lite 2.0.0</h4>
    <p>
      We have built a binary CocoaPod that includes the GPU delegate. To switch
      the project to use it, modify the
      `tensorflow/tensorflow/lite/examples/ios/camera/Podfile` file to use the
      `TensorFlowLiteGpuExperimental` pod instead of `TensorFlowLite`.
    </p>
    <pre class="prettyprint lang-ruby notranslate" translate="no"><code>
    target 'YourProjectName'
      # pod 'TensorFlowLite', '1.12.0'
      pod 'TensorFlowLiteGpuExperimental'
    </code></pre>
    <h4>Until TensorFlow Lite 2.2.0</h4>
    <p>
      From TensorFlow Lite 2.1.0 to 2.2.0, the GPU delegate is included in the
      `TensorFlowLiteC` pod. You can choose between `TensorFlowLiteC` and
      `TensorFlowLiteSwift` depending on the language.
    </p>
  </devsite-expandable>
</div>

#### Step 3. Enable the GPU delegate

To enable the code that will use the GPU delegate, you will need to change
`TFLITE_USE_GPU_DELEGATE` from 0 to 1 in `CameraExampleViewController.h`.

```c
#define TFLITE_USE_GPU_DELEGATE 1
```

#### Step 4. Build and run the demo app

After following the previous step, you should be able to run the app.

#### Step 5. Release mode

While Step 4 ran in debug mode, to get better performance you should change to
a release build with the appropriate optimal Metal settings. To edit these
settings, go to `Product > Scheme > Edit Scheme...`. Select `Run`. On the
`Info` tab, change `Build Configuration` from `Debug` to `Release` and uncheck
`Debug executable`.

![setting up release](images/iosdebug.png)

Then click the `Options` tab and change `GPU Frame Capture` to `Disabled` and
`Metal API Validation` to `Disabled`.

![setting up metal options](images/iosmetal.png)

Lastly, make sure to select Release-only builds on 64-bit architecture. Under
`Project navigator -> tflite_camera_example -> PROJECT -> tflite_camera_example
-> Build Settings` set `Build Active Architecture Only > Release` to Yes.

![setting up release options](images/iosrelease.png)

## Trying the GPU delegate on your own model

### Android

Note: The TensorFlow Lite Interpreter must be created on the same thread where
it is run. Otherwise, the error `TfLiteGpuDelegate Invoke: GpuDelegate must run
on the same thread where it was initialized.` may occur.

There are two ways to invoke model acceleration depending on whether you are
using
[Android Studio ML Model Binding](../inference_with_metadata/codegen#acceleration)
or the TensorFlow Lite Interpreter.
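With ML Model Binding, the generated model wrapper can be pointed at the GPU
through `Model.Options` from the TensorFlow Lite Support library. The snippet
below is a minimal Kotlin sketch, not taken from the demo app: the wrapper
class name `MobilenetModel` is hypothetical and depends on the `.tflite` file
you import into Android Studio.

```kotlin
import android.content.Context
import org.tensorflow.lite.support.model.Model

// "MobilenetModel" is a hypothetical class generated by ML Model Binding; the
// actual name depends on the .tflite file imported into Android Studio.
fun runOnGpu(context: Context) {
    // Build options that ask the generated wrapper to run on the GPU.
    val options = Model.Options.Builder()
        .setDevice(Model.Device.GPU)
        .build()

    val model = MobilenetModel.newInstance(context, options)

    // ... run inference through the generated API ...

    // Release resources when done.
    model.close()
}
```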
#### TensorFlow Lite Interpreter

Look at the demo to see how to add the delegate. In your application, add the
AAR as above, import the `org.tensorflow.lite.gpu.GpuDelegate` module, and use
the `addDelegate` function to register the GPU delegate to the interpreter:

<div>
  <devsite-selector>
    <section>
      <h3>Kotlin</h3>
      <p><pre class="prettyprint lang-kotlin">
    import org.tensorflow.lite.Interpreter
    import org.tensorflow.lite.gpu.CompatibilityList
    import org.tensorflow.lite.gpu.GpuDelegate

    val compatList = CompatibilityList()

    val options = Interpreter.Options().apply{
        if(compatList.isDelegateSupportedOnThisDevice){
            // if the device has a supported GPU, add the GPU delegate
            val delegateOptions = compatList.bestOptionsForThisDevice
            this.addDelegate(GpuDelegate(delegateOptions))
        } else {
            // if the GPU is not supported, run on 4 threads
            this.setNumThreads(4)
        }
    }

    val interpreter = Interpreter(model, options)

    // Run inference
    writeToInput(input)
    interpreter.run(input, output)
    readFromOutput(output)
      </pre></p>
    </section>
    <section>
      <h3>Java</h3>
      <p><pre class="prettyprint lang-java">
    import org.tensorflow.lite.Interpreter;
    import org.tensorflow.lite.gpu.CompatibilityList;
    import org.tensorflow.lite.gpu.GpuDelegate;

    // Initialize interpreter with GPU delegate
    Interpreter.Options options = new Interpreter.Options();
    CompatibilityList compatList = new CompatibilityList();

    if(compatList.isDelegateSupportedOnThisDevice()){
        // if the device has a supported GPU, add the GPU delegate
        GpuDelegate.Options delegateOptions = compatList.getBestOptionsForThisDevice();
        GpuDelegate gpuDelegate = new GpuDelegate(delegateOptions);
        options.addDelegate(gpuDelegate);
    } else {
        // if the GPU is not supported, run on 4 threads
        options.setNumThreads(4);
    }

    Interpreter interpreter = new Interpreter(model, options);

    // Run inference
    writeToInput(input);
    interpreter.run(input, output);
    readFromOutput(output);
      </pre></p>
    </section>
  </devsite-selector>
</div>
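The GPU delegate holds native GPU resources, so it is good practice to release
both the interpreter and the delegate once inference is finished. The following
is a minimal Kotlin sketch, assuming you keep a reference to the delegate
instead of creating it inline as above; `model` stands for your loaded model
file or buffer.

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate

// Keep a reference to the delegate so it can be released later.
val gpuDelegate = GpuDelegate()
val options = Interpreter.Options().addDelegate(gpuDelegate)
val interpreter = Interpreter(model, options)

// ... run inference ...

// Release native resources once inference is finished.
interpreter.close()
gpuDelegate.close()
```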
### iOS

Note: The GPU delegate can also use the C API for Objective-C code. Prior to
the TensorFlow Lite 2.4.0 release, this was the only option.

<div>
  <devsite-selector>
    <section>
      <h3>Swift</h3>
      <p><pre class="prettyprint lang-swift">
    import TensorFlowLite

    // Load model ...

    // Initialize TensorFlow Lite interpreter with the GPU delegate.
    let delegate = MetalDelegate()
    if let interpreter = try? Interpreter(modelPath: modelPath,
                                          delegates: [delegate]) {
      // Run inference ...
    }
      </pre></p>
    </section>
    <section>
      <h3>Objective-C</h3>
      <p><pre class="prettyprint lang-objc">
    // Import module when using CocoaPods with module support
    @import TFLTensorFlowLite;

    // Or import following headers manually
    #import "tensorflow/lite/objc/apis/TFLMetalDelegate.h"
    #import "tensorflow/lite/objc/apis/TFLTensorFlowLite.h"

    // Initialize GPU delegate
    TFLMetalDelegate* metalDelegate = [[TFLMetalDelegate alloc] init];

    // Initialize interpreter with model path and GPU delegate
    TFLInterpreterOptions* options = [[TFLInterpreterOptions alloc] init];
    NSError* error = nil;
    TFLInterpreter* interpreter = [[TFLInterpreter alloc]
                                    initWithModelPath:modelPath
                                              options:options
                                            delegates:@[ metalDelegate ]
                                                error:&error];
    if (error != nil) { /* Error handling... */ }

    if (![interpreter allocateTensorsWithError:&error]) { /* Error handling... */ }
    if (error != nil) { /* Error handling... */ }

    // Run inference ...
      </pre></p>
    </section>
    <section>
      <h3>C (Until 2.3.0)</h3>
      <p><pre class="prettyprint lang-c">
    #include "tensorflow/lite/c/c_api.h"
    #include "tensorflow/lite/delegates/gpu/metal_delegate.h"

    // Initialize model
    TfLiteModel* model = TfLiteModelCreateFromFile(model_path);

    // Initialize interpreter with GPU delegate
    TfLiteInterpreterOptions* options = TfLiteInterpreterOptionsCreate();
    TfLiteDelegate* metal_delegate = TFLGpuDelegateCreate(nil);  // default config
    TfLiteInterpreterOptionsAddDelegate(options, metal_delegate);
    TfLiteInterpreter* interpreter = TfLiteInterpreterCreate(model, options);
    TfLiteInterpreterOptionsDelete(options);

    TfLiteInterpreterAllocateTensors(interpreter);

    NSMutableData *input_data = [NSMutableData dataWithLength:input_size * sizeof(float)];
    NSMutableData *output_data = [NSMutableData dataWithLength:output_size * sizeof(float)];
    TfLiteTensor* input = TfLiteInterpreterGetInputTensor(interpreter, 0);
    const TfLiteTensor* output = TfLiteInterpreterGetOutputTensor(interpreter, 0);

    // Run inference
    TfLiteTensorCopyFromBuffer(input, input_data.bytes, input_data.length);
    TfLiteInterpreterInvoke(interpreter);
    TfLiteTensorCopyToBuffer(output, output_data.mutableBytes, output_data.length);

    // Clean up
    TfLiteInterpreterDelete(interpreter);
    TFLGpuDelegateDelete(metal_delegate);
    TfLiteModelDelete(model);
      </pre></p>
    </section>
  </devsite-selector>
</div>

## Supported Models and Ops

With the release of the GPU delegate, we included a handful of models that can
be run on the backend:

* [MobileNet v1 (224x224) image classification](https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobilenet_v1_1.0_224.tflite)
  <br /><i>(image classification model designed for mobile and embedded based vision applications)</i>
* [DeepLab segmentation (257x257)](https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/deeplabv3_257_mv_gpu.tflite)
  <br /><i>(image segmentation model that assigns semantic labels (e.g., dog, cat, car) to every pixel in the input image)</i>
* [MobileNet SSD object detection](https://ai.googleblog.com/2018/07/accelerated-training-and-inference-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobile_ssd_v2_float_coco.tflite)
  <br /><i>(object detection model that detects multiple objects with bounding boxes)</i>
* [PoseNet for pose estimation](https://github.com/tensorflow/tfjs-models/tree/master/posenet) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/multi_person_mobilenet_v1_075_float.tflite)
  <br /><i>(vision model that estimates the poses of one or more persons in an image or video)</i>

To see a full list of supported ops, please see the
[advanced documentation](gpu_advanced.md).

## Non-supported models and ops

If some of the ops are not supported by the GPU delegate, the framework will
only run a part of the graph on the GPU and the remaining part on the CPU.
Due to the high cost of CPU/GPU synchronization, a split execution mode like
this will often result in slower performance than running the whole network on
the CPU alone. In this case, the user will get a warning like:

```none
WARNING: op code #42 cannot be handled by this delegate.
```

We did not provide a callback for this failure, as this is not a true run-time
failure, but something that the developer can observe while trying to get the
network to run on the delegate.

## Tips for optimization

Some operations that are trivial on the CPU may have a high cost on the GPU.
One class of such operations is various forms of reshape operations, including
`BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, and so forth. If those
ops are inserted into the network only for the network architect's logical
thinking, it is worth removing them for performance.

On the GPU, tensor data is sliced into 4-channel slices. Thus, a computation on
a tensor of shape `[B,H,W,5]` will perform about the same as on a tensor of
shape `[B,H,W,8]`, but significantly worse than on `[B,H,W,4]`.

In that sense, if the camera hardware supports image frames in RGBA, feeding
that 4-channel input is significantly faster, as a memory copy (from 3-channel
RGB to 4-channel RGBX) can be avoided (see the sketch at the end of this
section).

For best performance, do not hesitate to retrain your classifier with a
mobile-optimized network architecture. That is a significant part of
optimization for on-device inference.
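To illustrate the channel-alignment point above, the following is a minimal
Kotlin sketch. It assumes an Android `ARGB_8888` bitmap source and a
hypothetical model whose input tensor expects four channels; it keeps all four
channels while packing the input buffer, so no separate RGB-to-RGBX repacking
pass is needed.

```kotlin
import android.graphics.Bitmap
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Packs an ARGB_8888 bitmap into a [1, height, width, 4] float buffer,
// keeping all four channels so the GPU-friendly 4-channel layout is preserved.
fun bitmapToRgbaBuffer(bitmap: Bitmap): ByteBuffer {
    val pixels = IntArray(bitmap.width * bitmap.height)
    bitmap.getPixels(pixels, 0, bitmap.width, 0, 0, bitmap.width, bitmap.height)

    // 4 channels per pixel, 4 bytes per float value.
    val buffer = ByteBuffer.allocateDirect(pixels.size * 4 * 4)
        .order(ByteOrder.nativeOrder())
    for (pixel in pixels) {
        buffer.putFloat(((pixel shr 16) and 0xFF) / 255.0f)  // R
        buffer.putFloat(((pixel shr 8) and 0xFF) / 255.0f)   // G
        buffer.putFloat((pixel and 0xFF) / 255.0f)            // B
        buffer.putFloat(((pixel shr 24) and 0xFF) / 255.0f)  // A, kept as the 4th channel
    }
    buffer.rewind()
    return buffer
}

// Usage (assuming a model with a [1, H, W, 4] float input):
// interpreter.run(bitmapToRgbaBuffer(bitmap), output)
```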