| Name | Date | Size | #Lines | LOC |
| ---- | ---- | ---- | ------ | --- |
| cl/ | 23-Nov-2023 | - | 19,664 | 15,448 |
| common/ | 23-Nov-2023 | - | 53,043 | 41,434 |
| gl/ | 23-Nov-2023 | - | 22,685 | 16,234 |
| java/src/main/ | 23-Nov-2023 | - | 401 | 220 |
| metal/ | 23-Nov-2023 | - | 7,843 | 6,304 |
| BUILD | 23-Nov-2023 | 9.1 KiB | 270 | 254 |
| README.md | 23-Nov-2023 | 5.9 KiB | 167 | 133 |
| api.cc | 23-Nov-2023 | 7.1 KiB | 205 | 166 |
| api.h | 23-Nov-2023 | 12.6 KiB | 402 | 217 |
| delegate.cc | 23-Nov-2023 | 18.1 KiB | 468 | 384 |
| delegate.h | 23-Nov-2023 | 5.4 KiB | 134 | 40 |
| gl_delegate.cc | 23-Nov-2023 | 20 KiB | 511 | 405 |
| gl_delegate.h | 23-Nov-2023 | 5.3 KiB | 135 | 40 |
| metal_delegate.h | 23-Nov-2023 | 2.7 KiB | 75 | 28 |
| metal_delegate.mm | 23-Nov-2023 | 29.5 KiB | 726 | 668 |
| metal_delegate_internal.h | 23-Nov-2023 | 1.8 KiB | 42 | 11 |
| spi.h | 23-Nov-2023 | 2.6 KiB | 87 | 44 |

README.md

# TFLite on GPU

TensorFlow Lite (TFLite) supports several hardware accelerators.  This document
describes how to use the GPU backend using the TFLite delegate APIs on Android
and iOS.

GPUs are designed for high throughput on massively parallelizable workloads.
They are therefore well suited to deep neural nets, which consist of a large
number of operators, each working on input tensor(s) that can easily be divided
into smaller workloads executed in parallel, typically resulting in lower
latency.  In the best case, inference on the GPU may become fast enough for
real-time applications that were previously out of reach.

GPUs perform their computation with 16-bit or 32-bit floating-point numbers
and, unlike CPUs, do not require quantization for optimal performance.  If
quantizing your neural network was not an option because of the accuracy loss
it would cause, that concern disappears when running the model on the GPU.

Another benefit of GPU inference is power efficiency.  GPUs carry out
computations in a highly efficient and optimized way, consuming less power and
generating less heat than the same task run on a CPU.

TFLite on GPU supports the following ops in 16-bit and 32-bit float precision:

* `ADD v1`
* `AVERAGE_POOL_2D v1`
* `CONCATENATION v1`
* `CONV_2D v1`
* `DEPTHWISE_CONV_2D v1-2`
* `EXP v1`
* `FULLY_CONNECTED v1`
* `LOGISTIC v1`
* `LSTM v2 (Basic LSTM only)`
* `MAX_POOL_2D v1`
* `MAXIMUM v1`
* `MINIMUM v1`
* `MUL v1`
* `PAD v1`
* `PRELU v1`
* `RELU v1`
* `RELU6 v1`
* `RESHAPE v1`
* `RESIZE_BILINEAR v1-3`
* `SOFTMAX v1`
* `STRIDED_SLICE v1`
* `SUB v1`
* `TRANSPOSE_CONV v1`

## Basic Usage

**Note:** The following section describes example usage of the Android GPU
delegate with C++.  For other languages and platforms, please see
[the documentation](https://www.tensorflow.org/lite/performance/gpu).

Using TFLite on GPU is as simple as getting the GPU delegate via
`TfLiteGpuDelegateV2Create()` and then passing it to
`Interpreter::ModifyGraphWithDelegate()` instead of calling
`Interpreter::AllocateTensors()`:

```c++
////////
// Set up interpreter.
auto model = FlatBufferModel::BuildFromFile(model_path);
ops::builtin::BuiltinOpResolver op_resolver;
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(*model, op_resolver)(&interpreter);

////////
// NEW: Prepare GPU delegate.
auto* delegate = TfLiteGpuDelegateV2Create(/*default options=*/nullptr);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return;

////////
// Run inference.
WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
if (interpreter->Invoke() != kTfLiteOk) return;
ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));

////////
// Clean up.
TfLiteGpuDelegateV2Delete(delegate);
```

*IMPORTANT:* When calling `Interpreter::ModifyGraphWithDelegate()` or
`Interpreter::Invoke()`, the caller must have an `EGLContext` current on the
calling thread, and `Interpreter::Invoke()` must be called with that same
`EGLContext`.  If no such `EGLContext` exists, the delegate will create one
internally, but the developer must then ensure that `Interpreter::Invoke()` is
always called from the same thread on which
`Interpreter::ModifyGraphWithDelegate()` was called.
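
One simple way to satisfy this constraint is to confine every delegate-related
call to a single dedicated thread.  The sketch below illustrates the idea,
reusing the names from the Basic Usage example above; `kNumFrames` is a
hypothetical frame count, and error handling is kept minimal.

```c++
#include <thread>

// Minimal sketch: create the delegate, modify the graph, and run every
// Invoke() on one thread, so all calls share the same (possibly
// delegate-created) EGLContext.
std::thread gpu_thread([&] {
  auto* delegate = TfLiteGpuDelegateV2Create(/*default options=*/nullptr);
  if (interpreter->ModifyGraphWithDelegate(delegate) == kTfLiteOk) {
    for (int i = 0; i < kNumFrames; ++i) {  // kNumFrames: hypothetical
      WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
      if (interpreter->Invoke() != kTfLiteOk) break;
      ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));
    }
  }
  TfLiteGpuDelegateV2Delete(delegate);
});
gpu_thread.join();
```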

## Building and Runtime

The TFLite GPU backend uses OpenGL ES 3.1 compute shaders or OpenCL on Android.
A typical build command looks like:

```sh
bazel build --config android_arm64 //path/to/your:project
```

On iOS, Metal shaders, which were introduced with iOS 8, are used instead.
Thus, compilation flags should look like:

```sh
bazel build --config ios_fat //path/to/your:project
```

## Advanced Usage: Delegate Options

There are GPU options that can be set and passed on to
`TfLiteGpuDelegateV2Create()`.  When the options are set to `nullptr`, as shown
in Basic Usage, they translate to:

```c++
const TfLiteGpuDelegateOptionsV2 kDefaultOptions =
    TfLiteGpuDelegateOptionsV2Default();
```

Similarly, for `TFLGpuDelegateCreate()`:

```c++
const TFLGpuDelegateOptions kDefaultOptions = {
  .allow_precision_loss = false,
  .wait_type = TFLGpuDelegateWaitTypePassive,
  .enable_quantization = false,
};
```

While it is convenient to just supply `nullptr`, it is recommended to set the
options explicitly to avoid unexpected behavior if the default values change.
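
For the Android delegate, one way to do this is to start from the defaults and
override only the fields you care about.  The following is a sketch; the field
name follows `TfLiteGpuDelegateOptionsV2` in `delegate.h`, so verify it against
your TFLite version:

```c++
// Sketch: explicit options for the Android GPU delegate, starting from the
// library defaults so that unspecified fields keep well-defined values.
TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
options.is_precision_loss_allowed = 1;  // allow internal FP16 computation

auto* delegate = TfLiteGpuDelegateV2Create(&options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return;
// ... run inference as before, then:
TfLiteGpuDelegateV2Delete(delegate);
```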

*IMPORTANT:* Note that the default options may not be the fastest.  For faster
execution, you may want to set `allow_precision_loss` to `true` so that the GPU
performs FP16 computation internally, and set `wait_type` to
`TFLGpuDelegateWaitTypeAggressive` to keep the GPU from entering sleep mode.
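
Applied to the iOS Metal delegate, that advice looks roughly like the following
sketch (using the option fields and wait types shown above):

```c++
// Sketch: Metal delegate configured for speed, per the note above.
TFLGpuDelegateOptions options = {
  .allow_precision_loss = true,                   // FP16 math internally
  .wait_type = TFLGpuDelegateWaitTypeAggressive,  // keep the GPU awake
  .enable_quantization = false,
};
auto* delegate = TFLGpuDelegateCreate(&options);
// ... ModifyGraphWithDelegate(delegate), Invoke(), and finally:
TFLGpuDelegateDelete(delegate);
```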

## Tips and Tricks

* Some operations that are trivial on the CPU may be costly on the GPU.  One
  class of such operations is the various forms of reshape (including
  `BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, etc.).  If those ops
  were inserted into the network only for the network architect's logical
  convenience, it is worth removing them for performance.

* On the GPU, tensor data is sliced into groups of 4 channels.  Thus, a
  computation on a tensor of shape `[B, H, W, 5]` will perform about the same
  as on a tensor of shape `[B, H, W, 8]`, but significantly worse than on
  `[B, H, W, 4]` (see the short illustration after this list).

* In that sense, if the camera hardware supports image frames in RGBA, feeding
  that 4-channel input is significantly faster, since a memory copy (from
  3-channel RGB to 4-channel RGBX) can be avoided.

* For performance [best practices](https://www.tensorflow.org/lite/performance/best_practices), consider re-training your classifier with a
  mobile-optimized network architecture.  That is a significant part of
  optimization for on-device inference.
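
To make the 4-channel slicing point concrete: the amount of work grows with
`ceil(C / 4)`, so 5 and 8 channels cost about the same while 4 channels costs
roughly half.  A small illustrative snippet (not a TFLite API):

```c++
#include <iostream>

// Channels are processed in groups of 4, so the work scales with ceil(C / 4).
int NumChannelSlices(int channels) { return (channels + 3) / 4; }

int main() {
  std::cout << NumChannelSlices(4) << "\n";  // 1 slice
  std::cout << NumChannelSlices(5) << "\n";  // 2 slices -- same as C = 8
  std::cout << NumChannelSlices(8) << "\n";  // 2 slices
  return 0;
}
```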

## Publication

*   [On-Device Neural Net Inference with Mobile GPUs](https://arxiv.org/abs/1907.01989)
    *   Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan
        Shieh, Fabio Riccardi, Raman Sarokin, Andrei Kulik, and Matthias
        Grundmann
    *   CVPR Workshop
        [Efficient Deep Learning for Computer Vision (ECV2019)](https://sites.google.com/corp/view/ecv2019)