VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver is included as an
option as of the 2016-02-09 Raspbian release using ``raspi-config``.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.
 
This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support.  The driver makes no use of the closed source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.
 
GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.
 
OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

GLES 2.0 and vc4 don't have ``GL_UNSIGNED_INT`` index buffers. To support
them in vc4, we create a shadow copy of your index buffer with the
indices truncated to 2 bytes. This is incorrect (and will fail an
assertion in debug builds of Mesa) if any of the indices are >65535. To
fix that, we would need to detect this case and rewrite the index
buffer and vertex buffers to do a series of draws each with small
indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.
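
As a rough illustration of the multi-draw approach, the following Python
sketch splits a triangle list with large indices into batches whose
indices fit in 16 bits. The function name ``split_indexed_triangles`` and
the batch layout are hypothetical and are not part of Mesa or of any GL
API. A real application would copy each batch's vertices into its own
vertex buffer (or rebind the attributes) and issue one
``GL_UNSIGNED_SHORT`` draw per batch.

```python
def split_indexed_triangles(indices, max_vertices=65536):
    """Split a triangle list into batches with 16-bit-safe local indices.

    Returns a list of (vertex_ids, local_indices) pairs. vertex_ids lists
    the original vertex numbers to copy into that batch's vertex buffer;
    local_indices refers to positions within vertex_ids.
    """
    if len(indices) % 3 != 0:
        raise ValueError("triangle list length must be a multiple of 3")
    if max_vertices < 3:
        raise ValueError("a batch must be able to hold one triangle")

    batches = []
    remap = {}        # original vertex id -> local index in current batch
    vertex_ids = []   # original vertex ids used by the current batch
    local = []        # remapped indices of the current batch

    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        new_vertices = {v for v in tri if v not in remap}
        # Start a new batch if this triangle would overflow the current one.
        if len(remap) + len(new_vertices) > max_vertices:
            batches.append((vertex_ids, local))
            remap, vertex_ids, local = {}, [], []
        for v in tri:
            if v not in remap:
                remap[v] = len(vertex_ids)
                vertex_ids.append(v)
            local.append(remap[v])

    if local:
        batches.append((vertex_ids, local))
    return batches
```

The greedy split keeps triangles in their original order, which matters
when blending or depth testing depends on draw order. It may duplicate a
vertex across batches; that costs some memory but keeps every index below
65536.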
 
* Occlusion queries

The VC4 hardware has no support for occlusion queries.  GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
"we want the functions to be present everywhere, but we want it to be
optional for hardware to support it." Sadly, Gallium doesn't yet allow
the driver to report 0 query bits.

* Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and
points like desktop GL. If the front/back modes matched, we could rewrite
the index buffer to the new primitive type, but we don't. If the
front/back modes don't match, we would need to run the vertex shader in
software, classify the prims, write new index buffers, and emit
(possibly many) new draw calls to rasterize the new prims in the same
order.
 
Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's GitLab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem.  Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best.  If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.
 
Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive:
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.
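
To get a feel for how many tiles a frame involves, the Python sketch
below computes the tile grid for a framebuffer and the tiles overlapped by
a pixel rectangle. The helper names are hypothetical, and binning by
bounding box is a simplification made here for illustration; it is not a
description of how the hardware binner decides tile coverage.

```python
def tile_size(msaa=False):
    """Tile edge length in pixels: 32 with MSAA, 64 without."""
    return 32 if msaa else 64


def tile_grid(width, height, msaa=False):
    """Number of tile columns and rows needed to cover the framebuffer."""
    size = tile_size(msaa)
    # Ceiling division: partially covered tiles at the edges still count.
    return (width + size - 1) // size, (height + size - 1) // size


def tiles_for_bbox(x0, y0, x1, y1, width, height, msaa=False):
    """Tile coordinates overlapped by the rectangle [x0, x1) x [y0, y1).

    The rectangle is clamped to the framebuffer first; an empty result
    means nothing would be binned.
    """
    size = tile_size(msaa)
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, width), min(y1, height)
    if x0 >= x1 or y0 >= y1:
        return []
    return [(tx, ty)
            for ty in range(y0 // size, (y1 - 1) // size + 1)
            for tx in range(x0 // size, (x1 - 1) // size + 1)]
```

For a 1920x1080 target, ``tile_grid(1920, 1080)`` gives ``(30, 17)``, or
510 tiles, while ``tile_grid(1920, 1080, msaa=True)`` gives ``(60, 34)``,
or 2040 tiles. Each tile needs its own command list, which is one reason
MSAA raises the setup cost.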
 
Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's
more efficient for VC4 if you emit a ``glClear()`` of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side ``memset()`` of the tile
buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

If all draw calls for the frame use a ``glScissor()`` covering only
part of the screen, then we can skip setting up the tiles outside that
area, which means a little less memory used setting up the empty bins,
and a lot less memory used loading/storing the unchanged tiles.

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

If we don't know who might use the contents of the framebuffer's depth
or color in the future, then we have to store it for later. If you use
``glInvalidateFramebuffer()`` before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this is unimplemented.
 
* Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms. For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like::

  if (index == 0)
    return array[0]
  else if (index == 1)
    return array[1]
  ...

This is very expensive as we probably have to execute every branch of
every IF statement due to it being a SIMD machine. So, it is
recommended (if you can) to avoid non-uniform non-constant array
indexing.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.
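
The cost of the ladder can be seen in a toy Python model of a SIMD
machine, where each lane carries its own index and every rung is evaluated
for the whole vector under a mask. The function ``simd_if_ladder`` and its
step counter are hypothetical and only illustrate the idea; they do not
model the real QPU instruction set.

```python
def simd_if_ladder(array, indices):
    """Select array[indices[lane]] for every lane using a masked IF ladder.

    Returns (result, steps). steps counts how many rungs were executed for
    the vector as a whole, which is always len(array) regardless of which
    elements the lanes actually wanted.
    """
    result = [None] * len(indices)
    steps = 0
    for k, value in enumerate(array):
        # One rung: compare every lane's index against the constant k.
        mask = [index == k for index in indices]
        steps += 1
        # Only lanes whose mask bit is set keep the value from this rung.
        for lane, active in enumerate(mask):
            if active:
                result[lane] = value
    return result, steps
```

With a four-element array, ``simd_if_ladder([10, 20, 30, 40], [3, 0, 0, 1])``
returns ``([40, 10, 10, 20], 4)``: all four rungs run even though no lane
asked for element 2.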
 
* Increasing GPU memory: increase the CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB.  To increase this, pass
an additional parameter on the kernel command line.  Edit the boot
partition's ``cmdline.txt`` to add::

  cma=256M@256M

``cmdline.txt`` is a single line with whitespace-separated parameters.

The first value is the size of the pool and the second value is
the start address of the pool. The pool size can be increased further,
but it must fit into the memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Also, this
reduces the memory available to Linux.
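
The size and start address arithmetic can be checked before rebooting
with a few lines of Python. This is a minimal sketch: ``parse_cma`` and
``cma_fits`` are hypothetical helpers, they only understand the
``<size>M@<start>M`` form shown above (the kernel accepts other forms),
and they treat "below" as "at most", which is an assumption.

```python
import re


def parse_cma(param):
    """Parse '<size>M@<start>M', with or without a 'cma=' prefix, into MiB."""
    match = re.fullmatch(r"(?:cma=)?(\d+)M@(\d+)M", param.strip())
    if match is None:
        raise ValueError(f"unsupported cma parameter: {param!r}")
    return int(match.group(1)), int(match.group(2))


def cma_fits(param, ram_mib):
    """Return True if the pool ends at or before ram_mib.

    ram_mib is the limit from the text: 1024 for Pi 2, 3, 3+ and 512 for
    Pi B, B+, Zero, Zero W. Treating the limit as inclusive is an
    assumption of this sketch.
    """
    size, start = parse_cma(param)
    return start + size <= ram_mib
```

For example, ``cma_fits("cma=256M@256M", 1024)`` is ``True``, while
``cma_fits("512M@768M", 1024)`` is ``False`` because the pool would end at
1280M.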
 
* Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
decoding, or to 64 if you do.
 
Performance debugging
---------------------

* Step 1: Known issues

The first tool to look at is running your application with the
environment variable ``VC4_DEBUG=perf`` set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with ``top`` while your application is running. Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it's not very high, is the GPU staying busy? We don't have
a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
means that the GPU is currently busy processing some rendering job.
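
The register comparison can be scripted. The Python sketch below parses
text captured from ``v3d_regs`` and applies the rule above. The exact line
format of that file differs between kernel versions, so the regular
expression only assumes that each relevant line contains the register name
and ends with a hexadecimal value; ``parse_thread_regs`` and ``gpu_busy``
are hypothetical helper names.

```python
import re

# Assumed line shape: "...CT0CA ... 0x<hex>" with the value last on the line.
_REG_LINE = re.compile(r"(CT[01][CE]A)\b.*?(0x[0-9a-fA-F]+)\s*$")


def parse_thread_regs(text):
    """Extract the CT0CA/CT0EA/CT1CA/CT1EA values from a register dump."""
    regs = {}
    for line in text.splitlines():
        match = _REG_LINE.search(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def gpu_busy(text):
    """True if either control list's current address differs from its end.

    Raises KeyError if one of the four registers is missing from the dump.
    """
    regs = parse_thread_regs(text)
    return (regs["CT0CA"] != regs["CT0EA"]
            or regs["CT1CA"] != regs["CT1EA"])
```

A single sample only tells you whether the GPU was busy at that instant;
sampling the file repeatedly and counting the busy fraction gives a rough
utilization estimate.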
 
* sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is
an excellent tool for debugging: sysprof. Install it, run it as root (so
you can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (``main()`` and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants -- where all that time actually went. On the
other hand, the lower left shows callers -- double-clicking those
selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them. I've found that on ARM, the
debug packages are not enough, and if someone could determine what is
necessary for callgraphs in debugging, that would be really helpful.
 
* perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're
probably ping-ponging between the two. Most cases of this would be
noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to ``VC4_DEBUG=perf``)::

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time
(very useful!), use the ``-a`` flag instead of a particular command
name. Just press ``^C`` when you're done capturing data.

At exit, you'll have ``perf.data`` in the current directory. You can print
out the results with::

    perf report | less
 
* Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
performance counters in OpenGL. Install apitrace, and trace your
application with::

    apitrace trace <application>          # for GLX applications
    apitrace trace -a egl <application>   # for EGL applications

Once you've captured a trace, you can see what counters are available
and replay it while looking at some of those counters::

    apitrace replay <application>.trace --list-metrics

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you've found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ``^-G`` to jump to that call
number and ``^-L`` to look up the GL state at that call.
 
shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa.  On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization.  Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces.  Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

  ./run.py > before
  (cd ../mesa && make install)
  ./run.py > after
  ./report.py before after
 
Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi.  They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver.  That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers with NDA access to Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU.  The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.
