VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver is included as an
option as of the 2016-02-09 Raspbian release using ``raspi-config``.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.

This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support. The driver makes no use of the closed source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
them in vc4, we create a shadow copy of your index buffer with the
indices truncated to 2 bytes. This is incorrect (and will fail an
assertion in debug builds of Mesa) if any of the indices were >65535. To
fix that, we would need to detect this case and rewrite the index
buffer and vertex buffers to do a series of draws each with small
indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.

* Occlusion queries

The VC4 hardware has no support for occlusion queries. GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from
``glGetQueryiv(GL_SAMPLES_PASSED, GL_QUERY_COUNTER_BITS)``.
This is absurd, but it's how OpenGL handles
"we want the functions to be present everywhere, but we want it to be
optional for hardware to support it." Sadly, gallium doesn't yet allow
the driver to report 0 query bits.

* Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and
points like desktop GL. If the front and back modes matched, we could
rewrite the index buffer to the new primitive type, but we don't. If
the front and back modes don't match, we would need to run the vertex
shader in software, classify the prims, write new index buffers, and emit
(possibly many) new draw calls to rasterize the new prims in the same
order.

Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's gitlab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem. Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best. If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.

Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile.
Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive:
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.

Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's
more efficient for VC4 if you emit a ``glClear()`` of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side ``memset()`` of the tile
buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

If all draw calls for the frame use a ``glScissor()`` that covers only
part of the screen, then we can skip setting up the tiles outside that
area, which means a little less memory used setting up the empty bins,
and a lot less memory used loading/storing the unchanged tiles.

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

If we don't know who might use the contents of the framebuffer's depth
or color in the future, then we have to store it for later.
If you use
``glInvalidateFramebuffer()`` before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this optimization is currently unimplemented.

* Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms. For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like::

    if (index == 0)
        return array[0]
    else if (index == 1)
        return array[1]
    ...

This is very expensive, as we probably have to execute every branch of
every IF statement because the GPU is a SIMD machine. So, it is
recommended (if you can) to avoid non-uniform non-constant array
indexing.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.

* Increasing GPU memory: increase CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB. To increase this, pass
an additional parameter on the kernel command line. Edit the boot
partition's ``cmdline.txt`` to add::

    cma=256M@256M

``cmdline.txt`` is a single line with whitespace-separated parameters.

The first value is the size of the pool and the second value is
the start address of the pool. The pool size can be increased further,
but it must fit into the memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). This also
reduces the memory available to Linux.

* Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
decoding, or to 64 if you do.

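
As a minimal sketch of the firmware memory setting described above, the
relevant line in ``config.txt`` for a system that does not need video
decoding could look like the following. Only the ``gpu_mem`` key and the
values 16 and 64 come from the text above; the value is interpreted in
megabytes, and the firmware reads the file at boot, so a reboot is needed
for the change to take effect::

    gpu_mem=16

Use ``gpu_mem=64`` instead if video decoding is needed.
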

Performance debugging
---------------------

* Step 1: Known issues

The first tool to look at is running your application with the
environment variable ``VC4_DEBUG=perf`` set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with ``top`` while your application is running. Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it's not very high, is the GPU staying busy? We don't have
a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
useful (the path depends on where debugfs is mounted, commonly
``/sys/kernel/debug``). If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``,
that means that the GPU is currently busy processing some rendering job.

* sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is
an excellent tool for debugging: sysprof. Install, run as root (so you
can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (``main()`` and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants -- where all that time actually went. On the
other hand, the lower left shows callers -- double-clicking those
selects that as the symbol to view, instead.


Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them. I've found that on arm, the
debug packages are not enough, and if someone could determine what is
necessary for callgraphs in debugging, that would be really helpful.

* perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're
probably ping-ponging between the two. Most cases of this would be
noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to ``VC4_DEBUG=perf``)::

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time
(very useful!), use the ``-a`` flag instead of a particular command
name. Just ``^C`` when you're done capturing data.

At exit, you'll have ``perf.data`` in the current directory. You can print
out the results with::

    perf report | less

* Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
performance counters in OpenGL. Install apitrace, and trace your
application with::

    apitrace trace <application>          # for GLX applications
    apitrace trace -a egl <application>   # for EGL applications

Once you've captured a trace, you can see what counters are available
and replay it while looking at some of those counters::

    apitrace replay <application>.trace --list-metrics

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

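
As a sketch of the comma-separated form, a replay capturing two counters
might look like the following. ``<counter-a>`` and ``<counter-b>`` are
hypothetical placeholders, not real counter names; substitute names
printed by ``--list-metrics`` on your system::

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:<counter-a>,<counter-b>
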

Once you've found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ``^-G`` to jump to that call
number and ``^-L`` to look up the GL state at that call.

shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa. On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization. Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces. Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

    ./run.py > before
    (cd ../mesa && make install)
    ./run.py > after
    ./report.py before after

Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi. They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver. That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers with NDA access from Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU. The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver, with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.