VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and can be enabled with
``raspi-config`` as of the 2016-02-09 Raspbian release.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.

This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support.  The driver makes no use of the closed source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
them in vc4, we create a shadow copy of your index buffer with the
indices truncated to 2 bytes. This is incorrect (and will fail an
assertion in debug builds of Mesa) if any of the indices are greater
than 65535. To fix that, we would need to detect this case and rewrite
the index buffer and vertex buffers to do a series of draws, each with
small indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.
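
As a rough illustration of the multiple-draw approach, the following
Python sketch splits a triangle list into chunks whose indices each fit
in ``GL_UNSIGNED_SHORT``. The function name ``split_for_u16`` and its
return layout are hypothetical and not part of Mesa or OpenGL. The
application would still need to upload the vertices named by each
``vertex_map`` and issue one draw call per chunk.

.. code-block:: python

    def split_for_u16(indices, max_vertices=65536):
        """Split a flat triangle list into chunks with 16-bit local indices.

        Returns a list of (vertex_map, local_indices) pairs, where
        vertex_map[i] is the original vertex that local index i refers to.
        """
        if len(indices) % 3 != 0:
            raise ValueError("expected three indices per triangle")
        if max_vertices < 3:
            raise ValueError("a chunk must hold at least one triangle")

        draws = []
        remap = {}        # original index -> local index in the current chunk
        vertex_map = []   # local index -> original index
        local = []        # remapped indices for the current chunk

        for i in range(0, len(indices), 3):
            tri = indices[i:i + 3]
            new = {v for v in tri if v not in remap}
            # Start a new chunk if this triangle would overflow the range.
            if len(vertex_map) + len(new) > max_vertices:
                draws.append((vertex_map, local))
                remap, vertex_map, local = {}, [], []
            for v in tri:
                if v not in remap:
                    remap[v] = len(vertex_map)
                    vertex_map.append(v)
                local.append(remap[v])

        if local:
            draws.append((vertex_map, local))
        return draws

Because the indices are renumbered per chunk, a mesh that references
vertex 70000 but uses fewer than 65536 distinct vertices still fits in
a single draw.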

* Occlusion queries

The VC4 hardware has no support for occlusion queries.  GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
"we want the functions to be present everywhere, but we want it to be
optional for hardware to support it." Sadly, gallium doesn't yet allow
the driver to report 0 query bits.

* Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and
points like desktop GL. If the front and back polygon modes match, we
could rewrite the index buffer to the new primitive type, but we don't.
If they don't match, we would need to run the vertex shader in
software, classify the prims, write new index buffers, and emit
(possibly many) new draw calls to rasterize the new prims in the same
order.

Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's GitLab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem.  Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best.  If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.

Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive:
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.
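
To get a feel for the scale involved, here is a minimal sketch that
computes how many tiles cover a render target, using the 64x64 and
32x32 tile sizes given above. The helper name ``tile_grid`` is made up
for illustration, and the driver's actual bookkeeping may differ.

.. code-block:: python

    def tile_grid(width, height, msaa=False):
        """Return (columns, rows) of tiles covering a render target."""
        tile = 32 if msaa else 64  # tile edge in pixels
        # Round up: a partially covered tile still has to be processed.
        columns = -(-width // tile)
        rows = -(-height // tile)
        return columns, rows

    columns, rows = tile_grid(1920, 1080)
    print(columns * rows)  # number of per-tile command lists

For a 1920x1080 target this gives 30 x 17 tiles without MSAA and
60 x 34 with MSAA, so enabling MSAA quadruples the number of tiles.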

Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's
more efficient for VC4 if you emit a ``glClear()`` of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side ``memset()`` of the tile
buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

If all draw calls for the frame are made with a ``glScissor()`` covering
only part of the screen, then we can skip setting up the tiles outside
that area, which means a little less memory used setting up the empty
bins, and a lot less memory used loading/storing the unchanged tiles.
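
The potential saving can be estimated by counting the tiles a scissor
rectangle overlaps. This is only a sketch: ``tiles_touched`` is a
hypothetical helper, it assumes tiles are aligned to the top-left
corner of the render target, and it uses the 64-pixel non-MSAA tile
size.

.. code-block:: python

    def tiles_touched(x, y, w, h, tile=64):
        """Count tiles overlapped by a scissor rectangle given in pixels."""
        if w <= 0 or h <= 0:
            return 0
        first_col, last_col = x // tile, (x + w - 1) // tile
        first_row, last_row = y // tile, (y + h - 1) // tile
        return (last_col - first_col + 1) * (last_row - first_row + 1)

    full = tiles_touched(0, 0, 1920, 1080)
    partial = tiles_touched(0, 0, 640, 360)
    print(full, partial)

Note that a scissor rectangle that is not aligned to tile boundaries
touches more tiles than its area suggests: a 64x64 rectangle at
offset (32, 32) overlaps four tiles rather than one.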

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

If we don't know who might use the contents of the framebuffer's depth
or color in the future, then we have to store it for later. If you use
``glInvalidateFramebuffer()`` before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this is unimplemented.

* Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms. For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like::

  if (index == 0)
    return array[0]
  else if (index == 1)
    return array[1]
  ...

This is very expensive as we probably have to execute every branch of
every IF statement due to it being a SIMD machine. So, it is
recommended (if you can) to avoid non-uniform non-constant array
indexing.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.

* Increasing GPU memory: increase the CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB.  To increase this, pass
an additional parameter on the kernel command line.  Edit the boot
partition's ``cmdline.txt`` to add::

  cma=256M@256M

``cmdline.txt`` is a single line with whitespace separated parameters.

The first value is the size of the pool and the second value is
the start address of the pool. The pool size can be increased further,
but it must fit into memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Note that this
also reduces the memory available to Linux.
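
The size + start constraint can be checked before rebooting with a
small script. The following is a sketch only: ``parse_size`` and
``cma_fits`` are hypothetical helpers, only the simple
``cma=<size>@<start>`` form with ``K``/``M``/``G`` suffixes is handled,
and "fits" is interpreted here as the pool ending at or before the
memory limit.

.. code-block:: python

    import re

    UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

    def parse_size(text):
        """Convert a string such as '256M' to a number of bytes."""
        match = re.fullmatch(r"(\d+)([KMG])", text.upper())
        if not match:
            raise ValueError(f"unrecognised size: {text!r}")
        return int(match.group(1)) * UNITS[match.group(2)]

    def cma_fits(param, ram_limit):
        """Check that a 'cma=<size>@<start>' pool ends within ram_limit bytes."""
        key, _, value = param.partition("=")
        size_text, sep, start_text = value.partition("@")
        if key != "cma" or not sep:
            raise ValueError("expected cma=<size>@<start>")
        # Assumption: the pool may end exactly at the limit.
        return parse_size(start_text) + parse_size(size_text) <= ram_limit

    print(cma_fits("cma=256M@256M", parse_size("1024M")))  # Pi 2, 3, 3+
    print(cma_fits("cma=512M@256M", parse_size("512M")))   # Pi B, B+, Zero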

* Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video decoding,
or to 64 if you do.

Performance debugging
---------------------

* Step 1: Known issues

The first tool to look at is running your application with the
environment variable ``VC4_DEBUG=perf`` set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with ``top`` while your application is running. Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it's not very high, is the GPU staying busy? We don't have
a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
means that the GPU is currently busy processing some rendering job.
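
The register comparison can be scripted. The sketch below is
illustrative only: the helper names are made up, the sample dump is
invented, and the exact line format of ``v3d_regs`` is an assumption
(each line is taken to contain a register name and to end with its
value in hexadecimal).

.. code-block:: python

    import re

    PAIRS = (("CT0CA", "CT0EA"), ("CT1CA", "CT1EA"))

    def parse_regs(text):
        """Map control list register names to their values."""
        regs = {}
        for line in text.splitlines():
            name = re.search(r"CT[01][CE]A", line)
            values = re.findall(r"0x[0-9a-fA-F]+", line)
            if name and values:
                # Assumption: the last hex number on the line is the value.
                regs[name.group(0)] = int(values[-1], 16)
        return regs

    def gpu_busy(text):
        """True if either control list's current address differs from its end."""
        regs = parse_regs(text)
        return any(regs[ca] != regs[ea] for ca, ea in PAIRS)

    # Invented sample; real dumps contain many more registers.
    sample = """\
    V3D_CT0CA = 0x00001000
    V3D_CT0EA = 0x00001000
    V3D_CT1CA = 0x00002040
    V3D_CT1EA = 0x00002080
    """
    print(gpu_busy(sample))

A single read is only a snapshot, so sampling the file several times
gives a better picture of how busy the GPU is.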

* sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is
an excellent tool for debugging: sysprof. Install, run as root (so you
can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (main() and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants -- where all that time actually went. On the
other hand, the lower left shows callers -- double-clicking those
selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them. I've found that on ARM, the
debug packages are not enough, and if someone could determine what is
necessary to get working callgraphs there, that would be really helpful.

* perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're
probably ping-ponging between the two. Most cases of this would be
noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to ``VC4_DEBUG=perf``)::

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time
(very useful!), use the ``-a`` flag instead of a particular command
name. Just press ``^C`` when you're done capturing data.

At exit, you'll have ``perf.data`` in the current directory. You can print
out the results with::

    perf report | less

* Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
performance counters in OpenGL. Install apitrace, and trace your
application with::

    apitrace trace <application>          # for GLX applications
    apitrace trace -a egl <application>   # for EGL applications

Once you've captured a trace, you can see what counters are available
and replay it while looking at some of those counters::

    apitrace replay <application>.trace --list-metrics

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you've found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ``^-G`` to jump to that call
number and ``^-L`` to look up the GL state at that call.

shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa.  On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization.  Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces.  Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

  ./run.py > before
  (cd ../mesa && make install)
  ./run.py > after
  ./report.py before after

Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi.  They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver.  That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers under NDA with Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU.  The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver, with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.