=====================
Adreno Five Microcode
=====================

.. contents::

.. _afuc-introduction:

Introduction
============

Adreno GPUs prior to 6xx use two microcontrollers to parse the command stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping.  They are relatively simple (basically glorified
register writers) and nearly all of their state lives in a collection
of registers.  I.e. there is no stack, and no memory assigned to
them; any global state, like which bank of context registers is to
be used in the next draw, is stored in a register.

The setup is similar to radeon; in fact Adreno 2xx through 4xx used
basically the same instruction set as r600.  There is a "PFP"
(Prefetch Parser) and an "ME" (Micro Engine, also confusingly referred
to as "PM4").  These make up the "CP" ("Command Parser").  The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP.  Between the PFP and ME is a FIFO ("MEQ").  In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remains the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc".  (No idea what Qualcomm calls
it internally.)

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.

.. _afuc-overview:

Instruction Set Overview
========================

The instruction set is 32-bit, with basic arithmetic ops that can take
either two source registers or one source register and a 16b immediate.

There are 32 registers, although some are special purpose:

- ``$00`` - always reads zero, otherwise seems to be the PC
- ``$01`` - current PM4 packet header
- ``$1c`` - alias ``$rem``, remaining data in packet
- ``$1d`` - alias ``$addr``
- ``$1f`` - alias ``$data``

Branch instructions have a delay slot, so the following instruction
is always executed regardless of whether the branch is taken or not.

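The delay-slot behaviour can be modelled with a few lines of Python.  This is
only a sketch: the tuple-based instruction format, the ``run`` helper, and the
use of absolute instruction indices as jump targets are assumptions made for
illustration, not the real encoding.

.. code-block:: python

  def run(program, max_steps=100):
      """Run a toy program where every jump has one delay slot.

      Each instruction is a tuple: ("mov", dst, imm) or ("jump", target).
      The tuple format and absolute targets are illustrative assumptions.
      """
      regs = {}
      trace = []        # instruction indices in execution order
      pc = 0
      pending = None    # jump target waiting for the delay slot to finish
      for _ in range(max_steps):
          if pc >= len(program):
              break
          op, *args = program[pc]
          trace.append(pc)
          next_pc = pc + 1
          if pending is not None:
              # This instruction is the delay slot; the jump lands after it.
              next_pc, pending = pending, None
          if op == "mov":
              dst, imm = args
              regs[dst] = imm
          elif op == "jump":
              pending = args[0]
          else:
              raise ValueError(f"unknown op: {op}")
          pc = next_pc
      return regs, trace


  program = [
      ("mov", "$02", 1),
      ("jump", 4),
      ("mov", "$03", 2),   # delay slot: still executed
      ("mov", "$04", 3),   # skipped by the jump
      ("mov", "$05", 4),
  ]
  regs, trace = run(program)
  print(trace)  # [0, 1, 2, 4]

The instruction at index 2 runs even though the jump at index 1 is taken,
which is why real firmware often places useful work (or a ``nop``) right
after a branch.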

.. _afuc-alu:

ALU Instructions
================

The following instructions are available:

- ``add``   - add
- ``addhi`` - add + carry (for upper 32b of 64b value)
- ``sub``   - subtract
- ``subhi`` - subtract + carry (for upper 32b of 64b value)
- ``and``   - bitwise AND
- ``or``    - bitwise OR
- ``xor``   - bitwise XOR
- ``not``   - bitwise NOT (no src1)
- ``shl``   - shift-left
- ``ushr``  - unsigned shift-right
- ``ishr``  - signed shift-right
- ``rot``   - rotate-left (like shift-left with wrap-around)
- ``mul8``  - multiply low 8b of two src
- ``min``   - minimum
- ``max``   - maximum
- ``cmp``   - compare two values

The ALU instructions can take either two src registers, or a src
register plus a 16b immediate as the 2nd src, for example::

  add $dst, $src, 0x1234   ; src2 is immed
  add $dst, $src1, $src2   ; src2 is reg

The ``not`` instruction only takes a single source::

  not $dst, $src
  not $dst, 0x1234

.. _afuc-alu-cmp:

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See the explanation in :ref:`afuc-branch`.

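As a rough illustration of how the ``add``/``addhi`` and ``sub``/``subhi``
pairs combine into 64b arithmetic on 32b registers, here is a small Python
model.  Where the hardware keeps the carry is not documented here, so the
sketch passes it around as an explicit value; the function names and the
treatment of the ``subhi`` carry as a borrow are assumptions.

.. code-block:: python

  MASK32 = 0xFFFFFFFF


  def add(a, b):
      """32b add; returns (result, carry_out)."""
      total = (a & MASK32) + (b & MASK32)
      return total & MASK32, total >> 32


  def addhi(a, b, carry):
      """32b add that also consumes the carry from a previous add."""
      return ((a & MASK32) + (b & MASK32) + carry) & MASK32


  def sub(a, b):
      """32b subtract; returns (result, borrow_out)."""
      a &= MASK32
      b &= MASK32
      return (a - b) & MASK32, 1 if a < b else 0


  def subhi(a, b, borrow):
      """32b subtract that also consumes the borrow from a previous sub."""
      return ((a & MASK32) - (b & MASK32) - borrow) & MASK32


  def add64(a_lo, a_hi, b_lo, b_hi):
      lo, carry = add(a_lo, b_lo)       # add   $lo, $a_lo, $b_lo
      hi = addhi(a_hi, b_hi, carry)     # addhi $hi, $a_hi, $b_hi
      return lo, hi


  def sub64(a_lo, a_hi, b_lo, b_hi):
      lo, borrow = sub(a_lo, b_lo)      # sub   $lo, $a_lo, $b_lo
      hi = subhi(a_hi, b_hi, borrow)    # subhi $hi, $a_hi, $b_hi
      return lo, hi


  print([hex(x) for x in add64(0xFFFFFFFF, 0x1, 0x1, 0x2)])  # ['0x0', '0x4']

This mirrors the typical use of the pair in the firmware: a 64b address is
held in two registers and offsets are applied to the low half first.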

.. _afuc-branch:

Branch Instructions
===================

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the src register
against either a small immediate (up to 5 bits) or a specific bit::

  breq $src, b3, #somelabel  ; branch if src & (1 << 3)
  breq $src, 0x3, #somelabel ; branch if src == 3

The branch instructions are encoded with a 16b relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns, for example::

  cmp $04, $02, $03
  breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``.

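To see which bit to test for each relation, the three ``cmp`` result values
can be decoded bit by bit.  The following Python sketch does that; the helper
names are made up for illustration, and whether the hardware compares as
signed or unsigned is not covered here (the model simply compares Python
integers).

.. code-block:: python

  CMP_GT, CMP_EQ, CMP_LT = 0x00, 0x2B, 0x1E


  def cmp(src1, src2):
      """Model of the cmp result values listed in this document."""
      if src1 > src2:
          return CMP_GT
      if src1 == src2:
          return CMP_EQ
      return CMP_LT


  def bit_set(value, bit):
      """Condition tested by the bit-test form of breq."""
      return bool(value & (1 << bit))


  def relations_with_bit(bit):
      """Which cmp outcomes have the given bit set in their result."""
      results = {"gt": CMP_GT, "eq": CMP_EQ, "lt": CMP_LT}
      return sorted(name for name, value in results.items()
                    if (value >> bit) & 1)


  for bit in range(6):
      print(f"b{bit}: {relations_with_bit(bit)}")
  # b0: ['eq']
  # b1: ['eq', 'lt']
  # b2: ['lt']
  # b3: ['eq', 'lt']
  # b4: ['lt']
  # b5: ['eq']

So ``breq`` on ``b1`` gives "less than or equal", ``breq`` on ``b2`` gives
"less than", and the corresponding ``brne`` forms give "greater than" and
"greater than or equal" respectively.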

.. _afuc-call:

Call/Return
===========

Simple subroutines can be implemented with ``call``/``ret``.  The
jump instruction encodes a fixed offset.

  TODO not sure how many levels deep function calls can be nested.
  There isn't really a stack.  There definitely seem to be multiple
  levels of function call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
  f22.


.. _afuc-control:

Config Instructions
===================

These seem to read/write config state in other parts of CP.  In at
least some cases I expect these map to CP registers (but possibly
not directly??)

- ``cread $dst, [$off + addr], flags``
- ``cwrite $src, [$off + addr], flags``

In cases where no offset is needed, ``$00`` is frequently used as
the offset.

For example, the following sequence sets the IB base address and size
from a ``CP_INDIRECT_BUFFER`` packet::

  ; load CP_INDIRECT_BUFFER parameters from cmdstream:
  mov $02, $data   ; low 32b of IB target address
  mov $03, $data   ; high 32b of IB target
  mov $04, $data   ; IB size in dwords

  ; sanity check # of dwords:
  breq $04, 0x0, #l23 (#69, 04a2)

  ; this seems to have something to do with figuring out whether
  ; we are going from RB->IB1 or IB1->IB2 (i.e. so that the
  ; cwrite instructions below update either
  ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)
  and $05, $18, 0x0003
  shl $05, $05, 0x0002

  ; update CP_IBn_BASE_LO/HI/BUFSIZE:
  cwrite $02, [$05 + 0x0b0], 0x8
  cwrite $03, [$05 + 0x0b1], 0x8
  cwrite $04, [$05 + 0x0b2], 0x8

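The ``and``/``shl`` pair in the ``CP_INDIRECT_BUFFER`` sequence computes an
offset that is added to the ``cwrite`` addresses.  A quick Python sketch of
that arithmetic is shown below.  The interpretation of the low two bits of
``$18`` as a selector for the IB level is a guess taken from the comments in
the listing, and the function name is hypothetical.

.. code-block:: python

  # Control offsets used by the three cwrite instructions in the listing.
  IB_BASE_LO, IB_BASE_HI, IB_BUFSIZE = 0x0B0, 0x0B1, 0x0B2


  def ib_control_offsets(reg18):
      """Return the control offsets written for a given value of $18."""
      off = (reg18 & 0x0003) << 2   # and $05, $18, 0x0003 ; shl $05, $05, 0x0002
      return [off + base for base in (IB_BASE_LO, IB_BASE_HI, IB_BUFSIZE)]


  for selector in range(3):
      print(selector, [hex(o) for o in ib_control_offsets(selector)])
  # 0 ['0xb0', '0xb1', '0xb2']
  # 1 ['0xb4', '0xb5', '0xb6']
  # 2 ['0xb8', '0xb9', '0xba']

Each selector value moves the target by four control offsets, so the same
three ``cwrite`` instructions can update a different group of
base/size registers without any branching.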


.. _afuc-reg-access:

Register Access
===============

The special registers ``$addr`` and ``$data`` can be used to write GPU
registers.  For example, to write::

  mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
  mov $data, $03                 ; CP_SCRATCH_REG[0x2]
  mov $data, $04                 ; CP_SCRATCH_REG[0x3]
  ...

Subsequent writes to ``$data`` increment the address of the register
to write, so a sequence of consecutive registers can be written.

To read::

  mov $addr, CP_SCRATCH_REG[0x2]
  mov $03, $addr
  mov $04, $addr

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish.  These banks are
arranged so that bit 11 is zero for bank 0 and one for bank 1.  The ME fw (at
least the version I'm looking at) stores this in ``$17``, so to update
these registers from ME::

  or $addr, $17, VFD_INDEX_OFFSET
  mov $data, $03
  ...

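The auto-incrementing write port and the bank bit can be modelled together in
a few lines of Python.  This is a sketch under assumptions: the
``RegisterPort`` class is hypothetical, and ``EXAMPLE_REG`` is a placeholder
offset rather than the real offset of ``VFD_INDEX_OFFSET``.

.. code-block:: python

  BANK_BIT = 1 << 11     # bit 11 selects bank 1
  EXAMPLE_REG = 0x0123   # placeholder register offset (hypothetical)


  class RegisterPort:
      """Toy model of the $addr/$data register write path."""

      def __init__(self):
          self.regs = {}   # register offset -> last value written
          self.addr = 0

      def set_addr(self, addr):
          # mov $addr, <register>
          self.addr = addr

      def write_data(self, value):
          # mov $data, <value>; the target address then advances by one
          self.regs[self.addr] = value & 0xFFFFFFFF
          self.addr += 1


  reg17 = BANK_BIT                      # ME currently targets bank 1
  port = RegisterPort()
  port.set_addr(reg17 | EXAMPLE_REG)    # or $addr, $17, <register>
  port.write_data(0xAAAA)
  port.write_data(0xBBBB)
  print({hex(k): hex(v) for k, v in port.regs.items()})
  # {'0x923': '0xaaaa', '0x924': '0xbbbb'}

With ``reg17`` set to zero the same two writes would land in bank 0 at
``0x123`` and ``0x124``.
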
Note that PFP doesn't seem to use this approach; instead it does something
like::

  mov $0c, CP_SCRATCH_REG[0x7]
  mov $02, 0x789a   ; value
  cwrite $0c, [$00 + 0x010], 0x8
  cwrite $02, [$00 + 0x011], 0x8

As with the ``$addr``/``$data`` approach, the destination register address
increments on each write.

.. _afuc-mem:

Memory Access
=============

There are no load/store instructions as such.  The microcontrollers
have only indirect memory access via GPU registers.  There are two
possible mechanisms.

Read/Write via CP_NRT Registers
-------------------------------

This seems to be used only by ME.  If PFP were also using it, they would
race with each other.  It seems to be primarily used for small reads.

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``

The address register increments with successive reads or writes.

Memory Write example::

  ; store 64b value in $04+$05 to 64b address in $02+$03
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $addr, CP_ME_NRT_DATA
  mov $data, $04
  mov $data, $05

Memory Read example::

  ; load 64b value from address in $02+$03 into $04+$05
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $04, $addr
  mov $05, $addr

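The NRT mechanism can be pictured as an address register plus a data port
that advances the address on every access.  The Python model below is a
sketch only: the ``NrtPort`` class is hypothetical, and the step of 4 bytes
per dword is an assumption (the text above only says that the address
increments).

.. code-block:: python

  class NrtPort:
      """Toy model of CP_ME_NRT_ADDR_LO/_HI and CP_ME_NRT_DATA."""

      def __init__(self):
          self.memory = {}   # byte address -> 32b dword
          self.addr = 0

      def write_addr(self, lo, hi):
          # Two writes: CP_ME_NRT_ADDR_LO, then CP_ME_NRT_ADDR_HI.
          self.addr = ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)

      def write_data(self, dword):
          # Write to CP_ME_NRT_DATA: store, then advance the address.
          self.memory[self.addr] = dword & 0xFFFFFFFF
          self.addr += 4     # assumed step: one dword

      def read_data(self):
          # Read: fetch, then advance the address.
          value = self.memory.get(self.addr, 0)
          self.addr += 4     # assumed step: one dword
          return value


  port = NrtPort()
  port.write_addr(0x1000, 0x1)     # 64b address 0x1_0000_1000
  port.write_data(0x11111111)      # low dword of the 64b value
  port.write_data(0x22222222)      # high dword of the 64b value

  port.write_addr(0x1000, 0x1)     # rewind to read the value back
  lo, hi = port.read_data(), port.read_data()
  print(hex(lo), hex(hi))          # 0x11111111 0x22222222

Because the address advances on its own, a 64b store needs only one address
setup followed by two data writes, matching the shape of the write example.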

Read via Control Instructions
-----------------------------

This is used by PFP whenever it needs to read memory.  It also seems to be
used by ME for streaming reads (larger amounts of data).  The DMA access
seems to be done by ROQ.

  TODO might also be possible for write access

  TODO some of the control commands might be synchronizing access
  between PFP and ME??

An example from the ``CP_DRAW_INDIRECT`` packet handler::

  mov $07, 0x0004  ; # of dwords to read from draw-indirect buffer
  ; load address of indirect buffer from cmdstream:
  cwrite $data, [$00 + 0x0b8], 0x8
  cwrite $data, [$00 + 0x0b9], 0x8
  ; set # of dwords to read:
  cwrite $07, [$00 + 0x0ba], 0x8
  ...
  ; read parameters from draw-indirect buffer:
  mov $09, $addr
  mov $07, $addr
  cread $12, [$00 + 0x040], 0x8
  ; the start parameter gets written into MEQ, which ME writes
  ; to VFD_INDEX_OFFSET register:
  mov $data, $addr


A6XX NOTES
==========

The ``$14`` register holds global flags set (or checked) by the following
packets:

- ``CP_SKIP_IB2_ENABLE_LOCAL`` - b8
- ``CP_SKIP_IB2_ENABLE_GLOBAL`` - b9
- ``CP_SET_MARKER``

  - ``MODE=GMEM`` - sets b15
  - ``MODE=BLIT2D`` - clears b15, b12, b7

- ``CP_SET_MODE`` - b29+b30
- ``CP_SET_VISIBILITY_OVERRIDE`` - b11, b21, b30?
- ``CP_SET_DRAW_STATE`` - checks b29+b30
- ``CP_COND_REG_EXEC`` - checks b10, which should be the predicate flag?