1=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6   :local:
7
8Introduction
9============
10
11The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
14
15LLVM
16====
17
18.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
26  .. table:: AMDGPU Architectures
27     :name: amdgpu-architecture-table
28
29     ============ ==============================================================
30     Architecture Description
31     ============ ==============================================================
32     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34     ============ ==============================================================
35
36  .. table:: AMDGPU Vendors
37     :name: amdgpu-vendor-table
38
39     ============ ==============================================================
40     Vendor       Description
41     ============ ==============================================================
42     ``amd``      Can be used for all AMD GPU usage.
43     ``mesa3d``   Can be used if the OS is ``mesa3d``.
44     ============ ==============================================================
45
46  .. table:: AMDGPU Operating Systems
47     :name: amdgpu-os-table
48
49     ============== ============================================================
50     OS             Description
51     ============== ============================================================
52     *<empty>*      Defaults to the *unknown* OS.
53     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
54                    such as AMD's ROCm [AMD-ROCm]_.
55     ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
56                    runtime.
57     ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
58                    runtime.
59     ============== ============================================================
60
61  .. table:: AMDGPU Environments
62     :name: amdgpu-environment-table
63
64     ============ ==============================================================
65     Environment  Description
66     ============ ==============================================================
67     *<empty>*    Default.
68     ============ ==============================================================
69
70.. _amdgpu-processors:
71
72Processors
73----------
74
75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76names from both the *Processor* and *Alternative Processor* can be used.
77
78  .. table:: AMDGPU Processors
79     :name: amdgpu-processor-table
80
81     =========== =============== ============ ===== ========= ======= ==================
82     Processor   Alternative     Target       dGPU/ Target    ROCm    Example
83                 Processor       Triple       APU   Features  Support Products
84                                 Architecture       Supported
85                                                    [Default]
86     =========== =============== ============ ===== ========= ======= ==================
87     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
88     -----------------------------------------------------------------------------------
89     ``r600``                    ``r600``     dGPU
90     ``r630``                    ``r600``     dGPU
91     ``rs880``                   ``r600``     dGPU
92     ``rv670``                   ``r600``     dGPU
93     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
94     -----------------------------------------------------------------------------------
95     ``rv710``                   ``r600``     dGPU
96     ``rv730``                   ``r600``     dGPU
97     ``rv770``                   ``r600``     dGPU
98     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
99     -----------------------------------------------------------------------------------
100     ``cedar``                   ``r600``     dGPU
101     ``cypress``                 ``r600``     dGPU
102     ``juniper``                 ``r600``     dGPU
103     ``redwood``                 ``r600``     dGPU
104     ``sumo``                    ``r600``     dGPU
105     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
106     -----------------------------------------------------------------------------------
107     ``barts``                   ``r600``     dGPU
108     ``caicos``                  ``r600``     dGPU
109     ``cayman``                  ``r600``     dGPU
110     ``turks``                   ``r600``     dGPU
111     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
112     -----------------------------------------------------------------------------------
113     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
114     ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
115                 - ``oland``
116                 - ``pitcairn``
117                 - ``verde``
118     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
119     -----------------------------------------------------------------------------------
120     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
121                                                                      - A6 Pro-7050B
122                                                                      - A8-7100
123                                                                      - A8 Pro-7150B
124                                                                      - A10-7300
125                                                                      - A10 Pro-7350B
126                                                                      - FX-7500
127                                                                      - A8-7200P
128                                                                      - A10-7400P
129                                                                      - FX-7600P
130     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
131                                                                      - FirePro W9100
132                                                                      - FirePro S9150
133                                                                      - FirePro S9170
134     ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
135                                                                      - Radeon R9 290x
136                                                                      - Radeon R390
137                                                                      - Radeon R390x
138     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
139                 - ``mullins``                                        - E1-2200
140                                                                      - E1-2500
141                                                                      - E2-3000
142                                                                      - E2-3800
143                                                                      - A4-5000
144                                                                      - A4-5100
145                                                                      - A6-5200
146                                                                      - A4 Pro-3340B
147     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
148                                                                      - Radeon HD 8770
149                                                                      - R7 260
150                                                                      - R7 260X
151     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
152     -----------------------------------------------------------------------------------
153     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
154                                                      [on]            - Pro A6-8500B
155                                                                      - A8-8600P
156                                                                      - Pro A8-8600B
157                                                                      - FX-8800P
158                                                                      - Pro A12-8800B
159     \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
160                                                      [on]            - Pro A10-8700B
161                                                                      - A10-8780P
162     \                           ``amdgcn``   APU   - xnack           - A10-9600P
163                                                      [on]            - A10-9630P
164                                                                      - A12-9700P
165                                                                      - A12-9730P
166                                                                      - FX-9800P
167                                                                      - FX-9830P
168     \                           ``amdgcn``   APU   - xnack           - E2-9010
169                                                      [on]            - A6-9210
170                                                                      - A9-9410
171     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
172                 - ``tonga``                          [off]           - FirePro S7100
173                                                                      - FirePro W7100
174                                                                      - Radeon R285
175                                                                      - Radeon R9 380
176                                                                      - Radeon R9 385
177                                                                      - Mobile FirePro
178                                                                        M7170
179     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
180                                                      [off]           - Radeon R9 Fury
181                                                                      - Radeon R9 FuryX
182                                                                      - Radeon Pro Duo
183                                                                      - FirePro S9300x2
184                                                                      - Radeon Instinct MI8
185     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
186                                                      [off]           - Radeon RX 480
187                                                                      - Radeon Instinct MI6
188     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
189                                                      [off]
190     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
191                                                      [on]
192     **GCN GFX9** [AMD-GCN-GFX9]_
193     -----------------------------------------------------------------------------------
194     ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
195                                                      [off]             Frontier Edition
196                                                                      - Radeon RX Vega 56
197                                                                      - Radeon RX Vega 64
198                                                                      - Radeon RX Vega 64
199                                                                        Liquid
200                                                                      - Radeon Instinct MI25
201     ``gfx902``                  ``amdgcn``   APU   - xnack           - Ryzen 3 2200G
202                                                      [on]            - Ryzen 5 2400G
203     ``gfx904``                  ``amdgcn``   dGPU  - xnack           *TBA*
204                                                      [off]
205                                                                      .. TODO
206                                                                         Add product
207                                                                         names.
208     ``gfx906``                  ``amdgcn``   dGPU  - xnack           *TBA*
209                                                      [off]
210                                                                      .. TODO
211                                                                         Add product
212                                                                         names.
213     =========== =============== ============ ===== ========= ======= ==================
214
215.. _amdgpu-target-features:
216
217Target Features
218---------------
219
220Target features control how code is generated to support certain
221processor specific features. Not all target features are supported by
222all processors. The runtime must ensure that the features supported by
223the device used to execute the code match the features enabled when
224generating the code. A mismatch of features may result in incorrect
225execution, or a reduction in performance.
226
227The target features supported by each processor, and the default value
228used if not specified explicitly, is listed in
229:ref:`amdgpu-processor-table`.
230
231Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
232target features.
233
234For example:
235
236``-mxnack``
237  Enable the ``xnack`` feature.
238``-mno-xnack``
239  Disable the ``xnack`` feature.
240
241  .. table:: AMDGPU Target Features
242     :name: amdgpu-target-feature-table
243
244     ============== ==================================================
245     Target Feature Description
246     ============== ==================================================
247     -m[no-]xnack   Enable/disable generating code that has
248                    memory clauses that are compatible with
249                    having XNACK replay enabled.
250
251                    This is used for demand paging and page
252                    migration. If XNACK replay is enabled in
253                    the device, then if a page fault occurs
254                    the code may execute incorrectly if the
255                    ``xnack`` feature is not enabled. Executing
256                    code that has the feature enabled on a
257                    device that does not have XNACK replay
258                    enabled will execute correctly, but may
259                    be less performant than code with the
260                    feature disabled.
261     ============== ==================================================
262
263.. _amdgpu-address-spaces:
264
265Address Spaces
266--------------
267
268The AMDGPU backend uses the following address space mappings.
269
270The memory space names used in the table, aside from the region memory space, is
271from the OpenCL standard.
272
273LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
274
275  .. table:: Address Space Mapping
276     :name: amdgpu-address-space-mapping-table
277
278     ================== =================
279     LLVM Address Space Memory Space
280     ================== =================
281     0                  Generic (Flat)
282     1                  Global
283     2                  Region (GDS)
284     3                  Local (group/LDS)
285     4                  Constant
286     5                  Private (Scratch)
287     6                  Constant 32-bit
288     ================== =================
289
290.. _amdgpu-memory-scopes:
291
292Memory Scopes
293-------------
294
295This section provides LLVM memory synchronization scopes supported by the AMDGPU
296backend memory model when the target triple OS is ``amdhsa`` (see
297:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
298
299The memory model supported is based on the HSA memory model [HSA]_ which is
300based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
301relation is transitive over the synchonizes-with relation independent of scope,
302and synchonizes-with allows the memory scope instances to be inclusive (see
303table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
304
305This is different to the OpenCL [OpenCL]_ memory model which does not have scope
306inclusion and requires the memory scopes to exactly match. However, this
307is conservatively correct for OpenCL.
308
309  .. table:: AMDHSA LLVM Sync Scopes
310     :name: amdgpu-amdhsa-llvm-sync-scopes-table
311
312     ================ ==========================================================
313     LLVM Sync Scope  Description
314     ================ ==========================================================
315     *none*           The default: ``system``.
316
317                      Synchronizes with, and participates in modification and
318                      seq_cst total orderings with, other operations (except
319                      image operations) for all address spaces (except private,
320                      or generic that accesses private) provided the other
321                      operation's sync scope is:
322
323                      - ``system``.
324                      - ``agent`` and executed by a thread on the same agent.
325                      - ``workgroup`` and executed by a thread in the same
326                        workgroup.
327                      - ``wavefront`` and executed by a thread in the same
328                        wavefront.
329
330     ``agent``        Synchronizes with, and participates in modification and
331                      seq_cst total orderings with, other operations (except
332                      image operations) for all address spaces (except private,
333                      or generic that accesses private) provided the other
334                      operation's sync scope is:
335
336                      - ``system`` or ``agent`` and executed by a thread on the
337                        same agent.
338                      - ``workgroup`` and executed by a thread in the same
339                        workgroup.
340                      - ``wavefront`` and executed by a thread in the same
341                        wavefront.
342
343     ``workgroup``    Synchronizes with, and participates in modification and
344                      seq_cst total orderings with, other operations (except
345                      image operations) for all address spaces (except private,
346                      or generic that accesses private) provided the other
347                      operation's sync scope is:
348
349                      - ``system``, ``agent`` or ``workgroup`` and executed by a
350                        thread in the same workgroup.
351                      - ``wavefront`` and executed by a thread in the same
352                        wavefront.
353
354     ``wavefront``    Synchronizes with, and participates in modification and
355                      seq_cst total orderings with, other operations (except
356                      image operations) for all address spaces (except private,
357                      or generic that accesses private) provided the other
358                      operation's sync scope is:
359
360                      - ``system``, ``agent``, ``workgroup`` or ``wavefront``
361                        and executed by a thread in the same wavefront.
362
363     ``singlethread`` Only synchronizes with, and participates in modification
364                      and seq_cst total orderings with, other operations (except
365                      image operations) running in the same thread for all
366                      address spaces (for example, in signal handlers).
367     ================ ==========================================================
368
369AMDGPU Intrinsics
370-----------------
371
372The AMDGPU backend implements the following LLVM IR intrinsics.
373
374*This section is WIP.*
375
376.. TODO
377   List AMDGPU intrinsics
378
379AMDGPU Attributes
380-----------------
381
382The AMDGPU backend supports the following LLVM IR attributes.
383
384  .. table:: AMDGPU LLVM IR Attributes
385     :name: amdgpu-llvm-ir-attributes-table
386
387     ======================================= ==========================================================
388     LLVM Attribute                          Description
389     ======================================= ==========================================================
390     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
391                                             will be specified when the kernel is dispatched. Generated
392                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
393     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
394                                             argument block size for the implicit arguments. This
395                                             varies by OS and language (for OpenCL see
396                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
397     "amdgpu-max-work-group-size"="n"        Specify the maximum work-group size that will be specifed
398                                             when the kernel is dispatched.
399     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
400                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
401     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
402                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
403     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
404                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
405                                             CLANG attribute [CLANG-ATTR]_.
406     ======================================= ==========================================================
407
408Code Object
409===========
410
411The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
412can be linked by ``lld`` to produce a standard ELF shared code object which can
413be loaded and executed on an AMDGPU target.
414
415Header
416------
417
418The AMDGPU backend uses the following ELF header:
419
420  .. table:: AMDGPU ELF Header
421     :name: amdgpu-elf-header-table
422
423     ========================== ===============================
424     Field                      Value
425     ========================== ===============================
426     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
427     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
428     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
429                                - ``ELFOSABI_AMDGPU_HSA``
430                                - ``ELFOSABI_AMDGPU_PAL``
431                                - ``ELFOSABI_AMDGPU_MESA3D``
432     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
433                                - ``ELFABIVERSION_AMDGPU_PAL``
434                                - ``ELFABIVERSION_AMDGPU_MESA3D``
435     ``e_type``                 - ``ET_REL``
436                                - ``ET_DYN``
437     ``e_machine``              ``EM_AMDGPU``
438     ``e_entry``                0
439     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
440     ========================== ===============================
441
442..
443
444  .. table:: AMDGPU ELF Header Enumeration Values
445     :name: amdgpu-elf-header-enumeration-values-table
446
447     =============================== =====
448     Name                            Value
449     =============================== =====
450     ``EM_AMDGPU``                   224
451     ``ELFOSABI_NONE``               0
452     ``ELFOSABI_AMDGPU_HSA``         64
453     ``ELFOSABI_AMDGPU_PAL``         65
454     ``ELFOSABI_AMDGPU_MESA3D``      66
455     ``ELFABIVERSION_AMDGPU_HSA``    1
456     ``ELFABIVERSION_AMDGPU_PAL``    0
457     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
458     =============================== =====
459
460``e_ident[EI_CLASS]``
461  The ELF class is:
462
463  * ``ELFCLASS32`` for ``r600`` architecture.
464
465  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
466    bit applications.
467
468``e_ident[EI_DATA]``
469  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
470
471``e_ident[EI_OSABI]``
472  One of the following AMD GPU architecture specific OS ABIs
473  (see :ref:`amdgpu-os-table`):
474
475  * ``ELFOSABI_NONE`` for *unknown* OS.
476
477  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
478
479  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
480
481  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
482
483``e_ident[EI_ABIVERSION]``
484  The ABI version of the AMD GPU architecture specific OS ABI to which the code
485  object conforms:
486
487  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
488    runtime ABI.
489
490  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
491    runtime ABI.
492
493  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
494    3D runtime ABI.
495
496``e_type``
497  Can be one of the following values:
498
499
500  ``ET_REL``
501    The type produced by the AMD GPU backend compiler as it is relocatable code
502    object.
503
504  ``ET_DYN``
505    The type produced by the linker as it is a shared code object.
506
507  The AMD HSA runtime loader requires a ``ET_DYN`` code object.
508
509``e_machine``
510  The value ``EM_AMDGPU`` is used for the machine for all processors supported
511  by the ``r600`` and ``amdgcn`` architectures (see
512  :ref:`amdgpu-processor-table`). The specific processor is specified in the
513  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
514  :ref:`amdgpu-elf-header-e_flags-table`).
515
516``e_entry``
517  The entry point is 0 as the entry points for individual kernels must be
518  selected in order to invoke them through AQL packets.
519
520``e_flags``
521  The AMDGPU backend uses the following ELF header flags:
522
523  .. table:: AMDGPU ELF Header ``e_flags``
524     :name: amdgpu-elf-header-e_flags-table
525
526     ================================= ========== =============================
527     Name                              Value      Description
528     ================================= ========== =============================
529     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
530     -------------------------------------------- -----------------------------
531     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
532                                                  mask for
533                                                  ``EF_AMDGPU_MACH_xxx`` values
534                                                  defined in
535                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
536     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
537                                                  target feature is
538                                                  enabled for all code
539                                                  contained in the code object.
540                                                  If the processor
541                                                  does not support the
542                                                  ``xnack`` target
543                                                  feature then must
544                                                  be 0.
545                                                  See
546                                                  :ref:`amdgpu-target-features`.
547     ================================= ========== =============================
548
549  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
550     :name: amdgpu-ef-amdgpu-mach-table
551
552     ================================= ========== =============================
553     Name                              Value      Description (see
554                                                  :ref:`amdgpu-processor-table`)
555     ================================= ========== =============================
556     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
557     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
558     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
559     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
560     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
561     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
562     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
563     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
564     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
565     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
566     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
567     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
568     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
569     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
570     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
571     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
572     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
573     *reserved*                        0x011 -    Reserved for ``r600``
574                                       0x01f      architecture processors.
575     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
576     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
577     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
578     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
579     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
580     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
581     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
582     *reserved*                        0x027      Reserved.
583     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
584     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
585     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
586     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
587     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
588     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
589     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
590     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
591     *reserved*                        0x030      Reserved.
592     ================================= ========== =============================
593
594Sections
595--------
596
597An AMDGPU target ELF code object has the standard ELF sections which include:
598
599  .. table:: AMDGPU ELF Sections
600     :name: amdgpu-elf-sections-table
601
602     ================== ================ =================================
603     Name               Type             Attributes
604     ================== ================ =================================
605     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
606     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
607     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
608     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
609     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
610     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
611     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
612     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
613     ``.note``          ``SHT_NOTE``     *none*
614     ``.rela``\ *name*  ``SHT_RELA``     *none*
615     ``.rela.dyn``      ``SHT_RELA``     *none*
616     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
617     ``.shstrtab``      ``SHT_STRTAB``   *none*
618     ``.strtab``        ``SHT_STRTAB``   *none*
619     ``.symtab``        ``SHT_SYMTAB``   *none*
620     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
621     ================== ================ =================================
622
623These sections have their standard meanings (see [ELF]_) and are only generated
624if needed.
625
626``.debug``\ *\**
627  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
628  DWARF produced by the AMDGPU backend.
629
630``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
631  The standard sections used by a dynamic loader.
632
633``.note``
634  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
635  backend.
636
637``.rela``\ *name*, ``.rela.dyn``
638  For relocatable code objects, *name* is the name of the section that the
639  relocation records apply. For example, ``.rela.text`` is the section name for
640  relocation records associated with the ``.text`` section.
641
642  For linked shared code objects, ``.rela.dyn`` contains all the relocation
643  records from each of the relocatable code object's ``.rela``\ *name* sections.
644
645  See :ref:`amdgpu-relocation-records` for the relocation records supported by
646  the AMDGPU backend.
647
648``.text``
649  The executable machine code for the kernels and functions they call. Generated
650  as position independent code. See :ref:`amdgpu-code-conventions` for
651  information on conventions used in the isa generation.
652
653.. _amdgpu-note-records:
654
655Note Records
656------------
657
658As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
659be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
660aligned. In addition, minimal zero byte padding must be generated to ensure the
661``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
662``.note`` section must be at least 4 to indicate at least 8 byte alignment.
663
664The AMDGPU backend code object uses the following ELF note records in the
665``.note`` section. The *Description* column specifies the layout of the note
666record's ``desc`` field. All fields are consecutive bytes. Note records with
667variable size strings have a corresponding ``*_size`` field that specifies the
668number of bytes, including the terminating null character, in the string. The
669string(s) come immediately after the preceding fields.
670
671Additional note records can be present.
672
673  .. table:: AMDGPU ELF Note Records
674     :name: amdgpu-elf-note-records-table
675
676     ===== ============================== ======================================
677     Name  Type                           Description
678     ===== ============================== ======================================
679     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
680     ===== ============================== ======================================
681
682..
683
684  .. table:: AMDGPU ELF Note Record Enumeration Values
685     :name: amdgpu-elf-note-record-enumeration-values-table
686
687     ============================== =====
688     Name                           Value
689     ============================== =====
690     *reserved*                       0-9
691     ``NT_AMD_AMDGPU_HSA_METADATA``    10
692     *reserved*                        11
693     ============================== =====
694
695``NT_AMD_AMDGPU_HSA_METADATA``
696  Specifies extensible metadata associated with the code objects executed on HSA
697  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
698  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
699  :ref:`amdgpu-amdhsa-code-object-metadata` for the syntax of the code
700  object metadata string.
701
702.. _amdgpu-symbols:
703
704Symbols
705-------
706
707Symbols include the following:
708
709  .. table:: AMDGPU ELF Symbols
710     :name: amdgpu-elf-symbols-table
711
712     ===================== ============== ============= ==================
713     Name                  Type           Section       Description
714     ===================== ============== ============= ==================
715     *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
716                                          - ``.rodata``
717                                          - ``.bss``
718     *link-name*\ ``.kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
719     *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
720     ===================== ============== ============= ==================
721
722Global variable
723  Global variables both used and defined by the compilation unit.
724
725  If the symbol is defined in the compilation unit then it is allocated in the
726  appropriate section according to if it has initialized data or is readonly.
727
728  If the symbol is external then its section is ``STN_UNDEF`` and the loader
729  will resolve relocations using the definition provided by another code object
730  or explicitly defined by the runtime.
731
732  All global symbols, whether defined in the compilation unit or external, are
733  accessed by the machine code indirectly through a GOT table entry. This
734  allows them to be preemptable. The GOT table is only supported when the target
735  triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
736
737  .. TODO
738     Add description of linked shared object symbols. Seems undefined symbols
739     are marked as STT_NOTYPE.
740
741Kernel descriptor
742  Every HSA kernel has an associated kernel descriptor. It is the address of the
743  kernel descriptor that is used in the AQL dispatch packet used to invoke the
744  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
745  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
746
747Kernel entry point
748  Every HSA kernel also has a symbol for its machine code entry point.
749
750.. _amdgpu-relocation-records:
751
752Relocation Records
753------------------
754
755AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
756relocatable fields are:
757
758``word32``
759  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
760  alignment. These values use the same byte order as other word values in the
761  AMD GPU architecture.
762
763``word64``
764  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
765  alignment. These values use the same byte order as other word values in the
766  AMD GPU architecture.
767
768Following notations are used for specifying relocation calculations:
769
770**A**
771  Represents the addend used to compute the value of the relocatable field.
772
773**G**
774  Represents the offset into the global offset table at which the relocation
775  entry's symbol will reside during execution.
776
777**GOT**
778  Represents the address of the global offset table.
779
780**P**
781  Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
782  of the storage unit being relocated (computed using ``r_offset``).
783
784**S**
785  Represents the value of the symbol whose index resides in the relocation
786  entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
787
788**B**
789  Represents the base address of a loaded executable or shared object which is
790  the difference between the ELF address and the actual load address. Relocations
791  using this are only valid in executable or shared objects.
792
793The following relocation types are supported:
794
795  .. table:: AMDGPU ELF Relocation Records
796     :name: amdgpu-elf-relocation-records-table
797
798     ========================== ======= =====  ==========  ==============================
799     Relocation Type            Kind    Value  Field       Calculation
800     ========================== ======= =====  ==========  ==============================
801     ``R_AMDGPU_NONE``                  0      *none*      *none*
802     ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
803                                Dynamic
804     ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
805                                Dynamic
806     ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
807                                Dynamic
808     ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
809     ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
810     ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
811                                Dynamic
812     ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
813     ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
814     ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
815     ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
816     ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
817     *reserved*                         12
818     ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
819     ========================== ======= =====  ==========  ==============================
820
821``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
822the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
823
824There is no current OS loader support for 32 bit programs and so
825``R_AMDGPU_ABS32`` is not used.
826
827.. _amdgpu-dwarf:
828
829DWARF
830-----
831
832Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
833information that maps the code object executable code and data to the source
834language constructs. It can be used by tools such as debuggers and profilers.
835
836Address Space Mapping
837~~~~~~~~~~~~~~~~~~~~~
838
839The following address space mapping is used:
840
841  .. table:: AMDGPU DWARF Address Space Mapping
842     :name: amdgpu-dwarf-address-space-mapping-table
843
844     =================== =================
845     DWARF Address Space Memory Space
846     =================== =================
847     1                   Private (Scratch)
848     2                   Local (group/LDS)
849     *omitted*           Global
850     *omitted*           Constant
851     *omitted*           Generic (Flat)
852     *not supported*     Region (GDS)
853     =================== =================
854
855See :ref:`amdgpu-address-spaces` for information on the memory space terminology
856used in the table.
857
858An ``address_class`` attribute is generated on pointer type DIEs to specify the
859DWARF address space of the value of the pointer when it is in the *private* or
860*local* address space. Otherwise the attribute is omitted.
861
862An ``XDEREF`` operation is generated in location list expressions for variables
863that are allocated in the *private* and *local* address space. Otherwise no
864``XDREF`` is omitted.
865
866Register Mapping
867~~~~~~~~~~~~~~~~
868
869*This section is WIP.*
870
871.. TODO
872   Define DWARF register enumeration.
873
874   If want to present a wavefront state then should expose vector registers as
875   64 wide (rather than per work-item view that LLVM uses). Either as separate
876   registers, or a 64x4 byte single register. In either case use a new LANE op
877   (akin to XDREF) to select the current lane usage in a location
878   expression. This would also allow scalar register spilling to vector register
879   lanes to be expressed (currently no debug information is being generated for
880   spilling). If choose a wide single register approach then use LANE in
881   conjunction with PIECE operation to select the dword part of the register for
882   the current lane. If the separate register approach then use LANE to select
883   the register.
884
885Source Text
886~~~~~~~~~~~
887
888Source text for online-compiled programs (e.g. those compiled by the OpenCL
889runtime) may be embedded into the DWARF v5 line table using the ``clang
890-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
891
892For example:
893
894``-gembed-source``
895  Enable the embedded source DWARF v5 extension.
896``-gno-embed-source``
897  Disable the embedded source DWARF v5 extension.
898
899  .. table:: AMDGPU Debug Options
900     :name: amdgpu-debug-options
901
902     ==================== ==================================================
903     Debug Flag           Description
904     ==================== ==================================================
905     -g[no-]embed-source  Enable/disable embedding source text in DWARF
906                          debug sections. Useful for environments where
907                          source cannot be written to disk, such as
908                          when performing online compilation.
909     ==================== ==================================================
910
911This option enables one extended content types in the DWARF v5 Line Number
912Program Header, which is used to encode embedded source.
913
914  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
915     :name: amdgpu-dwarf-extended-content-types
916
917     ============================  ======================
918     Content Type                  Form
919     ============================  ======================
920     ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
921     ============================  ======================
922
923The source field will contain the UTF-8 encoded, null-terminated source text
924with ``'\n'`` line endings. When the source field is present, consumers can use
925the embedded source instead of attempting to discover the source on disk. When
926the source field is absent, consumers can access the file to get the source
927text.
928
929The above content type appears in the ``file_name_entry_format`` field of the
930line table prologue, and its corresponding value appear in the ``file_names``
931field. The current encoding of the content type is documented in table
932:ref:`amdgpu-dwarf-extended-content-types-encoding`
933
934  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
935     :name: amdgpu-dwarf-extended-content-types-encoding
936
937     ============================  ====================
938     Content Type                  Value
939     ============================  ====================
940     ``DW_LNCT_LLVM_source``       0x2001
941     ============================  ====================
942
943.. _amdgpu-code-conventions:
944
945Code Conventions
946================
947
948This section provides code conventions used for each supported target triple OS
949(see :ref:`amdgpu-target-triples`).
950
951AMDHSA
952------
953
954This section provides code conventions used when the target triple OS is
955``amdhsa`` (see :ref:`amdgpu-target-triples`).
956
957.. _amdgpu-amdhsa-code-object-target-identification:
958
959Code Object Target Identification
960~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
961
962The AMDHSA OS uses the following syntax to specify the code object
963target as a single string:
964
965  ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
966
967Where:
968
969  - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
970    are the same as the *Target Triple* (see
971    :ref:`amdgpu-target-triples`).
972
973  - ``<Processor>`` is the same as the *Processor* (see
974    :ref:`amdgpu-processors`).
975
976  - ``<Target Features>`` is a list of the enabled *Target Features*
977    (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
978    apply to *Processor*. The list must be in the same order as listed
979    in the table :ref:`amdgpu-target-feature-table`. Note that *Target
980    Features* must be included in the list if they are enabled even if
981    that is the default for *Processor*.
982
983For example:
984
985  ``"amdgcn-amd-amdhsa--gfx902+xnack"``
986
987.. _amdgpu-amdhsa-code-object-metadata:
988
989Code Object Metadata
990~~~~~~~~~~~~~~~~~~~~
991
992The code object metadata specifies extensible metadata associated with the code
993objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
994[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
995(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
996``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
997information necessary to support the ROCM kernel queries. For example, the
998segment sizes needed in a dispatch packet. In addition, a high level language
999runtime may require other information to be included. For example, the AMD
1000OpenCL runtime records kernel argument information.
1001
1002The metadata is specified as a YAML formatted string (see [YAML]_ and
1003:doc:`YamlIO`).
1004
1005.. TODO
1006   Is the string null terminated? It probably should not if YAML allows it to
1007   contain null characters, otherwise it should be.
1008
1009The metadata is represented as a single YAML document comprised of the mapping
1010defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
1011referenced tables.
1012
1013For boolean values, the string values of ``false`` and ``true`` are used for
1014false and true respectively.
1015
1016Additional information can be added to the mappings. To avoid conflicts, any
1017non-AMD key names should be prefixed by "*vendor-name*.".
1018
1019  .. table:: AMDHSA Code Object Metadata Mapping
1020     :name: amdgpu-amdhsa-code-object-metadata-mapping-table
1021
1022     ========== ============== ========= =======================================
1023     String Key Value Type     Required? Description
1024     ========== ============== ========= =======================================
1025     "Version"  sequence of    Required  - The first integer is the major
1026                2 integers                 version. Currently 1.
1027                                         - The second integer is the minor
1028                                           version. Currently 0.
1029     "Printf"   sequence of              Each string is encoded information
1030                strings                  about a printf function call. The
1031                                         encoded information is organized as
1032                                         fields separated by colon (':'):
1033
1034                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1035
1036                                         where:
1037
1038                                         ``ID``
1039                                           A 32 bit integer as a unique id for
1040                                           each printf function call
1041
1042                                         ``N``
1043                                           A 32 bit integer equal to the number
1044                                           of arguments of printf function call
1045                                           minus 1
1046
1047                                         ``S[i]`` (where i = 0, 1, ... , N-1)
1048                                           32 bit integers for the size in bytes
1049                                           of the i-th FormatString argument of
1050                                           the printf function call
1051
1052                                         FormatString
1053                                           The format string passed to the
1054                                           printf function call.
1055     "Kernels"  sequence of    Required  Sequence of the mappings for each
1056                mapping                  kernel in the code object. See
1057                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1058                                         for the definition of the mapping.
1059     ========== ============== ========= =======================================
1060
1061..
1062
1063  .. table:: AMDHSA Code Object Kernel Metadata Mapping
1064     :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1065
1066     ================= ============== ========= ================================
1067     String Key        Value Type     Required? Description
1068     ================= ============== ========= ================================
1069     "Name"            string         Required  Source name of the kernel.
1070     "SymbolName"      string         Required  Name of the kernel
1071                                                descriptor ELF symbol.
1072     "Language"        string                   Source language of the kernel.
1073                                                Values include:
1074
1075                                                - "OpenCL C"
1076                                                - "OpenCL C++"
1077                                                - "HCC"
1078                                                - "OpenMP"
1079
1080     "LanguageVersion" sequence of              - The first integer is the major
1081                       2 integers                 version.
1082                                                - The second integer is the
1083                                                  minor version.
1084     "Attrs"           mapping                  Mapping of kernel attributes.
1085                                                See
1086                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1087                                                for the mapping definition.
1088     "Args"            sequence of              Sequence of mappings of the
1089                       mapping                  kernel arguments. See
1090                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1091                                                for the definition of the mapping.
1092     "CodeProps"       mapping                  Mapping of properties related to
1093                                                the kernel code. See
1094                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1095                                                for the mapping definition.
1096     ================= ============== ========= ================================
1097
1098..
1099
1100  .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1101     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1102
1103     =================== ============== ========= ==============================
1104     String Key          Value Type     Required? Description
1105     =================== ============== ========= ==============================
1106     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
1107                         3 integers               must be >=1 and the dispatch
1108                                                  work-group size X, Y, Z must
1109                                                  correspond to the specified
1110                                                  values. Defaults to 0, 0, 0.
1111
1112                                                  Corresponds to the OpenCL
1113                                                  ``reqd_work_group_size``
1114                                                  attribute.
1115     "WorkGroupSizeHint" sequence of              The dispatch work-group size
1116                         3 integers               X, Y, Z is likely to be the
1117                                                  specified values.
1118
1119                                                  Corresponds to the OpenCL
1120                                                  ``work_group_size_hint``
1121                                                  attribute.
1122     "VecTypeHint"       string                   The name of a scalar or vector
1123                                                  type.
1124
1125                                                  Corresponds to the OpenCL
1126                                                  ``vec_type_hint`` attribute.
1127
1128     "RuntimeHandle"     string                   The external symbol name
1129                                                  associated with a kernel.
1130                                                  OpenCL runtime allocates a
1131                                                  global buffer for the symbol
1132                                                  and saves the kernel's address
1133                                                  to it, which is used for
1134                                                  device side enqueueing. Only
1135                                                  available for device side
1136                                                  enqueued kernels.
1137     =================== ============== ========= ==============================
1138
1139..
1140
1141  .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1142     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1143
1144     ================= ============== ========= ================================
1145     String Key        Value Type     Required? Description
1146     ================= ============== ========= ================================
1147     "Name"            string                   Kernel argument name.
1148     "TypeName"        string                   Kernel argument type name.
1149     "Size"            integer        Required  Kernel argument size in bytes.
1150     "Align"           integer        Required  Kernel argument alignment in
1151                                                bytes. Must be a power of two.
1152     "ValueKind"       string         Required  Kernel argument kind that
1153                                                specifies how to set up the
1154                                                corresponding argument.
1155                                                Values include:
1156
1157                                                "ByValue"
1158                                                  The argument is copied
1159                                                  directly into the kernarg.
1160
1161                                                "GlobalBuffer"
1162                                                  A global address space pointer
1163                                                  to the buffer data is passed
1164                                                  in the kernarg.
1165
1166                                                "DynamicSharedPointer"
1167                                                  A group address space pointer
1168                                                  to dynamically allocated LDS
1169                                                  is passed in the kernarg.
1170
1171                                                "Sampler"
1172                                                  A global address space
1173                                                  pointer to a S# is passed in
1174                                                  the kernarg.
1175
1176                                                "Image"
1177                                                  A global address space
1178                                                  pointer to a T# is passed in
1179                                                  the kernarg.
1180
1181                                                "Pipe"
1182                                                  A global address space pointer
1183                                                  to an OpenCL pipe is passed in
1184                                                  the kernarg.
1185
1186                                                "Queue"
1187                                                  A global address space pointer
1188                                                  to an OpenCL device enqueue
1189                                                  queue is passed in the
1190                                                  kernarg.
1191
1192                                                "HiddenGlobalOffsetX"
1193                                                  The OpenCL grid dispatch
1194                                                  global offset for the X
1195                                                  dimension is passed in the
1196                                                  kernarg.
1197
1198                                                "HiddenGlobalOffsetY"
1199                                                  The OpenCL grid dispatch
1200                                                  global offset for the Y
1201                                                  dimension is passed in the
1202                                                  kernarg.
1203
1204                                                "HiddenGlobalOffsetZ"
1205                                                  The OpenCL grid dispatch
1206                                                  global offset for the Z
1207                                                  dimension is passed in the
1208                                                  kernarg.
1209
1210                                                "HiddenNone"
1211                                                  An argument that is not used
1212                                                  by the kernel. Space needs to
1213                                                  be left for it, but it does
1214                                                  not need to be set up.
1215
1216                                                "HiddenPrintfBuffer"
1217                                                  A global address space pointer
1218                                                  to the runtime printf buffer
1219                                                  is passed in kernarg.
1220
1221                                                "HiddenDefaultQueue"
1222                                                  A global address space pointer
1223                                                  to the OpenCL device enqueue
1224                                                  queue that should be used by
1225                                                  the kernel by default is
1226                                                  passed in the kernarg.
1227
1228                                                "HiddenCompletionAction"
1229                                                  A global address space pointer
1230                                                  to help link enqueued kernels into
1231                                                  the ancestor tree for determining
1232                                                  when the parent kernel has finished.
1233
1234     "ValueType"       string         Required  Kernel argument value type. Only
1235                                                present if "ValueKind" is
1236                                                "ByValue". For vector data
1237                                                types, the value is for the
1238                                                element type. Values include:
1239
1240                                                - "Struct"
1241                                                - "I8"
1242                                                - "U8"
1243                                                - "I16"
1244                                                - "U16"
1245                                                - "F16"
1246                                                - "I32"
1247                                                - "U32"
1248                                                - "F32"
1249                                                - "I64"
1250                                                - "U64"
1251                                                - "F64"
1252
1253                                                .. TODO
1254                                                   How can it be determined if a
1255                                                   vector type, and what size
1256                                                   vector?
1257     "PointeeAlign"    integer                  Alignment in bytes of pointee
1258                                                type for pointer type kernel
1259                                                argument. Must be a power
1260                                                of 2. Only present if
1261                                                "ValueKind" is
1262                                                "DynamicSharedPointer".
1263     "AddrSpaceQual"   string                   Kernel argument address space
1264                                                qualifier. Only present if
1265                                                "ValueKind" is "GlobalBuffer" or
1266                                                "DynamicSharedPointer". Values
1267                                                are:
1268
1269                                                - "Private"
1270                                                - "Global"
1271                                                - "Constant"
1272                                                - "Local"
1273                                                - "Generic"
1274                                                - "Region"
1275
1276                                                .. TODO
1277                                                   Is GlobalBuffer only Global
1278                                                   or Constant? Is
1279                                                   DynamicSharedPointer always
1280                                                   Local? Can HCC allow Generic?
1281                                                   How can Private or Region
1282                                                   ever happen?
1283     "AccQual"         string                   Kernel argument access
1284                                                qualifier. Only present if
1285                                                "ValueKind" is "Image" or
1286                                                "Pipe". Values
1287                                                are:
1288
1289                                                - "ReadOnly"
1290                                                - "WriteOnly"
1291                                                - "ReadWrite"
1292
1293                                                .. TODO
1294                                                   Does this apply to
1295                                                   GlobalBuffer?
1296     "ActualAccQual"   string                   The actual memory accesses
1297                                                performed by the kernel on the
1298                                                kernel argument. Only present if
1299                                                "ValueKind" is "GlobalBuffer",
1300                                                "Image", or "Pipe". This may be
1301                                                more restrictive than indicated
1302                                                by "AccQual" to reflect what the
1303                                                kernel actual does. If not
1304                                                present then the runtime must
1305                                                assume what is implied by
1306                                                "AccQual" and "IsConst". Values
1307                                                are:
1308
1309                                                - "ReadOnly"
1310                                                - "WriteOnly"
1311                                                - "ReadWrite"
1312
1313     "IsConst"         boolean                  Indicates if the kernel argument
1314                                                is const qualified. Only present
1315                                                if "ValueKind" is
1316                                                "GlobalBuffer".
1317
1318     "IsRestrict"      boolean                  Indicates if the kernel argument
1319                                                is restrict qualified. Only
1320                                                present if "ValueKind" is
1321                                                "GlobalBuffer".
1322
1323     "IsVolatile"      boolean                  Indicates if the kernel argument
1324                                                is volatile qualified. Only
1325                                                present if "ValueKind" is
1326                                                "GlobalBuffer".
1327
1328     "IsPipe"          boolean                  Indicates if the kernel argument
1329                                                is pipe qualified. Only present
1330                                                if "ValueKind" is "Pipe".
1331
1332                                                .. TODO
1333                                                   Can GlobalBuffer be pipe
1334                                                   qualified?
1335     ================= ============== ========= ================================
1336
1337..
1338
1339  .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1340     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1341
1342     ============================ ============== ========= =====================
1343     String Key                   Value Type     Required? Description
1344     ============================ ============== ========= =====================
1345     "KernargSegmentSize"         integer        Required  The size in bytes of
1346                                                           the kernarg segment
1347                                                           that holds the values
1348                                                           of the arguments to
1349                                                           the kernel.
1350     "GroupSegmentFixedSize"      integer        Required  The amount of group
1351                                                           segment memory
1352                                                           required by a
1353                                                           work-group in
1354                                                           bytes. This does not
1355                                                           include any
1356                                                           dynamically allocated
1357                                                           group segment memory
1358                                                           that may be added
1359                                                           when the kernel is
1360                                                           dispatched.
1361     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1362                                                           private address space
1363                                                           memory required for a
1364                                                           work-item in
1365                                                           bytes. If the kernel
1366                                                           uses a dynamic call
1367                                                           stack then additional
1368                                                           space must be added
1369                                                           to this value for the
1370                                                           call stack.
1371     "KernargSegmentAlign"        integer        Required  The maximum byte
1372                                                           alignment of
1373                                                           arguments in the
1374                                                           kernarg segment. Must
1375                                                           be a power of 2.
1376     "WavefrontSize"              integer        Required  Wavefront size. Must
1377                                                           be a power of 2.
1378     "NumSGPRs"                   integer        Required  Number of scalar
1379                                                           registers used by a
1380                                                           wavefront for
1381                                                           GFX6-GFX9. This
1382                                                           includes the special
1383                                                           SGPRs for VCC, Flat
1384                                                           Scratch (GFX7-GFX9)
1385                                                           and XNACK (for
1386                                                           GFX8-GFX9). It does
1387                                                           not include the 16
1388                                                           SGPR added if a trap
1389                                                           handler is
1390                                                           enabled. It is not
1391                                                           rounded up to the
1392                                                           allocation
1393                                                           granularity.
1394     "NumVGPRs"                   integer        Required  Number of vector
1395                                                           registers used by
1396                                                           each work-item for
1397                                                           GFX6-GFX9
1398     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1399                                                           work-group size
1400                                                           supported by the
1401                                                           kernel in work-items.
1402                                                           Must be >=1 and
1403                                                           consistent with
1404                                                           ReqdWorkGroupSize if
1405                                                           not 0, 0, 0.
1406     "NumSpilledSGPRs"            integer                  Number of stores from
1407                                                           a scalar register to
1408                                                           a register allocator
1409                                                           created spill
1410                                                           location.
1411     "NumSpilledVGPRs"            integer                  Number of stores from
1412                                                           a vector register to
1413                                                           a register allocator
1414                                                           created spill
1415                                                           location.
1416     ============================ ============== ========= =====================
1417
1418..
1419
1420Kernel Dispatch
1421~~~~~~~~~~~~~~~
1422
1423The HSA architected queuing language (AQL) defines a user space memory interface
1424that can be used to control the dispatch of kernels, in an agent independent
1425way. An agent can have zero or more AQL queues created for it using the ROCm
1426runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1427*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1428mechanics and packet layouts.
1429
1430The packet processor of a kernel agent is responsible for detecting and
1431dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1432packet processor is implemented by the hardware command processor (CP),
1433asynchronous dispatch controller (ADC) and shader processor input controller
1434(SPI).
1435
1436The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1437mode driver to initialize and register the AQL queue with CP.
1438
1439To dispatch a kernel the following actions are performed. This can occur in the
1440CPU host program, or from an HSA kernel executing on a GPU.
1441
14421. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1443   executed is obtained.
14442. A pointer to the kernel descriptor (see
1445   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1446   obtained. It must be for a kernel that is contained in a code object that that
1447   was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1448   associated.
14493. Space is allocated for the kernel arguments using the ROCm runtime allocator
1450   for a memory region with the kernarg property for the kernel agent that will
1451   execute the kernel. It must be at least 16 byte aligned.
14524. Kernel argument values are assigned to the kernel argument memory
1453   allocation. The layout is defined in the *HSA Programmer's Language Reference*
1454   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1455   memory in the same way constant memory is accessed. (Note that the HSA
1456   specification allows an implementation to copy the kernel argument contents to
1457   another location that is accessed by the kernel.)
14585. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1459   api uses 64 bit atomic operations to reserve space in the AQL queue for the
1460   packet. The packet must be set up, and the final write must use an atomic
1461   store release to set the packet kind to ensure the packet contents are
1462   visible to the kernel agent. AQL defines a doorbell signal mechanism to
1463   notify the kernel agent that the AQL queue has been updated. These rules, and
1464   the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1465   System Architecture Specification* [HSA]_.
14666. A kernel dispatch packet includes information about the actual dispatch,
1467   such as grid and work-group size, together with information from the code
1468   object about the kernel, such as segment sizes. The ROCm runtime queries on
1469   the kernel symbol can be used to obtain the code object values which are
1470   recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
14717. CP executes micro-code and is responsible for detecting and setting up the
1472   GPU to execute the wavefronts of a kernel dispatch.
14738. CP ensures that when the a wavefront starts executing the kernel machine
1474   code, the scalar general purpose registers (SGPR) and vector general purpose
1475   registers (VGPR) are set up as required by the machine code. The required
1476   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1477   register state is defined in
1478   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14799. The prolog of the kernel machine code (see
1480   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1481   before continuing executing the machine code that corresponds to the kernel.
148210. When the kernel dispatch has completed execution, CP signals the completion
1483    signal specified in the kernel dispatch packet if not 0.
1484
1485.. _amdgpu-amdhsa-memory-spaces:
1486
1487Memory Spaces
1488~~~~~~~~~~~~~
1489
1490The memory space properties are:
1491
1492  .. table:: AMDHSA Memory Spaces
1493     :name: amdgpu-amdhsa-memory-spaces-table
1494
1495     ================= =========== ======== ======= ==================
1496     Memory Space Name HSA Segment Hardware Address NULL Value
1497                       Name        Name     Size
1498     ================= =========== ======== ======= ==================
1499     Private           private     scratch  32      0x00000000
1500     Local             group       LDS      32      0xFFFFFFFF
1501     Global            global      global   64      0x0000000000000000
1502     Constant          constant    *same as 64      0x0000000000000000
1503                                   global*
1504     Generic           flat        flat     64      0x0000000000000000
1505     Region            N/A         GDS      32      *not implemented
1506                                                    for AMDHSA*
1507     ================= =========== ======== ======= ==================
1508
1509The global and constant memory spaces both use global virtual addresses, which
1510are the same virtual address space used by the CPU. However, some virtual
1511addresses may only be accessible to the CPU, some only accessible by the GPU,
1512and some by both.
1513
1514Using the constant memory space indicates that the data will not change during
1515the execution of the kernel. This allows scalar read instructions to be
1516used. The vector and scalar L1 caches are invalidated of volatile data before
1517each kernel dispatch execution to allow constant memory to change values between
1518kernel dispatches.
1519
1520The local memory space uses the hardware Local Data Store (LDS) which is
1521automatically allocated when the hardware creates work-groups of wavefronts, and
1522freed when all the wavefronts of a work-group have terminated. The data store
1523(DS) instructions can be used to access it.
1524
1525The private memory space uses the hardware scratch memory support. If the kernel
1526uses scratch, then the hardware allocates memory that is accessed using
1527wavefront lane dword (4 byte) interleaving. The mapping used from private
1528address to physical address is:
1529
1530  ``wavefront-scratch-base +
1531  (private-address * wavefront-size * 4) +
1532  (wavefront-lane-id * 4)``
1533
1534There are different ways that the wavefront scratch base address is determined
1535by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1536memory can be accessed in an interleaved manner using buffer instruction with
1537the scratch buffer descriptor and per wavefront scratch offset, by the scratch
1538instructions, or by flat instructions. If each lane of a wavefront accesses the
1539same private address, the interleaving results in adjacent dwords being accessed
1540and hence requires fewer cache lines to be fetched. Multi-dword access is not
1541supported except by flat and scratch instructions in GFX9.
1542
1543The generic address space uses the hardware flat address support available in
1544GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1545local appertures), that are outside the range of addressible global memory, to
1546map from a flat address to a private or local address.
1547
1548FLAT instructions can take a flat address and access global, private (scratch)
1549and group (LDS) memory depending in if the address is within one of the
1550apperture ranges. Flat access to scratch requires hardware aperture setup and
1551setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1552access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1553(see :ref:`amdgpu-amdhsa-m0`).
1554
1555To convert between a segment address and a flat address the base address of the
1556appertures address can be used. For GFX7-GFX8 these are available in the
1557:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1558Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1559GFX9 the appature base addresses are directly available as inline constant
1560registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1561address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1562which makes it easier to convert from flat to segment or segment to flat.
1563
1564Image and Samplers
1565~~~~~~~~~~~~~~~~~~
1566
1567Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1568hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1569HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1570enumeration values for the queries that are not trivially deducible from the S#
1571representation.
1572
1573HSA Signals
1574~~~~~~~~~~~
1575
1576HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1577structure allocated in memory accessible from both the CPU and GPU. The
1578structure is defined by the ROCm runtime and subject to change between releases
1579(see [AMD-ROCm-github]_).
1580
1581.. _amdgpu-amdhsa-hsa-aql-queue:
1582
1583HSA AQL Queue
1584~~~~~~~~~~~~~
1585
1586The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1587between releases (see [AMD-ROCm-github]_). For some processors it contains
1588fields needed to implement certain language features such as the flat address
1589aperture bases. It also contains fields used by CP such as managing the
1590allocation of scratch memory.
1591
1592.. _amdgpu-amdhsa-kernel-descriptor:
1593
1594Kernel Descriptor
1595~~~~~~~~~~~~~~~~~
1596
1597A kernel descriptor consists of the information needed by CP to initiate the
1598execution of a kernel, including the entry point address of the machine code
1599that implements the kernel.
1600
1601Kernel Descriptor for GFX6-GFX9
1602+++++++++++++++++++++++++++++++
1603
1604CP microcode requires the Kernel descriptor to be allocated on 64 byte
1605alignment.
1606
1607  .. table:: Kernel Descriptor for GFX6-GFX9
1608     :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1609
1610     ======= ======= =============================== ============================
1611     Bits    Size    Field Name                      Description
1612     ======= ======= =============================== ============================
1613     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
1614                                                     address space memory
1615                                                     required for a work-group
1616                                                     in bytes. This does not
1617                                                     include any dynamically
1618                                                     allocated local address
1619                                                     space memory that may be
1620                                                     added when the kernel is
1621                                                     dispatched.
1622     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
1623                                                     private address space
1624                                                     memory required for a
1625                                                     work-item in bytes. If
1626                                                     is_dynamic_callstack is 1
1627                                                     then additional space must
1628                                                     be added to this value for
1629                                                     the call stack.
1630     127:64  8 bytes                                 Reserved, must be 0.
1631     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
1632                                                     negative) from base
1633                                                     address of kernel
1634                                                     descriptor to kernel's
1635                                                     entry point instruction
1636                                                     which must be 256 byte
1637                                                     aligned.
1638     383:192 24                                      Reserved, must be 0.
1639             bytes
1640     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
1641                                                     program settings used by
1642                                                     CP to set up
1643                                                     ``COMPUTE_PGM_RSRC1``
1644                                                     configuration
1645                                                     register. See
1646                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1647     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
1648                                                     program settings used by
1649                                                     CP to set up
1650                                                     ``COMPUTE_PGM_RSRC2``
1651                                                     configuration
1652                                                     register. See
1653                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1654     448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1655                     _BUFFER                         SGPR user data registers
1656                                                     (see
1657                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1658
1659                                                     The total number of SGPR
1660                                                     user data registers
1661                                                     requested must not exceed
1662                                                     16 and match value in
1663                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1664                                                     Any requests beyond 16
1665                                                     will be ignored.
1666     449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
1667     450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
1668     451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
1669     452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
1670     453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
1671     454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
1672                     _SIZE
1673     455     1 bit                                   Reserved, must be 0.
1674     511:456 8 bytes                                 Reserved, must be 0.
1675     512     **Total size 64 bytes.**
1676     ======= ====================================================================
1677
1678..
1679
1680  .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1681     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1682
1683     ======= ======= =============================== ===========================================================================
1684     Bits    Size    Field Name                      Description
1685     ======= ======= =============================== ===========================================================================
1686     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
1687                                                     blocks used by each work-item;
1688                                                     granularity is device
1689                                                     specific:
1690
1691                                                     GFX6-GFX9
1692                                                       - vgprs_used 0..256
1693                                                       - max(0, ceil(vgprs_used / 4) - 1)
1694
1695                                                     Where vgprs_used is defined
1696                                                     as the highest VGPR number
1697                                                     explicitly referenced plus
1698                                                     one.
1699
1700                                                     Used by CP to set up
1701                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
1702
1703                                                     The
1704                                                     :ref:`amdgpu-assembler`
1705                                                     calculates this
1706                                                     automatically for the
1707                                                     selected processor from
1708                                                     values provided to the
1709                                                     `.amdhsa_kernel` directive
1710                                                     by the
1711                                                     `.amdhsa_next_free_vgpr`
1712                                                     nested directive (see
1713                                                     :ref:`amdhsa-kernel-directives-table`).
1714     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
1715                                                     blocks used by a wavefront;
1716                                                     granularity is device
1717                                                     specific:
1718
1719                                                     GFX6-GFX8
1720                                                       - sgprs_used 0..112
1721                                                       - max(0, ceil(sgprs_used / 8) - 1)
1722                                                     GFX9
1723                                                       - sgprs_used 0..112
1724                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
1725
1726                                                     Where sgprs_used is
1727                                                     defined as the highest
1728                                                     SGPR number explicitly
1729                                                     referenced plus one, plus
1730                                                     a target-specific number
1731                                                     of additional special
1732                                                     SGPRs for VCC,
1733                                                     FLAT_SCRATCH (GFX7+) and
1734                                                     XNACK_MASK (GFX8+), and
1735                                                     any additional
1736                                                     target-specific
1737                                                     limitations. It does not
1738                                                     include the 16 SGPRs added
1739                                                     if a trap handler is
1740                                                     enabled.
1741
1742                                                     The target-specific
1743                                                     limitations and special
1744                                                     SGPR layout are defined in
1745                                                     the hardware
1746                                                     documentation, which can
1747                                                     be found in the
1748                                                     :ref:`amdgpu-processors`
1749                                                     table.
1750
1751                                                     Used by CP to set up
1752                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
1753
1754                                                     The
1755                                                     :ref:`amdgpu-assembler`
1756                                                     calculates this
1757                                                     automatically for the
1758                                                     selected processor from
1759                                                     values provided to the
1760                                                     `.amdhsa_kernel` directive
1761                                                     by the
1762                                                     `.amdhsa_next_free_sgpr`
1763                                                     and `.amdhsa_reserve_*`
1764                                                     nested directives (see
1765                                                     :ref:`amdhsa-kernel-directives-table`).
1766     11:10   2 bits  PRIORITY                        Must be 0.
1767
1768                                                     Start executing wavefront
1769                                                     at the specified priority.
1770
1771                                                     CP is responsible for
1772                                                     filling in
1773                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
1774     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1775                                                     with specified rounding
1776                                                     mode for single (32
1777                                                     bit) floating point
1778                                                     precision floating point
1779                                                     operations.
1780
1781                                                     Floating point rounding
1782                                                     mode values are defined in
1783                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1784
1785                                                     Used by CP to set up
1786                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1787     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1788                                                     with specified rounding
1789                                                     denorm mode for half/double (16
1790                                                     and 64 bit) floating point
1791                                                     precision floating point
1792                                                     operations.
1793
1794                                                     Floating point rounding
1795                                                     mode values are defined in
1796                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1797
1798                                                     Used by CP to set up
1799                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1800     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1801                                                     with specified denorm mode
1802                                                     for single (32
1803                                                     bit)  floating point
1804                                                     precision floating point
1805                                                     operations.
1806
1807                                                     Floating point denorm mode
1808                                                     values are defined in
1809                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1810
1811                                                     Used by CP to set up
1812                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1813     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1814                                                     with specified denorm mode
1815                                                     for half/double (16
1816                                                     and 64 bit) floating point
1817                                                     precision floating point
1818                                                     operations.
1819
1820                                                     Floating point denorm mode
1821                                                     values are defined in
1822                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1823
1824                                                     Used by CP to set up
1825                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1826     20      1 bit   PRIV                            Must be 0.
1827
1828                                                     Start executing wavefront
1829                                                     in privilege trap handler
1830                                                     mode.
1831
1832                                                     CP is responsible for
1833                                                     filling in
1834                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
1835     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1836                                                     with DX10 clamp mode
1837                                                     enabled. Used by the vector
1838                                                     ALU to force DX10 style
1839                                                     treatment of NaN's (when
1840                                                     set, clamp NaN to zero,
1841                                                     otherwise pass NaN
1842                                                     through).
1843
1844                                                     Used by CP to set up
1845                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1846     22      1 bit   DEBUG_MODE                      Must be 0.
1847
1848                                                     Start executing wavefront
1849                                                     in single step mode.
1850
1851                                                     CP is responsible for
1852                                                     filling in
1853                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1854     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1855                                                     with IEEE mode
1856                                                     enabled. Floating point
1857                                                     opcodes that support
1858                                                     exception flag gathering
1859                                                     will quiet and propagate
1860                                                     signaling-NaN inputs per
1861                                                     IEEE 754-2008. Min_dx10 and
1862                                                     max_dx10 become IEEE
1863                                                     754-2008 compliant due to
1864                                                     signaling-NaN propagation
1865                                                     and quieting.
1866
1867                                                     Used by CP to set up
1868                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1869     24      1 bit   BULKY                           Must be 0.
1870
1871                                                     Only one work-group allowed
1872                                                     to execute on a compute
1873                                                     unit.
1874
1875                                                     CP is responsible for
1876                                                     filling in
1877                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
1878     25      1 bit   CDBG_USER                       Must be 0.
1879
1880                                                     Flag that can be used to
1881                                                     control debugging code.
1882
1883                                                     CP is responsible for
1884                                                     filling in
1885                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1886     26      1 bit   FP16_OVFL                       GFX6-GFX8
1887                                                       Reserved, must be 0.
1888                                                     GFX9
1889                                                       Wavefront starts execution
1890                                                       with specified fp16 overflow
1891                                                       mode.
1892
1893                                                       - If 0, fp16 overflow generates
1894                                                         +/-INF values.
1895                                                       - If 1, fp16 overflow that is the
1896                                                         result of an +/-INF input value
1897                                                         or divide by 0 produces a +/-INF,
1898                                                         otherwise clamps computed
1899                                                         overflow to +/-MAX_FP16 as
1900                                                         appropriate.
1901
1902                                                       Used by CP to set up
1903                                                       ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1904     31:27   5 bits                                  Reserved, must be 0.
1905     32      **Total size 4 bytes**
1906     ======= ===================================================================================================================
1907
1908..
1909
1910  .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1911     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1912
1913     ======= ======= =============================== ===========================================================================
1914     Bits    Size    Field Name                      Description
1915     ======= ======= =============================== ===========================================================================
1916     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1917                     _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
1918                                                     system register (see
1919                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1920
1921                                                     Used by CP to set up
1922                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1923     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1924                                                     user data registers
1925                                                     requested. This number must
1926                                                     match the number of user
1927                                                     data registers enabled.
1928
1929                                                     Used by CP to set up
1930                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1931     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
1932
1933                                                     This bit represents
1934                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
1935                                                     which is set by the CP if
1936                                                     the runtime has installed a
1937                                                     trap handler.
1938     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1939                                                     system SGPR register for
1940                                                     the work-group id in the X
1941                                                     dimension (see
1942                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1943
1944                                                     Used by CP to set up
1945                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1946     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1947                                                     system SGPR register for
1948                                                     the work-group id in the Y
1949                                                     dimension (see
1950                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1951
1952                                                     Used by CP to set up
1953                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1954     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1955                                                     system SGPR register for
1956                                                     the work-group id in the Z
1957                                                     dimension (see
1958                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1959
1960                                                     Used by CP to set up
1961                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1962     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1963                                                     system SGPR register for
1964                                                     work-group information (see
1965                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1966
1967                                                     Used by CP to set up
1968                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1969     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1970                                                     VGPR system registers used
1971                                                     for the work-item ID.
1972                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1973                                                     defines the values.
1974
1975                                                     Used by CP to set up
1976                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1977     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1978
1979                                                     Wavefront starts execution
1980                                                     with address watch
1981                                                     exceptions enabled which
1982                                                     are generated when L1 has
1983                                                     witnessed a thread access
1984                                                     an *address of
1985                                                     interest*.
1986
1987                                                     CP is responsible for
1988                                                     filling in the address
1989                                                     watch bit in
1990                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1991                                                     according to what the
1992                                                     runtime requests.
1993     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1994
1995                                                     Wavefront starts execution
1996                                                     with memory violation
1997                                                     exceptions exceptions
1998                                                     enabled which are generated
1999                                                     when a memory violation has
2000                                                     occurred for this wavefront from
2001                                                     L1 or LDS
2002                                                     (write-to-read-only-memory,
2003                                                     mis-aligned atomic, LDS
2004                                                     address out of range,
2005                                                     illegal address, etc.).
2006
2007                                                     CP sets the memory
2008                                                     violation bit in
2009                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2010                                                     according to what the
2011                                                     runtime requests.
2012     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
2013
2014                                                     CP uses the rounded value
2015                                                     from the dispatch packet,
2016                                                     not this value, as the
2017                                                     dispatch may contain
2018                                                     dynamically allocated group
2019                                                     segment memory. CP writes
2020                                                     directly to
2021                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
2022
2023                                                     Amount of group segment
2024                                                     (LDS) to allocate for each
2025                                                     work-group. Granularity is
2026                                                     device specific:
2027
2028                                                     GFX6:
2029                                                       roundup(lds-size / (64 * 4))
2030                                                     GFX7-GFX9:
2031                                                       roundup(lds-size / (128 * 4))
2032
2033     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
2034                     _INVALID_OPERATION              with specified exceptions
2035                                                     enabled.
2036
2037                                                     Used by CP to set up
2038                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
2039                                                     (set from bits 0..6).
2040
2041                                                     IEEE 754 FP Invalid
2042                                                     Operation
2043     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
2044                     _SOURCE                         input operands is a
2045                                                     denormal number
2046     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
2047                     _DIVISION_BY_ZERO               Zero
2048     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
2049                     _OVERFLOW
2050     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
2051                     _UNDERFLOW
2052     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
2053                     _INEXACT
2054     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
2055                     _ZERO                           (rcp_iflag_f32 instruction
2056                                                     only)
2057     31      1 bit                                   Reserved, must be 0.
2058     32      **Total size 4 bytes.**
2059     ======= ===================================================================================================================
2060
2061..
2062
2063  .. table:: Floating Point Rounding Mode Enumeration Values
2064     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
2065
2066     ====================================== ===== ==============================
2067     Enumeration Name                       Value Description
2068     ====================================== ===== ==============================
2069     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
2070     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
2071     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
2072     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
2073     ====================================== ===== ==============================
2074
2075..
2076
2077  .. table:: Floating Point Denorm Mode Enumeration Values
2078     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2079
2080     ====================================== ===== ==============================
2081     Enumeration Name                       Value Description
2082     ====================================== ===== ==============================
2083     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
2084                                                  Denorms
2085     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
2086     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
2087     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
2088     ====================================== ===== ==============================
2089
2090..
2091
2092  .. table:: System VGPR Work-Item ID Enumeration Values
2093     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2094
2095     ======================================== ===== ============================
2096     Enumeration Name                         Value Description
2097     ======================================== ===== ============================
2098     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
2099                                                    ID.
2100     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
2101                                                    dimensions ID.
2102     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
2103                                                    dimensions ID.
2104     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
2105     ======================================== ===== ============================
2106
2107.. _amdgpu-amdhsa-initial-kernel-execution-state:
2108
2109Initial Kernel Execution State
2110~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2111
2112This section defines the register state that will be set up by the packet
2113processor prior to the start of execution of every wavefront. This is limited by
2114the constraints of the hardware controllers of CP/ADC/SPI.
2115
2116The order of the SGPR registers is defined, but the compiler can specify which
2117ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2118fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2119for enabled registers are dense starting at SGPR0: the first enabled register is
2120SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2121an SGPR number.
2122
2123The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2124all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2125the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2126initialized. These are then immediately followed by the System SGPRs that are
2127set up by ADC/SPI and can have different values for each wavefront of the grid
2128dispatch.
2129
2130SGPR register initial state is defined in
2131:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2132
2133  .. table:: SGPR Register Set Up Order
2134     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2135
2136     ========== ========================== ====== ==============================
2137     SGPR Order Name                       Number Description
2138                (kernel descriptor enable  of
2139                field)                     SGPRs
2140     ========== ========================== ====== ==============================
2141     First      Private Segment Buffer     4      V# that can be used, together
2142                (enable_sgpr_private              with Scratch Wavefront Offset
2143                _segment_buffer)                  as an offset, to access the
2144                                                  private memory space using a
2145                                                  segment address.
2146
2147                                                  CP uses the value provided by
2148                                                  the runtime.
2149     then       Dispatch Ptr               2      64 bit address of AQL dispatch
2150                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2151                                                  actually executing.
2152     then       Queue Ptr                  2      64 bit address of amd_queue_t
2153                (enable_sgpr_queue_ptr)           object for AQL queue on which
2154                                                  the dispatch packet was
2155                                                  queued.
2156     then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2157                (enable_sgpr_kernarg              segment. This is directly
2158                _segment_ptr)                     copied from the
2159                                                  kernarg_address in the kernel
2160                                                  dispatch packet.
2161
2162                                                  Having CP load it once avoids
2163                                                  loading it at the beginning of
2164                                                  every wavefront.
2165     then       Dispatch Id                2      64 bit Dispatch ID of the
2166                (enable_sgpr_dispatch_id)         dispatch packet being
2167                                                  executed.
2168     then       Flat Scratch Init          2      This is 2 SGPRs:
2169                (enable_sgpr_flat_scratch
2170                _init)                            GFX6
2171                                                    Not supported.
2172                                                  GFX7-GFX8
2173                                                    The first SGPR is a 32 bit
2174                                                    byte offset from
2175                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2176                                                    to per SPI base of memory
2177                                                    for scratch for the queue
2178                                                    executing the kernel
2179                                                    dispatch. CP obtains this
2180                                                    from the runtime. (The
2181                                                    Scratch Segment Buffer base
2182                                                    address is
2183                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2184                                                    plus this offset.) The value
2185                                                    of Scratch Wavefront Offset must
2186                                                    be added to this offset by
2187                                                    the kernel machine code,
2188                                                    right shifted by 8, and
2189                                                    moved to the FLAT_SCRATCH_HI
2190                                                    SGPR register.
2191                                                    FLAT_SCRATCH_HI corresponds
2192                                                    to SGPRn-4 on GFX7, and
2193                                                    SGPRn-6 on GFX8 (where SGPRn
2194                                                    is the highest numbered SGPR
2195                                                    allocated to the wavefront).
2196                                                    FLAT_SCRATCH_HI is
2197                                                    multiplied by 256 (as it is
2198                                                    in units of 256 bytes) and
2199                                                    added to
2200                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2201                                                    to calculate the per wavefront
2202                                                    FLAT SCRATCH BASE in flat
2203                                                    memory instructions that
2204                                                    access the scratch
2205                                                    apperture.
2206
2207                                                    The second SGPR is 32 bit
2208                                                    byte size of a single
2209                                                    work-item's scratch memory
2210                                                    usage. CP obtains this from
2211                                                    the runtime, and it is
2212                                                    always a multiple of DWORD.
2213                                                    CP checks that the value in
2214                                                    the kernel dispatch packet
2215                                                    Private Segment Byte Size is
2216                                                    not larger, and requests the
2217                                                    runtime to increase the
2218                                                    queue's scratch size if
2219                                                    necessary. The kernel code
2220                                                    must move it to
2221                                                    FLAT_SCRATCH_LO which is
2222                                                    SGPRn-3 on GFX7 and SGPRn-5
2223                                                    on GFX8. FLAT_SCRATCH_LO is
2224                                                    used as the FLAT SCRATCH
2225                                                    SIZE in flat memory
2226                                                    instructions. Having CP load
2227                                                    it once avoids loading it at
2228                                                    the beginning of every
2229                                                    wavefront.
2230                                                  GFX9
2231                                                    This is the
2232                                                    64 bit base address of the
2233                                                    per SPI scratch backing
2234                                                    memory managed by SPI for
2235                                                    the queue executing the
2236                                                    kernel dispatch. CP obtains
2237                                                    this from the runtime (and
2238                                                    divides it if there are
2239                                                    multiple Shader Arrays each
2240                                                    with its own SPI). The value
2241                                                    of Scratch Wavefront Offset must
2242                                                    be added by the kernel
2243                                                    machine code and the result
2244                                                    moved to the FLAT_SCRATCH
2245                                                    SGPR which is SGPRn-6 and
2246                                                    SGPRn-5. It is used as the
2247                                                    FLAT SCRATCH BASE in flat
2248                                                    memory instructions.
2249     then       Private Segment Size       1      The 32 bit byte size of a
2250                                                  (enable_sgpr_private single
2251                                                  work-item's
2252                                                  scratch_segment_size) memory
2253                                                  allocation. This is the
2254                                                  value from the kernel
2255                                                  dispatch packet Private
2256                                                  Segment Byte Size rounded up
2257                                                  by CP to a multiple of
2258                                                  DWORD.
2259
2260                                                  Having CP load it once avoids
2261                                                  loading it at the beginning of
2262                                                  every wavefront.
2263
2264                                                  This is not used for
2265                                                  GFX7-GFX8 since it is the same
2266                                                  value as the second SGPR of
2267                                                  Flat Scratch Init. However, it
2268                                                  may be needed for GFX9 which
2269                                                  changes the meaning of the
2270                                                  Flat Scratch Init value.
2271     then       Grid Work-Group Count X    1      32 bit count of the number of
2272                (enable_sgpr_grid                 work-groups in the X dimension
2273                _workgroup_count_X)               for the grid being
2274                                                  executed. Computed from the
2275                                                  fields in the kernel dispatch
2276                                                  packet as ((grid_size.x +
2277                                                  workgroup_size.x - 1) /
2278                                                  workgroup_size.x).
2279     then       Grid Work-Group Count Y    1      32 bit count of the number of
2280                (enable_sgpr_grid                 work-groups in the Y dimension
2281                _workgroup_count_Y &&             for the grid being
2282                less than 16 previous             executed. Computed from the
2283                SGPRs)                            fields in the kernel dispatch
2284                                                  packet as ((grid_size.y +
2285                                                  workgroup_size.y - 1) /
2286                                                  workgroupSize.y).
2287
2288                                                  Only initialized if <16
2289                                                  previous SGPRs initialized.
2290     then       Grid Work-Group Count Z    1      32 bit count of the number of
2291                (enable_sgpr_grid                 work-groups in the Z dimension
2292                _workgroup_count_Z &&             for the grid being
2293                less than 16 previous             executed. Computed from the
2294                SGPRs)                            fields in the kernel dispatch
2295                                                  packet as ((grid_size.z +
2296                                                  workgroup_size.z - 1) /
2297                                                  workgroupSize.z).
2298
2299                                                  Only initialized if <16
2300                                                  previous SGPRs initialized.
2301     then       Work-Group Id X            1      32 bit work-group id in X
2302                (enable_sgpr_workgroup_id         dimension of grid for
2303                _X)                               wavefront.
2304     then       Work-Group Id Y            1      32 bit work-group id in Y
2305                (enable_sgpr_workgroup_id         dimension of grid for
2306                _Y)                               wavefront.
2307     then       Work-Group Id Z            1      32 bit work-group id in Z
2308                (enable_sgpr_workgroup_id         dimension of grid for
2309                _Z)                               wavefront.
2310     then       Work-Group Info            1      {first_wavefront, 14'b0000,
2311                (enable_sgpr_workgroup            ordered_append_term[10:0],
2312                _info)                            threadgroup_size_in_wavefronts[5:0]}
2313     then       Scratch Wavefront Offset   1      32 bit byte offset from base
2314                (enable_sgpr_private              of scratch base of queue
2315                _segment_wavefront_offset)        executing the kernel
2316                                                  dispatch. Must be used as an
2317                                                  offset with Private
2318                                                  segment address when using
2319                                                  Scratch Segment Buffer. It
2320                                                  must be used to set up FLAT
2321                                                  SCRATCH for flat addressing
2322                                                  (see
2323                                                  :ref:`amdgpu-amdhsa-flat-scratch`).
2324     ========== ========================== ====== ==============================
2325
2326The order of the VGPR registers is defined, but the compiler can specify which
2327ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2328fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2329for enabled registers are dense starting at VGPR0: the first enabled register is
2330VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2331VGPR number.
2332
2333VGPR register initial state is defined in
2334:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2335
2336  .. table:: VGPR Register Set Up Order
2337     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2338
2339     ========== ========================== ====== ==============================
2340     VGPR Order Name                       Number Description
2341                (kernel descriptor enable  of
2342                field)                     VGPRs
2343     ========== ========================== ====== ==============================
2344     First      Work-Item Id X             1      32 bit work item id in X
2345                (Always initialized)              dimension of work-group for
2346                                                  wavefront lane.
2347     then       Work-Item Id Y             1      32 bit work item id in Y
2348                (enable_vgpr_workitem_id          dimension of work-group for
2349                > 0)                              wavefront lane.
2350     then       Work-Item Id Z             1      32 bit work item id in Z
2351                (enable_vgpr_workitem_id          dimension of work-group for
2352                > 1)                              wavefront lane.
2353     ========== ========================== ====== ==============================
2354
2355The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
2356
23571. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2358   registers.
23592. Work-group Id registers X, Y, Z are set by ADC which supports any
2360   combination including none.
23613. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2362   its value cannot included with the flat scratch init value which is per queue.
23634. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2364   or (X, Y, Z).
2365
2366Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2367value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2368
2369The global segment can be accessed either using buffer instructions (GFX6 which
2370has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2371instructions (GFX9).
2372
2373If buffer operations are used then the compiler can generate a V# with the
2374following properties:
2375
2376* base address of 0
2377* no swizzle
2378* ATC: 1 if IOMMU present (such as APU)
2379* ptr64: 1
2380* MTYPE set to support memory coherence that matches the runtime (such as CC for
2381  APU and NC for dGPU).
2382
2383.. _amdgpu-amdhsa-kernel-prolog:
2384
2385Kernel Prolog
2386~~~~~~~~~~~~~
2387
2388.. _amdgpu-amdhsa-m0:
2389
2390M0
2391++
2392
2393GFX6-GFX8
2394  The M0 register must be initialized with a value at least the total LDS size
2395  if the kernel may access LDS via DS or flat operations. Total LDS size is
2396  available in dispatch packet. For M0, it is also possible to use maximum
2397  possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2398  GFX7-GFX8).
2399GFX9
2400  The M0 register is not used for range checking LDS accesses and so does not
2401  need to be initialized in the prolog.
2402
2403.. _amdgpu-amdhsa-flat-scratch:
2404
2405Flat Scratch
2406++++++++++++
2407
2408If the kernel may use flat operations to access scratch memory, the prolog code
2409must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2410are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
2411Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2412
2413GFX6
2414  Flat scratch is not supported.
2415
2416GFX7-GFX8
2417  1. The low word of Flat Scratch Init is 32 bit byte offset from
2418     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2419     being managed by SPI for the queue executing the kernel dispatch. This is
2420     the same value used in the Scratch Segment Buffer V# base address. The
2421     prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
2422     scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2423     FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2424     by 8 before moving into FLAT_SCRATCH_LO.
2425  2. The second word of Flat Scratch Init is 32 bit byte size of a single
2426     work-items scratch memory usage. This is directly loaded from the kernel
2427     dispatch packet Private Segment Byte Size and rounded up to a multiple of
2428     DWORD. Having CP load it once avoids loading it at the beginning of every
2429     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2430     SIZE.
2431
2432GFX9
2433  The Flat Scratch Init is the 64 bit address of the base of scratch backing
2434  memory being managed by SPI for the queue executing the kernel dispatch. The
2435  prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
2436  pair for use as the flat scratch base in flat memory instructions.
2437
2438.. _amdgpu-amdhsa-memory-model:
2439
2440Memory Model
2441~~~~~~~~~~~~
2442
2443This section describes the mapping of LLVM memory model onto AMDGPU machine code
2444(see :ref:`memmodel`). *The implementation is WIP.*
2445
2446.. TODO
2447   Update when implementation complete.
2448
2449The AMDGPU backend supports the memory synchronization scopes specified in
2450:ref:`amdgpu-memory-scopes`.
2451
2452The code sequences used to implement the memory model are defined in table
2453:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2454
2455The sequences specify the order of instructions that a single thread must
2456execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2457to other memory instructions executed by the same thread. This allows them to be
2458moved earlier or later which can allow them to be combined with other instances
2459of the same instruction, or hoisted/sunk out of loops to improve
2460performance. Only the instructions related to the memory model are given;
2461additional ``s_waitcnt`` instructions are required to ensure registers are
2462defined before being used. These may be able to be combined with the memory
2463model ``s_waitcnt`` instructions as described above.
2464
2465The AMDGPU backend supports the following memory models:
2466
2467  HSA Memory Model [HSA]_
2468    The HSA memory model uses a single happens-before relation for all address
2469    spaces (see :ref:`amdgpu-address-spaces`).
2470  OpenCL Memory Model [OpenCL]_
2471    The OpenCL memory model which has separate happens-before relations for the
2472    global and local address spaces. Only a fence specifying both global and
2473    local address space, and seq_cst instructions join the relationships. Since
2474    the LLVM ``memfence`` instruction does not allow an address space to be
2475    specified the OpenCL fence has to convervatively assume both local and
2476    global address space was specified. However, optimizations can often be
2477    done to eliminate the additional ``s_waitcnt`` instructions when there are
2478    no intervening memory instructions which access the corresponding address
2479    space. The code sequences in the table indicate what can be omitted for the
2480    OpenCL memory. The target triple environment is used to determine if the
2481    source language is OpenCL (see :ref:`amdgpu-opencl`).
2482
2483``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2484operations.
2485
2486``buffer/global/flat_load/store/atomic`` instructions to global memory are
2487termed vector memory operations.
2488
2489For GFX6-GFX9:
2490
2491* Each agent has multiple compute units (CU).
2492* Each CU has multiple SIMDs that execute wavefronts.
2493* The wavefronts for a single work-group are executed in the same CU but may be
2494  executed by different SIMDs.
2495* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2496  executing on it.
2497* All LDS operations of a CU are performed as wavefront wide operations in a
2498  global order and involve no caching. Completion is reported to a wavefront in
2499  execution order.
2500* The LDS memory has multiple request queues shared by the SIMDs of a
2501  CU. Therefore, the LDS operations performed by different wavefronts of a work-group
2502  can be reordered relative to each other, which can result in reordering the
2503  visibility of vector memory operations with respect to LDS operations of other
2504  wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2505  ensure synchronization between LDS operations and vector memory operations
2506  between wavefronts of a work-group, but not between operations performed by the
2507  same wavefront.
2508* The vector memory operations are performed as wavefront wide operations and
2509  completion is reported to a wavefront in execution order. The exception is
2510  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2511  vector memory order if they access LDS memory, and out of LDS operation order
2512  if they access global memory.
2513* The vector memory operations access a single vector L1 cache shared by all
2514  SIMDs a CU. Therefore, no special action is required for coherence between the
2515  lanes of a single wavefront, or for coherence between wavefronts in the same
2516  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
2517  executing in different work-groups as they may be executing on different CUs.
2518* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2519  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2520  scalar operations are used in a restricted way so do not impact the memory
2521  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2522* The vector and scalar memory operations use an L2 cache shared by all CUs on
2523  the same agent.
2524* The L2 cache has independent channels to service disjoint ranges of virtual
2525  addresses.
2526* Each CU has a separate request queue per channel. Therefore, the vector and
2527  scalar memory operations performed by wavefronts executing in different work-groups
2528  (which may be executing on different CUs) of an agent can be reordered
2529  relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2530  synchronization between vector memory operations of different CUs. It ensures a
2531  previous vector memory operation has completed before executing a subsequent
2532  vector memory or LDS operation and so can be used to meet the requirements of
2533  acquire and release.
2534* The L2 cache can be kept coherent with other agents on some targets, or ranges
2535  of virtual addresses can be set up to bypass it to ensure system coherence.
2536
2537Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2538or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2539memory, atomic memory orderings are not meaningful and all accesses are treated
2540as non-atomic.
2541
2542Constant address space uses ``buffer/global_load`` instructions (or equivalent
2543scalar memory instructions). Since the constant address space contents do not
2544change during the execution of a kernel dispatch it is not legal to perform
2545stores, and atomic memory orderings are not meaningful and all access are
2546treated as non-atomic.
2547
2548A memory synchronization scope wider than work-group is not meaningful for the
2549group (LDS) address space and is treated as work-group.
2550
2551The memory model does not support the region address space which is treated as
2552non-atomic.
2553
2554Acquire memory ordering is not meaningful on store atomic instructions and is
2555treated as non-atomic.
2556
2557Release memory ordering is not meaningful on load atomic instructions and is
2558treated a non-atomic.
2559
2560Acquire-release memory ordering is not meaningful on load or store atomic
2561instructions and is treated as acquire and release respectively.
2562
2563AMDGPU backend only uses scalar memory operations to access memory that is
2564proven to not change during the execution of the kernel dispatch. This includes
2565constant address space and global address space for program scope const
2566variables. Therefore the kernel machine code does not have to maintain the
2567scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2568and vector L1 caches are invalidated between kernel dispatches by CP since
2569constant address space data may change between kernel dispatch executions. See
2570:ref:`amdgpu-amdhsa-memory-spaces`.
2571
2572The one execption is if scalar writes are used to spill SGPR registers. In this
2573case the AMDGPU backend ensures the memory location used to spill is never
2574accessed by vector memory operations at the same time. If scalar writes are used
2575then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2576return since the locations may be used for vector memory instructions by a
2577future wavefront that uses the same scratch area, or a function call that creates a
2578frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2579as all scalar writes are write-before-read in the same thread.
2580
2581Scratch backing memory (which is used for the private address space)
2582is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2583address space is only accessed by a single thread, and is always
2584write-before-read, there is never a need to invalidate these entries from the L1
2585cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2586volatile cache lines.
2587
2588On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2589to invalidate the L2 cache. This also causes it to be treated as
2590non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2591(cache coherent) and so the L2 cache will coherent with the CPU and other
2592agents.
2593
2594  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2595     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2596
2597     ============ ============ ============== ========== ===============================
2598     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2599                  Ordering     Sync Scope     Address
2600                                              Space
2601     ============ ============ ============== ========== ===============================
2602     **Non-Atomic**
2603     -----------------------------------------------------------------------------------
2604     load         *none*       *none*         - global   - !volatile & !nontemporal
2605                                              - generic
2606                                              - private    1. buffer/global/flat_load
2607                                              - constant
2608                                                         - volatile & !nontemporal
2609
2610                                                           1. buffer/global/flat_load
2611                                                              glc=1
2612
2613                                                         - nontemporal
2614
2615                                                           1. buffer/global/flat_load
2616                                                              glc=1 slc=1
2617
2618     load         *none*       *none*         - local    1. ds_load
2619     store        *none*       *none*         - global   - !nontemporal
2620                                              - generic
2621                                              - private    1. buffer/global/flat_store
2622                                              - constant
2623                                                         - nontemporal
2624
2625                                                           1. buffer/global/flat_stote
2626                                                              glc=1 slc=1
2627
2628     store        *none*       *none*         - local    1. ds_store
2629     **Unordered Atomic**
2630     -----------------------------------------------------------------------------------
2631     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2632     store atomic unordered    *any*          *any*      *Same as non-atomic*.
2633     atomicrmw    unordered    *any*          *any*      *Same as monotonic
2634                                                         atomic*.
2635     **Monotonic Atomic**
2636     -----------------------------------------------------------------------------------
2637     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2638                               - wavefront    - generic
2639                               - workgroup
2640     load atomic  monotonic    - singlethread - local    1. ds_load
2641                               - wavefront
2642                               - workgroup
2643     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2644                               - system       - generic     glc=1
2645     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2646                               - wavefront    - generic
2647                               - workgroup
2648                               - agent
2649                               - system
2650     store atomic monotonic    - singlethread - local    1. ds_store
2651                               - wavefront
2652                               - workgroup
2653     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2654                               - wavefront    - generic
2655                               - workgroup
2656                               - agent
2657                               - system
2658     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2659                               - wavefront
2660                               - workgroup
2661     **Acquire Atomic**
2662     -----------------------------------------------------------------------------------
2663     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2664                               - wavefront    - local
2665                                              - generic
2666     load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2667     load atomic  acquire      - workgroup    - local    1. ds_load
2668                                                         2. s_waitcnt lgkmcnt(0)
2669
2670                                                           - If OpenCL, omit.
2671                                                           - Must happen before
2672                                                             any following
2673                                                             global/generic
2674                                                             load/load
2675                                                             atomic/store/store
2676                                                             atomic/atomicrmw.
2677                                                           - Ensures any
2678                                                             following global
2679                                                             data read is no
2680                                                             older than the load
2681                                                             atomic value being
2682                                                             acquired.
2683     load atomic  acquire      - workgroup    - generic  1. flat_load
2684                                                         2. s_waitcnt lgkmcnt(0)
2685
2686                                                           - If OpenCL, omit.
2687                                                           - Must happen before
2688                                                             any following
2689                                                             global/generic
2690                                                             load/load
2691                                                             atomic/store/store
2692                                                             atomic/atomicrmw.
2693                                                           - Ensures any
2694                                                             following global
2695                                                             data read is no
2696                                                             older than the load
2697                                                             atomic value being
2698                                                             acquired.
2699     load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2700                               - system                     glc=1
2701                                                         2. s_waitcnt vmcnt(0)
2702
2703                                                           - Must happen before
2704                                                             following
2705                                                             buffer_wbinvl1_vol.
2706                                                           - Ensures the load
2707                                                             has completed
2708                                                             before invalidating
2709                                                             the cache.
2710
2711                                                         3. buffer_wbinvl1_vol
2712
2713                                                           - Must happen before
2714                                                             any following
2715                                                             global/generic
2716                                                             load/load
2717                                                             atomic/atomicrmw.
2718                                                           - Ensures that
2719                                                             following
2720                                                             loads will not see
2721                                                             stale global data.
2722
2723     load atomic  acquire      - agent        - generic  1. flat_load glc=1
2724                               - system                  2. s_waitcnt vmcnt(0) &
2725                                                            lgkmcnt(0)
2726
2727                                                           - If OpenCL omit
2728                                                             lgkmcnt(0).
2729                                                           - Must happen before
2730                                                             following
2731                                                             buffer_wbinvl1_vol.
2732                                                           - Ensures the flat_load
2733                                                             has completed
2734                                                             before invalidating
2735                                                             the cache.
2736
2737                                                         3. buffer_wbinvl1_vol
2738
2739                                                           - Must happen before
2740                                                             any following
2741                                                             global/generic
2742                                                             load/load
2743                                                             atomic/atomicrmw.
2744                                                           - Ensures that
2745                                                             following loads
2746                                                             will not see stale
2747                                                             global data.
2748
2749     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2750                               - wavefront    - local
2751                                              - generic
2752     atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2753     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2754                                                         2. waitcnt lgkmcnt(0)
2755
2756                                                           - If OpenCL, omit.
2757                                                           - Must happen before
2758                                                             any following
2759                                                             global/generic
2760                                                             load/load
2761                                                             atomic/store/store
2762                                                             atomic/atomicrmw.
2763                                                           - Ensures any
2764                                                             following global
2765                                                             data read is no
2766                                                             older than the
2767                                                             atomicrmw value
2768                                                             being acquired.
2769
2770     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2771                                                         2. waitcnt lgkmcnt(0)
2772
2773                                                           - If OpenCL, omit.
2774                                                           - Must happen before
2775                                                             any following
2776                                                             global/generic
2777                                                             load/load
2778                                                             atomic/store/store
2779                                                             atomic/atomicrmw.
2780                                                           - Ensures any
2781                                                             following global
2782                                                             data read is no
2783                                                             older than the
2784                                                             atomicrmw value
2785                                                             being acquired.
2786
2787     atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2788                               - system                  2. s_waitcnt vmcnt(0)
2789
2790                                                           - Must happen before
2791                                                             following
2792                                                             buffer_wbinvl1_vol.
2793                                                           - Ensures the
2794                                                             atomicrmw has
2795                                                             completed before
2796                                                             invalidating the
2797                                                             cache.
2798
2799                                                         3. buffer_wbinvl1_vol
2800
2801                                                           - Must happen before
2802                                                             any following
2803                                                             global/generic
2804                                                             load/load
2805                                                             atomic/atomicrmw.
2806                                                           - Ensures that
2807                                                             following loads
2808                                                             will not see stale
2809                                                             global data.
2810
2811     atomicrmw    acquire      - agent        - generic  1. flat_atomic
2812                               - system                  2. s_waitcnt vmcnt(0) &
2813                                                            lgkmcnt(0)
2814
2815                                                           - If OpenCL, omit
2816                                                             lgkmcnt(0).
2817                                                           - Must happen before
2818                                                             following
2819                                                             buffer_wbinvl1_vol.
2820                                                           - Ensures the
2821                                                             atomicrmw has
2822                                                             completed before
2823                                                             invalidating the
2824                                                             cache.
2825
2826                                                         3. buffer_wbinvl1_vol
2827
2828                                                           - Must happen before
2829                                                             any following
2830                                                             global/generic
2831                                                             load/load
2832                                                             atomic/atomicrmw.
2833                                                           - Ensures that
2834                                                             following loads
2835                                                             will not see stale
2836                                                             global data.
2837
2838     fence        acquire      - singlethread *none*     *none*
2839                               - wavefront
2840     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2841
2842                                                           - If OpenCL and
2843                                                             address space is
2844                                                             not generic, omit.
2845                                                           - However, since LLVM
2846                                                             currently has no
2847                                                             address space on
2848                                                             the fence need to
2849                                                             conservatively
2850                                                             always generate. If
2851                                                             fence had an
2852                                                             address space then
2853                                                             set to address
2854                                                             space of OpenCL
2855                                                             fence flag, or to
2856                                                             generic if both
2857                                                             local and global
2858                                                             flags are
2859                                                             specified.
2860                                                           - Must happen after
2861                                                             any preceding
2862                                                             local/generic load
2863                                                             atomic/atomicrmw
2864                                                             with an equal or
2865                                                             wider sync scope
2866                                                             and memory ordering
2867                                                             stronger than
2868                                                             unordered (this is
2869                                                             termed the
2870                                                             fence-paired-atomic).
2871                                                           - Must happen before
2872                                                             any following
2873                                                             global/generic
2874                                                             load/load
2875                                                             atomic/store/store
2876                                                             atomic/atomicrmw.
2877                                                           - Ensures any
2878                                                             following global
2879                                                             data read is no
2880                                                             older than the
2881                                                             value read by the
2882                                                             fence-paired-atomic.
2883
2884     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2885                               - system                     vmcnt(0)
2886
2887                                                           - If OpenCL and
2888                                                             address space is
2889                                                             not generic, omit
2890                                                             lgkmcnt(0).
2891                                                           - However, since LLVM
2892                                                             currently has no
2893                                                             address space on
2894                                                             the fence need to
2895                                                             conservatively
2896                                                             always generate
2897                                                             (see comment for
2898                                                             previous fence).
2899                                                           - Could be split into
2900                                                             separate s_waitcnt
2901                                                             vmcnt(0) and
2902                                                             s_waitcnt
2903                                                             lgkmcnt(0) to allow
2904                                                             them to be
2905                                                             independently moved
2906                                                             according to the
2907                                                             following rules.
2908                                                           - s_waitcnt vmcnt(0)
2909                                                             must happen after
2910                                                             any preceding
2911                                                             global/generic load
2912                                                             atomic/atomicrmw
2913                                                             with an equal or
2914                                                             wider sync scope
2915                                                             and memory ordering
2916                                                             stronger than
2917                                                             unordered (this is
2918                                                             termed the
2919                                                             fence-paired-atomic).
2920                                                           - s_waitcnt lgkmcnt(0)
2921                                                             must happen after
2922                                                             any preceding
2923                                                             local/generic load
2924                                                             atomic/atomicrmw
2925                                                             with an equal or
2926                                                             wider sync scope
2927                                                             and memory ordering
2928                                                             stronger than
2929                                                             unordered (this is
2930                                                             termed the
2931                                                             fence-paired-atomic).
2932                                                           - Must happen before
2933                                                             the following
2934                                                             buffer_wbinvl1_vol.
2935                                                           - Ensures that the
2936                                                             fence-paired atomic
2937                                                             has completed
2938                                                             before invalidating
2939                                                             the
2940                                                             cache. Therefore
2941                                                             any following
2942                                                             locations read must
2943                                                             be no older than
2944                                                             the value read by
2945                                                             the
2946                                                             fence-paired-atomic.
2947
2948                                                         2. buffer_wbinvl1_vol
2949
2950                                                           - Must happen before any
2951                                                             following global/generic
2952                                                             load/load
2953                                                             atomic/store/store
2954                                                             atomic/atomicrmw.
2955                                                           - Ensures that
2956                                                             following loads
2957                                                             will not see stale
2958                                                             global data.
2959
2960     **Release Atomic**
2961     -----------------------------------------------------------------------------------
2962     store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2963                               - wavefront    - local
2964                                              - generic
2965     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2966
2967                                                           - If OpenCL, omit.
2968                                                           - Must happen after
2969                                                             any preceding
2970                                                             local/generic
2971                                                             load/store/load
2972                                                             atomic/store
2973                                                             atomic/atomicrmw.
2974                                                           - Must happen before
2975                                                             the following
2976                                                             store.
2977                                                           - Ensures that all
2978                                                             memory operations
2979                                                             to local have
2980                                                             completed before
2981                                                             performing the
2982                                                             store that is being
2983                                                             released.
2984
2985                                                         2. buffer/global/flat_store
2986     store atomic release      - workgroup    - local    1. ds_store
2987     store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2988
2989                                                           - If OpenCL, omit.
2990                                                           - Must happen after
2991                                                             any preceding
2992                                                             local/generic
2993                                                             load/store/load
2994                                                             atomic/store
2995                                                             atomic/atomicrmw.
2996                                                           - Must happen before
2997                                                             the following
2998                                                             store.
2999                                                           - Ensures that all
3000                                                             memory operations
3001                                                             to local have
3002                                                             completed before
3003                                                             performing the
3004                                                             store that is being
3005                                                             released.
3006
3007                                                         2. flat_store
3008     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3009                               - system       - generic     vmcnt(0)
3010
3011                                                           - If OpenCL, omit
3012                                                             lgkmcnt(0).
3013                                                           - Could be split into
3014                                                             separate s_waitcnt
3015                                                             vmcnt(0) and
3016                                                             s_waitcnt
3017                                                             lgkmcnt(0) to allow
3018                                                             them to be
3019                                                             independently moved
3020                                                             according to the
3021                                                             following rules.
3022                                                           - s_waitcnt vmcnt(0)
3023                                                             must happen after
3024                                                             any preceding
3025                                                             global/generic
3026                                                             load/store/load
3027                                                             atomic/store
3028                                                             atomic/atomicrmw.
3029                                                           - s_waitcnt lgkmcnt(0)
3030                                                             must happen after
3031                                                             any preceding
3032                                                             local/generic
3033                                                             load/store/load
3034                                                             atomic/store
3035                                                             atomic/atomicrmw.
3036                                                           - Must happen before
3037                                                             the following
3038                                                             store.
3039                                                           - Ensures that all
3040                                                             memory operations
3041                                                             to memory have
3042                                                             completed before
3043                                                             performing the
3044                                                             store that is being
3045                                                             released.
3046
3047                                                         2. buffer/global/ds/flat_store
3048     atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
3049                               - wavefront    - local
3050                                              - generic
3051     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3052
3053                                                           - If OpenCL, omit.
3054                                                           - Must happen after
3055                                                             any preceding
3056                                                             local/generic
3057                                                             load/store/load
3058                                                             atomic/store
3059                                                             atomic/atomicrmw.
3060                                                           - Must happen before
3061                                                             the following
3062                                                             atomicrmw.
3063                                                           - Ensures that all
3064                                                             memory operations
3065                                                             to local have
3066                                                             completed before
3067                                                             performing the
3068                                                             atomicrmw that is
3069                                                             being released.
3070
3071                                                         2. buffer/global/flat_atomic
3072     atomicrmw    release      - workgroup    - local    1. ds_atomic
3073     atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3074
3075                                                           - If OpenCL, omit.
3076                                                           - Must happen after
3077                                                             any preceding
3078                                                             local/generic
3079                                                             load/store/load
3080                                                             atomic/store
3081                                                             atomic/atomicrmw.
3082                                                           - Must happen before
3083                                                             the following
3084                                                             atomicrmw.
3085                                                           - Ensures that all
3086                                                             memory operations
3087                                                             to local have
3088                                                             completed before
3089                                                             performing the
3090                                                             atomicrmw that is
3091                                                             being released.
3092
3093                                                         2. flat_atomic
3094     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3095                               - system       - generic     vmcnt(0)
3096
3097                                                           - If OpenCL, omit
3098                                                             lgkmcnt(0).
3099                                                           - Could be split into
3100                                                             separate s_waitcnt
3101                                                             vmcnt(0) and
3102                                                             s_waitcnt
3103                                                             lgkmcnt(0) to allow
3104                                                             them to be
3105                                                             independently moved
3106                                                             according to the
3107                                                             following rules.
3108                                                           - s_waitcnt vmcnt(0)
3109                                                             must happen after
3110                                                             any preceding
3111                                                             global/generic
3112                                                             load/store/load
3113                                                             atomic/store
3114                                                             atomic/atomicrmw.
3115                                                           - s_waitcnt lgkmcnt(0)
3116                                                             must happen after
3117                                                             any preceding
3118                                                             local/generic
3119                                                             load/store/load
3120                                                             atomic/store
3121                                                             atomic/atomicrmw.
3122                                                           - Must happen before
3123                                                             the following
3124                                                             atomicrmw.
3125                                                           - Ensures that all
3126                                                             memory operations
3127                                                             to global and local
3128                                                             have completed
3129                                                             before performing
3130                                                             the atomicrmw that
3131                                                             is being released.
3132
3133                                                         2. buffer/global/ds/flat_atomic
3134     fence        release      - singlethread *none*     *none*
3135                               - wavefront
3136     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3137
3138                                                           - If OpenCL and
3139                                                             address space is
3140                                                             not generic, omit.
3141                                                           - However, since LLVM
3142                                                             currently has no
3143                                                             address space on
3144                                                             the fence need to
3145                                                             conservatively
3146                                                             always generate. If
3147                                                             fence had an
3148                                                             address space then
3149                                                             set to address
3150                                                             space of OpenCL
3151                                                             fence flag, or to
3152                                                             generic if both
3153                                                             local and global
3154                                                             flags are
3155                                                             specified.
3156                                                           - Must happen after
3157                                                             any preceding
3158                                                             local/generic
3159                                                             load/load
3160                                                             atomic/store/store
3161                                                             atomic/atomicrmw.
3162                                                           - Must happen before
3163                                                             any following store
3164                                                             atomic/atomicrmw
3165                                                             with an equal or
3166                                                             wider sync scope
3167                                                             and memory ordering
3168                                                             stronger than
3169                                                             unordered (this is
3170                                                             termed the
3171                                                             fence-paired-atomic).
3172                                                           - Ensures that all
3173                                                             memory operations
3174                                                             to local have
3175                                                             completed before
3176                                                             performing the
3177                                                             following
3178                                                             fence-paired-atomic.
3179
3180     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3181                               - system                     vmcnt(0)
3182
3183                                                           - If OpenCL and
3184                                                             address space is
3185                                                             not generic, omit
3186                                                             lgkmcnt(0).
3187                                                           - If OpenCL and
3188                                                             address space is
3189                                                             local, omit
3190                                                             vmcnt(0).
3191                                                           - However, since LLVM
3192                                                             currently has no
3193                                                             address space on
3194                                                             the fence need to
3195                                                             conservatively
3196                                                             always generate. If
3197                                                             fence had an
3198                                                             address space then
3199                                                             set to address
3200                                                             space of OpenCL
3201                                                             fence flag, or to
3202                                                             generic if both
3203                                                             local and global
3204                                                             flags are
3205                                                             specified.
3206                                                           - Could be split into
3207                                                             separate s_waitcnt
3208                                                             vmcnt(0) and
3209                                                             s_waitcnt
3210                                                             lgkmcnt(0) to allow
3211                                                             them to be
3212                                                             independently moved
3213                                                             according to the
3214                                                             following rules.
3215                                                           - s_waitcnt vmcnt(0)
3216                                                             must happen after
3217                                                             any preceding
3218                                                             global/generic
3219                                                             load/store/load
3220                                                             atomic/store
3221                                                             atomic/atomicrmw.
3222                                                           - s_waitcnt lgkmcnt(0)
3223                                                             must happen after
3224                                                             any preceding
3225                                                             local/generic
3226                                                             load/store/load
3227                                                             atomic/store
3228                                                             atomic/atomicrmw.
3229                                                           - Must happen before
3230                                                             any following store
3231                                                             atomic/atomicrmw
3232                                                             with an equal or
3233                                                             wider sync scope
3234                                                             and memory ordering
3235                                                             stronger than
3236                                                             unordered (this is
3237                                                             termed the
3238                                                             fence-paired-atomic).
3239                                                           - Ensures that all
3240                                                             memory operations
3241                                                             have
3242                                                             completed before
3243                                                             performing the
3244                                                             following
3245                                                             fence-paired-atomic.
3246
3247     **Acquire-Release Atomic**
3248     -----------------------------------------------------------------------------------
3249     atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3250                               - wavefront    - local
3251                                              - generic
3252     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3253
3254                                                           - If OpenCL, omit.
3255                                                           - Must happen after
3256                                                             any preceding
3257                                                             local/generic
3258                                                             load/store/load
3259                                                             atomic/store
3260                                                             atomic/atomicrmw.
3261                                                           - Must happen before
3262                                                             the following
3263                                                             atomicrmw.
3264                                                           - Ensures that all
3265                                                             memory operations
3266                                                             to local have
3267                                                             completed before
3268                                                             performing the
3269                                                             atomicrmw that is
3270                                                             being released.
3271
3272                                                         2. buffer/global/flat_atomic
3273     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3274                                                         2. s_waitcnt lgkmcnt(0)
3275
3276                                                           - If OpenCL, omit.
3277                                                           - Must happen before
3278                                                             any following
3279                                                             global/generic
3280                                                             load/load
3281                                                             atomic/store/store
3282                                                             atomic/atomicrmw.
3283                                                           - Ensures any
3284                                                             following global
3285                                                             data read is no
3286                                                             older than the load
3287                                                             atomic value being
3288                                                             acquired.
3289
3290     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3291
3292                                                           - If OpenCL, omit.
3293                                                           - Must happen after
3294                                                             any preceding
3295                                                             local/generic
3296                                                             load/store/load
3297                                                             atomic/store
3298                                                             atomic/atomicrmw.
3299                                                           - Must happen before
3300                                                             the following
3301                                                             atomicrmw.
3302                                                           - Ensures that all
3303                                                             memory operations
3304                                                             to local have
3305                                                             completed before
3306                                                             performing the
3307                                                             atomicrmw that is
3308                                                             being released.
3309
3310                                                         2. flat_atomic
3311                                                         3. s_waitcnt lgkmcnt(0)
3312
3313                                                           - If OpenCL, omit.
3314                                                           - Must happen before
3315                                                             any following
3316                                                             global/generic
3317                                                             load/load
3318                                                             atomic/store/store
3319                                                             atomic/atomicrmw.
3320                                                           - Ensures any
3321                                                             following global
3322                                                             data read is no
3323                                                             older than the load
3324                                                             atomic value being
3325                                                             acquired.
3326
3327     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3328                               - system                     vmcnt(0)
3329
3330                                                           - If OpenCL, omit
3331                                                             lgkmcnt(0).
3332                                                           - Could be split into
3333                                                             separate s_waitcnt
3334                                                             vmcnt(0) and
3335                                                             s_waitcnt
3336                                                             lgkmcnt(0) to allow
3337                                                             them to be
3338                                                             independently moved
3339                                                             according to the
3340                                                             following rules.
3341                                                           - s_waitcnt vmcnt(0)
3342                                                             must happen after
3343                                                             any preceding
3344                                                             global/generic
3345                                                             load/store/load
3346                                                             atomic/store
3347                                                             atomic/atomicrmw.
3348                                                           - s_waitcnt lgkmcnt(0)
3349                                                             must happen after
3350                                                             any preceding
3351                                                             local/generic
3352                                                             load/store/load
3353                                                             atomic/store
3354                                                             atomic/atomicrmw.
3355                                                           - Must happen before
3356                                                             the following
3357                                                             atomicrmw.
3358                                                           - Ensures that all
3359                                                             memory operations
3360                                                             to global have
3361                                                             completed before
3362                                                             performing the
3363                                                             atomicrmw that is
3364                                                             being released.
3365
3366                                                         2. buffer/global/flat_atomic
3367                                                         3. s_waitcnt vmcnt(0)
3368
3369                                                           - Must happen before
3370                                                             following
3371                                                             buffer_wbinvl1_vol.
3372                                                           - Ensures the
3373                                                             atomicrmw has
3374                                                             completed before
3375                                                             invalidating the
3376                                                             cache.
3377
3378                                                         4. buffer_wbinvl1_vol
3379
3380                                                           - Must happen before
3381                                                             any following
3382                                                             global/generic
3383                                                             load/load
3384                                                             atomic/atomicrmw.
3385                                                           - Ensures that
3386                                                             following loads
3387                                                             will not see stale
3388                                                             global data.
3389
3390     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3391                               - system                     vmcnt(0)
3392
3393                                                           - If OpenCL, omit
3394                                                             lgkmcnt(0).
3395                                                           - Could be split into
3396                                                             separate s_waitcnt
3397                                                             vmcnt(0) and
3398                                                             s_waitcnt
3399                                                             lgkmcnt(0) to allow
3400                                                             them to be
3401                                                             independently moved
3402                                                             according to the
3403                                                             following rules.
3404                                                           - s_waitcnt vmcnt(0)
3405                                                             must happen after
3406                                                             any preceding
3407                                                             global/generic
3408                                                             load/store/load
3409                                                             atomic/store
3410                                                             atomic/atomicrmw.
3411                                                           - s_waitcnt lgkmcnt(0)
3412                                                             must happen after
3413                                                             any preceding
3414                                                             local/generic
3415                                                             load/store/load
3416                                                             atomic/store
3417                                                             atomic/atomicrmw.
3418                                                           - Must happen before
3419                                                             the following
3420                                                             atomicrmw.
3421                                                           - Ensures that all
3422                                                             memory operations
3423                                                             to global have
3424                                                             completed before
3425                                                             performing the
3426                                                             atomicrmw that is
3427                                                             being released.
3428
3429                                                         2. flat_atomic
3430                                                         3. s_waitcnt vmcnt(0) &
3431                                                            lgkmcnt(0)
3432
3433                                                           - If OpenCL, omit
3434                                                             lgkmcnt(0).
3435                                                           - Must happen before
3436                                                             following
3437                                                             buffer_wbinvl1_vol.
3438                                                           - Ensures the
3439                                                             atomicrmw has
3440                                                             completed before
3441                                                             invalidating the
3442                                                             cache.
3443
3444                                                         4. buffer_wbinvl1_vol
3445
3446                                                           - Must happen before
3447                                                             any following
3448                                                             global/generic
3449                                                             load/load
3450                                                             atomic/atomicrmw.
3451                                                           - Ensures that
3452                                                             following loads
3453                                                             will not see stale
3454                                                             global data.
3455
3456     fence        acq_rel      - singlethread *none*     *none*
3457                               - wavefront
3458     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3459
3460                                                           - If OpenCL and
3461                                                             address space is
3462                                                             not generic, omit.
3463                                                           - However,
3464                                                             since LLVM
3465                                                             currently has no
3466                                                             address space on
3467                                                             the fence need to
3468                                                             conservatively
3469                                                             always generate
3470                                                             (see comment for
3471                                                             previous fence).
3472                                                           - Must happen after
3473                                                             any preceding
3474                                                             local/generic
3475                                                             load/load
3476                                                             atomic/store/store
3477                                                             atomic/atomicrmw.
3478                                                           - Must happen before
3479                                                             any following
3480                                                             global/generic
3481                                                             load/load
3482                                                             atomic/store/store
3483                                                             atomic/atomicrmw.
3484                                                           - Ensures that all
3485                                                             memory operations
3486                                                             to local have
3487                                                             completed before
3488                                                             performing any
3489                                                             following global
3490                                                             memory operations.
3491                                                           - Ensures that the
3492                                                             preceding
3493                                                             local/generic load
3494                                                             atomic/atomicrmw
3495                                                             with an equal or
3496                                                             wider sync scope
3497                                                             and memory ordering
3498                                                             stronger than
3499                                                             unordered (this is
3500                                                             termed the
3501                                                             acquire-fence-paired-atomic
3502                                                             ) has completed
3503                                                             before following
3504                                                             global memory
3505                                                             operations. This
3506                                                             satisfies the
3507                                                             requirements of
3508                                                             acquire.
3509                                                           - Ensures that all
3510                                                             previous memory
3511                                                             operations have
3512                                                             completed before a
3513                                                             following
3514                                                             local/generic store
3515                                                             atomic/atomicrmw
3516                                                             with an equal or
3517                                                             wider sync scope
3518                                                             and memory ordering
3519                                                             stronger than
3520                                                             unordered (this is
3521                                                             termed the
3522                                                             release-fence-paired-atomic
3523                                                             ). This satisfies the
3524                                                             requirements of
3525                                                             release.
3526
3527     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3528                               - system                     vmcnt(0)
3529
3530                                                           - If OpenCL and
3531                                                             address space is
3532                                                             not generic, omit
3533                                                             lgkmcnt(0).
3534                                                           - However, since LLVM
3535                                                             currently has no
3536                                                             address space on
3537                                                             the fence need to
3538                                                             conservatively
3539                                                             always generate
3540                                                             (see comment for
3541                                                             previous fence).
3542                                                           - Could be split into
3543                                                             separate s_waitcnt
3544                                                             vmcnt(0) and
3545                                                             s_waitcnt
3546                                                             lgkmcnt(0) to allow
3547                                                             them to be
3548                                                             independently moved
3549                                                             according to the
3550                                                             following rules.
3551                                                           - s_waitcnt vmcnt(0)
3552                                                             must happen after
3553                                                             any preceding
3554                                                             global/generic
3555                                                             load/store/load
3556                                                             atomic/store
3557                                                             atomic/atomicrmw.
3558                                                           - s_waitcnt lgkmcnt(0)
3559                                                             must happen after
3560                                                             any preceding
3561                                                             local/generic
3562                                                             load/store/load
3563                                                             atomic/store
3564                                                             atomic/atomicrmw.
3565                                                           - Must happen before
3566                                                             the following
3567                                                             buffer_wbinvl1_vol.
3568                                                           - Ensures that the
3569                                                             preceding
3570                                                             global/local/generic
3571                                                             load
3572                                                             atomic/atomicrmw
3573                                                             with an equal or
3574                                                             wider sync scope
3575                                                             and memory ordering
3576                                                             stronger than
3577                                                             unordered (this is
3578                                                             termed the
3579                                                             acquire-fence-paired-atomic
3580                                                             ) has completed
3581                                                             before invalidating
3582                                                             the cache. This
3583                                                             satisfies the
3584                                                             requirements of
3585                                                             acquire.
3586                                                           - Ensures that all
3587                                                             previous memory
3588                                                             operations have
3589                                                             completed before a
3590                                                             following
3591                                                             global/local/generic
3592                                                             store
3593                                                             atomic/atomicrmw
3594                                                             with an equal or
3595                                                             wider sync scope
3596                                                             and memory ordering
3597                                                             stronger than
3598                                                             unordered (this is
3599                                                             termed the
3600                                                             release-fence-paired-atomic
3601                                                             ). This satisfies the
3602                                                             requirements of
3603                                                             release.
3604
3605                                                         2. buffer_wbinvl1_vol
3606
3607                                                           - Must happen before
3608                                                             any following
3609                                                             global/generic
3610                                                             load/load
3611                                                             atomic/store/store
3612                                                             atomic/atomicrmw.
3613                                                           - Ensures that
3614                                                             following loads
3615                                                             will not see stale
3616                                                             global data. This
3617                                                             satisfies the
3618                                                             requirements of
3619                                                             acquire.
3620
3621     **Sequential Consistent Atomic**
3622     -----------------------------------------------------------------------------------
3623     load atomic  seq_cst      - singlethread - global   *Same as corresponding
3624                               - wavefront    - local    load atomic acquire,
3625                                              - generic  except must generated
3626                                                         all instructions even
3627                                                         for OpenCL.*
3628     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3629                                              - generic
3630                                                           - Must
3631                                                             happen after
3632                                                             preceding
3633                                                             global/generic load
3634                                                             atomic/store
3635                                                             atomic/atomicrmw
3636                                                             with memory
3637                                                             ordering of seq_cst
3638                                                             and with equal or
3639                                                             wider sync scope.
3640                                                             (Note that seq_cst
3641                                                             fences have their
3642                                                             own s_waitcnt
3643                                                             lgkmcnt(0) and so do
3644                                                             not need to be
3645                                                             considered.)
3646                                                           - Ensures any
3647                                                             preceding
3648                                                             sequential
3649                                                             consistent local
3650                                                             memory instructions
3651                                                             have completed
3652                                                             before executing
3653                                                             this sequentially
3654                                                             consistent
3655                                                             instruction. This
3656                                                             prevents reordering
3657                                                             a seq_cst store
3658                                                             followed by a
3659                                                             seq_cst load. (Note
3660                                                             that seq_cst is
3661                                                             stronger than
3662                                                             acquire/release as
3663                                                             the reordering of
3664                                                             load acquire
3665                                                             followed by a store
3666                                                             release is
3667                                                             prevented by the
3668                                                             waitcnt of
3669                                                             the release, but
3670                                                             there is nothing
3671                                                             preventing a store
3672                                                             release followed by
3673                                                             load acquire from
3674                                                             competing out of
3675                                                             order.)
3676
3677                                                         2. *Following
3678                                                            instructions same as
3679                                                            corresponding load
3680                                                            atomic acquire,
3681                                                            except must generated
3682                                                            all instructions even
3683                                                            for OpenCL.*
3684     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3685                                                         load atomic acquire,
3686                                                         except must generated
3687                                                         all instructions even
3688                                                         for OpenCL.*
3689     load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3690                               - system       - generic     vmcnt(0)
3691
3692                                                           - Could be split into
3693                                                             separate s_waitcnt
3694                                                             vmcnt(0)
3695                                                             and s_waitcnt
3696                                                             lgkmcnt(0) to allow
3697                                                             them to be
3698                                                             independently moved
3699                                                             according to the
3700                                                             following rules.
3701                                                           - waitcnt lgkmcnt(0)
3702                                                             must happen after
3703                                                             preceding
3704                                                             global/generic load
3705                                                             atomic/store
3706                                                             atomic/atomicrmw
3707                                                             with memory
3708                                                             ordering of seq_cst
3709                                                             and with equal or
3710                                                             wider sync scope.
3711                                                             (Note that seq_cst
3712                                                             fences have their
3713                                                             own s_waitcnt
3714                                                             lgkmcnt(0) and so do
3715                                                             not need to be
3716                                                             considered.)
3717                                                           - waitcnt vmcnt(0)
3718                                                             must happen after
3719                                                             preceding
3720                                                             global/generic load
3721                                                             atomic/store
3722                                                             atomic/atomicrmw
3723                                                             with memory
3724                                                             ordering of seq_cst
3725                                                             and with equal or
3726                                                             wider sync scope.
3727                                                             (Note that seq_cst
3728                                                             fences have their
3729                                                             own s_waitcnt
3730                                                             vmcnt(0) and so do
3731                                                             not need to be
3732                                                             considered.)
3733                                                           - Ensures any
3734                                                             preceding
3735                                                             sequential
3736                                                             consistent global
3737                                                             memory instructions
3738                                                             have completed
3739                                                             before executing
3740                                                             this sequentially
3741                                                             consistent
3742                                                             instruction. This
3743                                                             prevents reordering
3744                                                             a seq_cst store
3745                                                             followed by a
3746                                                             seq_cst load. (Note
3747                                                             that seq_cst is
3748                                                             stronger than
3749                                                             acquire/release as
3750                                                             the reordering of
3751                                                             load acquire
3752                                                             followed by a store
3753                                                             release is
3754                                                             prevented by the
3755                                                             waitcnt of
3756                                                             the release, but
3757                                                             there is nothing
3758                                                             preventing a store
3759                                                             release followed by
3760                                                             load acquire from
3761                                                             competing out of
3762                                                             order.)
3763
3764                                                         2. *Following
3765                                                            instructions same as
3766                                                            corresponding load
3767                                                            atomic acquire,
3768                                                            except must generated
3769                                                            all instructions even
3770                                                            for OpenCL.*
3771     store atomic seq_cst      - singlethread - global   *Same as corresponding
3772                               - wavefront    - local    store atomic release,
3773                               - workgroup    - generic  except must generated
3774                                                         all instructions even
3775                                                         for OpenCL.*
3776     store atomic seq_cst      - agent        - global   *Same as corresponding
3777                               - system       - generic  store atomic release,
3778                                                         except must generated
3779                                                         all instructions even
3780                                                         for OpenCL.*
3781     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3782                               - wavefront    - local    atomicrmw acq_rel,
3783                               - workgroup    - generic  except must generated
3784                                                         all instructions even
3785                                                         for OpenCL.*
3786     atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3787                               - system       - generic  atomicrmw acq_rel,
3788                                                         except must generated
3789                                                         all instructions even
3790                                                         for OpenCL.*
3791     fence        seq_cst      - singlethread *none*     *Same as corresponding
3792                               - wavefront               fence acq_rel,
3793                               - workgroup               except must generated
3794                               - agent                   all instructions even
3795                               - system                  for OpenCL.*
3796     ============ ============ ============== ========== ===============================
3797
3798The memory order also adds the single thread optimization constrains defined in
3799table
3800:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3801
3802  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3803     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3804
3805     ============ ==============================================================
3806     LLVM Memory  Optimization Constraints
3807     Ordering
3808     ============ ==============================================================
3809     unordered    *none*
3810     monotonic    *none*
3811     acquire      - If a load atomic/atomicrmw then no following load/load
3812                    atomic/store/ store atomic/atomicrmw/fence instruction can
3813                    be moved before the acquire.
3814                  - If a fence then same as load atomic, plus no preceding
3815                    associated fence-paired-atomic can be moved after the fence.
3816     release      - If a store atomic/atomicrmw then no preceding load/load
3817                    atomic/store/ store atomic/atomicrmw/fence instruction can
3818                    be moved after the release.
3819                  - If a fence then same as store atomic, plus no following
3820                    associated fence-paired-atomic can be moved before the
3821                    fence.
3822     acq_rel      Same constraints as both acquire and release.
3823     seq_cst      - If a load atomic then same constraints as acquire, plus no
3824                    preceding sequentially consistent load atomic/store
3825                    atomic/atomicrmw/fence instruction can be moved after the
3826                    seq_cst.
3827                  - If a store atomic then the same constraints as release, plus
3828                    no following sequentially consistent load atomic/store
3829                    atomic/atomicrmw/fence instruction can be moved before the
3830                    seq_cst.
3831                  - If an atomicrmw/fence then same constraints as acq_rel.
3832     ============ ==============================================================
3833
3834Trap Handler ABI
3835~~~~~~~~~~~~~~~~
3836
3837For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3838(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3839the ``s_trap`` instruction with the following usage:
3840
3841  .. table:: AMDGPU Trap Handler for AMDHSA OS
3842     :name: amdgpu-trap-handler-for-amdhsa-os-table
3843
3844     =================== =============== =============== =======================
3845     Usage               Code Sequence   Trap Handler    Description
3846                                         Inputs
3847     =================== =============== =============== =======================
3848     reserved            ``s_trap 0x00``                 Reserved by hardware.
3849     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3850                                           ``queue_ptr`` ``debugtrap``
3851                                         ``VGPR0``:      intrinsic (not
3852                                           ``arg``       implemented).
3853     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3854                                           ``queue_ptr`` terminated and its
3855                                                         associated queue put
3856                                                         into the error state.
3857     ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
3858                                                           installed then
3859                                                           behaves as a
3860                                                           no-operation. The
3861                                                           trap handler is
3862                                                           entered and
3863                                                           immediately returns
3864                                                           to continue
3865                                                           execution of the
3866                                                           wavefront.
3867                                                         - If the debugger is
3868                                                           installed, causes
3869                                                           the debug trap to be
3870                                                           reported by the
3871                                                           debugger and the
3872                                                           wavefront is put in
3873                                                           the halt state until
3874                                                           resumed by the
3875                                                           debugger.
3876     reserved            ``s_trap 0x04``                 Reserved.
3877     reserved            ``s_trap 0x05``                 Reserved.
3878     reserved            ``s_trap 0x06``                 Reserved.
3879     debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
3880                                                         breakpoints.
3881     reserved            ``s_trap 0x08``                 Reserved.
3882     reserved            ``s_trap 0xfe``                 Reserved.
3883     reserved            ``s_trap 0xff``                 Reserved.
3884     =================== =============== =============== =======================
3885
3886AMDPAL
3887------
3888
3889This section provides code conventions used when the target triple OS is
3890``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3891from the application/runtime to each invocation of a hardware shader. These
3892parameters include both generic, application-controlled parameters called
3893*user data* as well as system-generated parameters that are a product of the
3894draw or dispatch execution.
3895
3896User Data
3897~~~~~~~~~
3898
3899Each hardware stage has a set of 32-bit *user data registers* which can be
3900written from a command buffer and then loaded into SGPRs when waves are launched
3901via a subsequent dispatch or draw operation. This is the way most arguments are
3902passed from the application/runtime to a hardware shader.
3903
3904Compute User Data
3905~~~~~~~~~~~~~~~~~
3906
3907Compute shader user data mappings are simpler than graphics shaders, and have a
3908fixed mapping.
3909
3910Note that there are always 10 available *user data entries* in registers -
3911entries beyond that limit must be fetched from memory (via the spill table
3912pointer) by the shader.
3913
3914  .. table:: PAL Compute Shader User Data Registers
3915     :name: pal-compute-user-data-registers
3916
3917     ============= ================================
3918     User Register Description
3919     ============= ================================
3920     0             Global Internal Table (32-bit pointer)
3921     1             Per-Shader Internal Table (32-bit pointer)
3922     2 - 11        Application-Controlled User Data (10 32-bit values)
3923     12            Spill Table (32-bit pointer)
3924     13 - 14       Thread Group Count (64-bit pointer)
3925     15            GDS Range
3926     ============= ================================
3927
3928Graphics User Data
3929~~~~~~~~~~~~~~~~~~
3930
3931Graphics pipelines support a much more flexible user data mapping:
3932
3933  .. table:: PAL Graphics Shader User Data Registers
3934     :name: pal-graphics-user-data-registers
3935
3936     ============= ================================
3937     User Register Description
3938     ============= ================================
3939     0             Global Internal Table (32-bit pointer)
3940     +             Per-Shader Internal Table (32-bit pointer)
3941     + 1-15        Application Controlled User Data
3942                   (1-15 Contiguous 32-bit Values in Registers)
3943     +             Spill Table (32-bit pointer)
3944     +             Draw Index (First Stage Only)
3945     +             Vertex Offset (First Stage Only)
3946     +             Instance Offset (First Stage Only)
3947     ============= ================================
3948
3949  The placement of the global internal table remains fixed in the first *user
3950  data SGPR register*. Otherwise all parameters are optional, and can be mapped
3951  to any desired *user data SGPR register*, with the following regstrictions:
3952
3953  * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3954    activehardware stage in a graphics pipeline (i.e. where the API vertex
3955    shader runs).
3956
3957  * Application-controlled user data must be mapped into a contiguous range of
3958    user data registers.
3959
3960  * The application-controlled user data range supports compaction remapping, so
3961    only *entries* that are actually consumed by the shader must be assigned to
3962    corresponding *registers*. Note that in order to support an efficient runtime
3963    implementation, the remapping must pack *registers* in the same order as
3964    *entries*, with unused *entries* removed.
3965
3966.. _pal_global_internal_table:
3967
3968Global Internal Table
3969~~~~~~~~~~~~~~~~~~~~~
3970
3971The global internal table is a table of *shader resource descriptors* (SRDs) that
3972define how certain engine-wide, runtime-managed resources should be accessed
3973from a shader. The majority of these resources have HW-defined formats, and it
3974is up to the compiler to write/read data as required by the target hardware.
3975
3976The following table illustrates the required format:
3977
3978  .. table:: PAL Global Internal Table
3979     :name: pal-git-table
3980
3981     ============= ================================
3982     Offset        Description
3983     ============= ================================
3984     0-3           Graphics Scratch SRD
3985     4-7           Compute Scratch SRD
3986     8-11          ES/GS Ring Output SRD
3987     12-15         ES/GS Ring Input SRD
3988     16-19         GS/VS Ring Output #0
3989     20-23         GS/VS Ring Output #1
3990     24-27         GS/VS Ring Output #2
3991     28-31         GS/VS Ring Output #3
3992     32-35         GS/VS Ring Input SRD
3993     36-39         Tessellation Factor Buffer SRD
3994     40-43         Off-Chip LDS Buffer SRD
3995     44-47         Off-Chip Param Cache Buffer SRD
3996     48-51         Sample Position Buffer SRD
3997     52            vaRange::ShadowDescriptorTable High Bits
3998     ============= ================================
3999
4000  The pointer to the global internal table passed to the shader as user data
4001  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
4002  the top 32 bits of the pipeline, so the shader may use the program
4003  counter's top 32 bits.
4004
4005Unspecified OS
4006--------------
4007
4008This section provides code conventions used when the target triple OS is
4009empty (see :ref:`amdgpu-target-triples`).
4010
4011Trap Handler ABI
4012~~~~~~~~~~~~~~~~
4013
4014For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
4015not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
4016instructions are handled as follows:
4017
4018  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
4019     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
4020
4021     =============== =============== ===========================================
4022     Usage           Code Sequence   Description
4023     =============== =============== ===========================================
4024     llvm.trap       s_endpgm        Causes wavefront to be terminated.
4025     llvm.debugtrap  *none*          Compiler warning given that there is no
4026                                     trap handler installed.
4027     =============== =============== ===========================================
4028
4029Source Languages
4030================
4031
4032.. _amdgpu-opencl:
4033
4034OpenCL
4035------
4036
4037When the language is OpenCL the following differences occur:
4038
40391. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
40402. The AMDGPU backend appends additional arguments to the kernel's explicit
4041   arguments for the AMDHSA OS (see
4042   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
40433. Additional metadata is generated
4044   (see :ref:`amdgpu-amdhsa-code-object-metadata`).
4045
4046  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
4047     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
4048
4049     ======== ==== ========= ===========================================
4050     Position Byte Byte      Description
4051              Size Alignment
4052     ======== ==== ========= ===========================================
4053     1        8    8         OpenCL Global Offset X
4054     2        8    8         OpenCL Global Offset Y
4055     3        8    8         OpenCL Global Offset Z
4056     4        8    8         OpenCL address of printf buffer
4057     5        8    8         OpenCL address of virtual queue used by
4058                             enqueue_kernel.
4059     6        8    8         OpenCL address of AqlWrap struct used by
4060                             enqueue_kernel.
4061     ======== ==== ========= ===========================================
4062
4063.. _amdgpu-hcc:
4064
4065HCC
4066---
4067
4068When the language is HCC the following differences occur:
4069
40701. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
4071
4072.. _amdgpu-assembler:
4073
4074Assembler
4075---------
4076
4077AMDGPU backend has LLVM-MC based assembler which is currently in development.
4078It supports AMDGCN GFX6-GFX9.
4079
4080This section describes general syntax for instructions and operands.
4081
4082Instructions
4083~~~~~~~~~~~~
4084
4085.. toctree::
4086   :hidden:
4087
4088   AMDGPUAsmGFX7
4089   AMDGPUAsmGFX8
4090   AMDGPUAsmGFX9
4091   AMDGPUOperandSyntax
4092
4093An instruction has the following syntax:
4094
4095    *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
4096
4097Note that operands are normally comma-separated while modifiers are space-separated.
4098
4099The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
4100
4101See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
4102:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
4103
4104Note that features under development are not included in this description.
4105
4106For more information about instructions, their semantics and supported combinations of
4107operands, refer to one of instruction set architecture manuals
4108[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
4109
4110Operands
4111~~~~~~~~
4112
4113The following syntax for register operands is supported:
4114
4115* SGPR registers: s0, ... or s[0], ...
4116* VGPR registers: v0, ... or v[0], ...
4117* TTMP registers: ttmp0, ... or ttmp[0], ...
4118* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
4119* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
4120* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
4121* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
4122* Register index expressions: v[2*2], s[1-1:2-1]
4123* 'off' indicates that an operand is not enabled
4124
4125Modifiers
4126~~~~~~~~~
4127
4128Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
4129
4130Instruction Examples
4131~~~~~~~~~~~~~~~~~~~~
4132
4133DS
4134++
4135
4136.. code-block:: nasm
4137
4138  ds_add_u32 v2, v4 offset:16
4139  ds_write_src2_b64 v2 offset0:4 offset1:8
4140  ds_cmpst_f32 v2, v4, v6
4141  ds_min_rtn_f64 v[8:9], v2, v[4:5]
4142
4143
4144For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
4145
4146FLAT
4147++++
4148
4149.. code-block:: nasm
4150
4151  flat_load_dword v1, v[3:4]
4152  flat_store_dwordx3 v[3:4], v[5:7]
4153  flat_atomic_swap v1, v[3:4], v5 glc
4154  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4155  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4156
4157For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
4158
4159MUBUF
4160+++++
4161
4162.. code-block:: nasm
4163
4164  buffer_load_dword v1, off, s[4:7], s1
4165  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4166  buffer_store_format_xy v[1:2], off, s[4:7], s1
4167  buffer_wbinvl1
4168  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4169
4170For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
4171
4172SMRD/SMEM
4173+++++++++
4174
4175.. code-block:: nasm
4176
4177  s_load_dword s1, s[2:3], 0xfc
4178  s_load_dwordx8 s[8:15], s[2:3], s4
4179  s_load_dwordx16 s[88:103], s[2:3], s4
4180  s_dcache_inv_vol
4181  s_memtime s[4:5]
4182
4183For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
4184
4185SOP1
4186++++
4187
4188.. code-block:: nasm
4189
4190  s_mov_b32 s1, s2
4191  s_mov_b64 s[0:1], 0x80000000
4192  s_cmov_b32 s1, 200
4193  s_wqm_b64 s[2:3], s[4:5]
4194  s_bcnt0_i32_b64 s1, s[2:3]
4195  s_swappc_b64 s[2:3], s[4:5]
4196  s_cbranch_join s[4:5]
4197
4198For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
4199
4200SOP2
4201++++
4202
4203.. code-block:: nasm
4204
4205  s_add_u32 s1, s2, s3
4206  s_and_b64 s[2:3], s[4:5], s[6:7]
4207  s_cselect_b32 s1, s2, s3
4208  s_andn2_b32 s2, s4, s6
4209  s_lshr_b64 s[2:3], s[4:5], s6
4210  s_ashr_i32 s2, s4, s6
4211  s_bfm_b64 s[2:3], s4, s6
4212  s_bfe_i64 s[2:3], s[4:5], s6
4213  s_cbranch_g_fork s[4:5], s[6:7]
4214
4215For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
4216
4217SOPC
4218++++
4219
4220.. code-block:: nasm
4221
4222  s_cmp_eq_i32 s1, s2
4223  s_bitcmp1_b32 s1, s2
4224  s_bitcmp0_b64 s[2:3], s4
4225  s_setvskip s3, s5
4226
4227For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
4228
4229SOPP
4230++++
4231
4232.. code-block:: nasm
4233
4234  s_barrier
4235  s_nop 2
4236  s_endpgm
4237  s_waitcnt 0 ; Wait for all counters to be 0
4238  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
4239  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
4240  s_sethalt 9
4241  s_sleep 10
4242  s_sendmsg 0x1
4243  s_sendmsg sendmsg(MSG_INTERRUPT)
4244  s_trap 1
4245
4246For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
4247
4248Unless otherwise mentioned, little verification is performed on the operands
4249of SOPP Instructions, so it is up to the programmer to be familiar with the
4250range or acceptable values.
4251
4252VALU
4253++++
4254
4255For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4256the assembler will automatically use optimal encoding based on its operands.
4257To force specific encoding, one can add a suffix to the opcode of the instruction:
4258
4259* _e32 for 32-bit VOP1/VOP2/VOPC
4260* _e64 for 64-bit VOP3
4261* _dpp for VOP_DPP
4262* _sdwa for VOP_SDWA
4263
4264VOP1/VOP2/VOP3/VOPC examples:
4265
4266.. code-block:: nasm
4267
4268  v_mov_b32 v1, v2
4269  v_mov_b32_e32 v1, v2
4270  v_nop
4271  v_cvt_f64_i32_e32 v[1:2], v2
4272  v_floor_f32_e32 v1, v2
4273  v_bfrev_b32_e32 v1, v2
4274  v_add_f32_e32 v1, v2, v3
4275  v_mul_i32_i24_e64 v1, v2, 3
4276  v_mul_i32_i24_e32 v1, -3, v3
4277  v_mul_i32_i24_e32 v1, -100, v3
4278  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4279  v_max_f16_e32 v1, v2, v3
4280
4281VOP_DPP examples:
4282
4283.. code-block:: nasm
4284
4285  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4286  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4287  v_mov_b32 v0, v0 wave_shl:1
4288  v_mov_b32 v0, v0 row_mirror
4289  v_mov_b32 v0, v0 row_bcast:31
4290  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4291  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4292  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4293
4294VOP_SDWA examples:
4295
4296.. code-block:: nasm
4297
4298  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4299  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4300  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4301  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4302  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4303
4304For full list of supported instructions, refer to "Vector ALU instructions".
4305
4306.. TODO
4307   Remove once we switch to code object v3 by default.
4308
4309HSA Code Object Directives
4310~~~~~~~~~~~~~~~~~~~~~~~~~~
4311
4312AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4313one can specify them with assembler directives.
4314
4315.hsa_code_object_version major, minor
4316+++++++++++++++++++++++++++++++++++++
4317
4318*major* and *minor* are integers that specify the version of the HSA code
4319object that will be generated by the assembler.
4320
4321.hsa_code_object_isa [major, minor, stepping, vendor, arch]
4322+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4323
4324
4325*major*, *minor*, and *stepping* are all integers that describe the instruction
4326set architecture (ISA) version of the assembly program.
4327
4328*vendor* and *arch* are quoted strings.  *vendor* should always be equal to
4329"AMD" and *arch* should always be equal to "AMDGPU".
4330
4331By default, the assembler will derive the ISA version, *vendor*, and *arch*
4332from the value of the -mcpu option that is passed to the assembler.
4333
4334.amdgpu_hsa_kernel (name)
4335+++++++++++++++++++++++++
4336
4337This directives specifies that the symbol with given name is a kernel entry point
4338(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4339
4340.amd_kernel_code_t
4341++++++++++++++++++
4342
4343This directive marks the beginning of a list of key / value pairs that are used
4344to specify the amd_kernel_code_t object that will be emitted by the assembler.
4345The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
4346any amd_kernel_code_t values that are unspecified a default value will be
4347used.  The default value for all keys is 0, with the following exceptions:
4348
4349- *kernel_code_version_major* defaults to 1.
4350- *machine_kind* defaults to 1.
4351- *machine_version_major*, *machine_version_minor*, and
4352  *machine_version_stepping* are derived from the value of the -mcpu option
4353  that is passed to the assembler.
4354- *kernel_code_entry_byte_offset* defaults to 256.
4355- *wavefront_size* defaults to 6.
4356- *kernarg_segment_alignment*, *group_segment_alignment*, and
4357  *private_segment_alignment* default to 4. Note that alignments are specified
4358  as a power of two, so a value of **n** means an alignment of 2^ **n**.
4359
4360The *.amd_kernel_code_t* directive must be placed immediately after the
4361function label and before any instructions.
4362
4363For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4364comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4365
4366Here is an example of a minimal amd_kernel_code_t specification:
4367
4368.. code-block:: none
4369
4370   .hsa_code_object_version 1,0
4371   .hsa_code_object_isa
4372
4373   .hsatext
4374   .globl  hello_world
4375   .p2align 8
4376   .amdgpu_hsa_kernel hello_world
4377
4378   hello_world:
4379
4380      .amd_kernel_code_t
4381         enable_sgpr_kernarg_segment_ptr = 1
4382         is_ptr64 = 1
4383         compute_pgm_rsrc1_vgprs = 0
4384         compute_pgm_rsrc1_sgprs = 0
4385         compute_pgm_rsrc2_user_sgpr = 2
4386         kernarg_segment_byte_size = 8
4387         wavefront_sgpr_count = 2
4388         workitem_vgpr_count = 3
4389     .end_amd_kernel_code_t
4390
4391     s_load_dwordx2 s[0:1], s[0:1] 0x0
4392     v_mov_b32 v0, 3.14159
4393     s_waitcnt lgkmcnt(0)
4394     v_mov_b32 v1, s0
4395     v_mov_b32 v2, s1
4396     flat_store_dword v[1:2], v0
4397     s_endpgm
4398   .Lfunc_end0:
4399        .size   hello_world, .Lfunc_end0-hello_world
4400
4401Predefined Symbols (-mattr=+code-object-v3)
4402~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4403
4404The AMDGPU assembler defines and updates some symbols automatically. These
4405symbols do not affect code generation.
4406
4407.amdgcn.gfx_generation_number
4408+++++++++++++++++++++++++++++
4409
4410Set to the GFX generation number of the target being assembled for. For
4411example, when assembling for a "GFX9" target this will be set to the integer
4412value "9". The possible GFX generation numbers are presented in
4413:ref:`amdgpu-processors`.
4414
4415.amdgcn.next_free_vgpr
4416++++++++++++++++++++++
4417
4418Set to zero before assembly begins. At each instruction, if the current value
4419of this symbol is less than or equal to the maximum VGPR number explicitly
4420referenced within that instruction then the symbol value is updated to equal
4421that VGPR number plus one.
4422
4423May be used to set the `.amdhsa_next_free_vpgr` directive in
4424:ref:`amdhsa-kernel-directives-table`.
4425
4426May be set at any time, e.g. manually set to zero at the start of each kernel.
4427
4428.amdgcn.next_free_sgpr
4429++++++++++++++++++++++
4430
4431Set to zero before assembly begins. At each instruction, if the current value
4432of this symbol is less than or equal the maximum SGPR number explicitly
4433referenced within that instruction then the symbol value is updated to equal
4434that SGPR number plus one.
4435
4436May be used to set the `.amdhsa_next_free_spgr` directive in
4437:ref:`amdhsa-kernel-directives-table`.
4438
4439May be set at any time, e.g. manually set to zero at the start of each kernel.
4440
4441Code Object Directives (-mattr=+code-object-v3)
4442~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4443
4444Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
4445architecture processors, and are not OS-specific. Directives which begin with
4446``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
4447``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
4448:ref:`amdgpu-processors`.
4449
4450.amdgcn_target <target>
4451+++++++++++++++++++++++
4452
4453Optional directive which declares the target supported by the containing
4454assembler source file. Valid values are described in
4455:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
4456to validate command-line options such as ``-triple``, ``-mcpu``, and those
4457which specify target features.
4458
4459.amdhsa_kernel <name>
4460+++++++++++++++++++++
4461
4462Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
4463``<name>.kd``, in the current location of the current section. Only valid when
4464the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
4465instruction to execute, and does not need to be previously defined.
4466
4467Marks the beginning of a list of directives used to generate the bytes of a
4468kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
4469Directives which may appear in this list are described in
4470:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
4471be valid for the target being assembled for, and cannot be repeated. Directives
4472support the range of values specified by the field they reference in
4473:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
4474assumed to have its default value, unless it is marked as "Required", in which
4475case it is an error to omit the directive. This list of directives is
4476terminated by an ``.end_amdhsa_kernel`` directive.
4477
4478  .. table:: AMDHSA Kernel Assembler Directives
4479     :name: amdhsa-kernel-directives-table
4480
4481     ======================================================== ================ ============ ===================
4482     Directive                                                Default          Supported On Description
4483     ======================================================== ================ ============ ===================
4484     ``.amdhsa_group_segment_fixed_size``                     0                GFX6-GFX9    Controls GROUP_SEGMENT_FIXED_SIZE in
4485                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4486     ``.amdhsa_private_segment_fixed_size``                   0                GFX6-GFX9    Controls PRIVATE_SEGMENT_FIXED_SIZE in
4487                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4488     ``.amdhsa_user_sgpr_private_segment_buffer``             0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
4489                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4490     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_PTR in
4491                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4492     ``.amdhsa_user_sgpr_queue_ptr``                          0                GFX6-GFX9    Controls ENABLE_SGPR_QUEUE_PTR in
4493                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4494     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                GFX6-GFX9    Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
4495                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4496     ``.amdhsa_user_sgpr_dispatch_id``                        0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_ID in
4497                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4498     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                GFX6-GFX9    Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
4499                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4500     ``.amdhsa_user_sgpr_private_segment_size``               0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
4501                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4502     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
4503                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4504     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_X in
4505                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4506     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Y in
4507                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4508     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Z in
4509                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4510     ``.amdhsa_system_sgpr_workgroup_info``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_INFO in
4511                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4512     ``.amdhsa_system_vgpr_workitem_id``                      0                GFX6-GFX9    Controls ENABLE_VGPR_WORKITEM_ID in
4513                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4514                                                                                            Possible values are defined in
4515                                                                                            :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
4516     ``.amdhsa_next_free_vgpr``                               Required         GFX6-GFX9    Maximum VGPR number explicitly referenced, plus one.
4517                                                                                            Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
4518                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4519     ``.amdhsa_next_free_sgpr``                               Required         GFX6-GFX9    Maximum SGPR number explicitly referenced, plus one.
4520                                                                                            Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4521                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4522     ``.amdhsa_reserve_vcc``                                  1                GFX6-GFX9    Whether the kernel may use the special VCC SGPR.
4523                                                                                            Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4524                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4525     ``.amdhsa_reserve_flat_scratch``                         1                GFX7-GFX9    Whether the kernel may use flat instructions to access
4526                                                                                            scratch memory. Used to calculate
4527                                                                                            GRANULATED_WAVEFRONT_SGPR_COUNT in
4528                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4529     ``.amdhsa_reserve_xnack_mask``                           Target           GFX8-GFX9    Whether the kernel may trigger XNACK replay.
4530                                                              Feature                       Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4531                                                              Specific                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4532                                                              (+xnack)
4533     ``.amdhsa_float_round_mode_32``                          0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_32 in
4534                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4535                                                                                            Possible values are defined in
4536                                                                                            :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4537     ``.amdhsa_float_round_mode_16_64``                       0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_16_64 in
4538                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4539                                                                                            Possible values are defined in
4540                                                                                            :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4541     ``.amdhsa_float_denorm_mode_32``                         0                GFX6-GFX9    Controls FLOAT_DENORM_MODE_32 in
4542                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4543                                                                                            Possible values are defined in
4544                                                                                            :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4545     ``.amdhsa_float_denorm_mode_16_64``                      3                GFX6-GFX9    Controls FLOAT_DENORM_MODE_16_64 in
4546                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4547                                                                                            Possible values are defined in
4548                                                                                            :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4549     ``.amdhsa_dx10_clamp``                                   1                GFX6-GFX9    Controls ENABLE_DX10_CLAMP in
4550                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4551     ``.amdhsa_ieee_mode``                                    1                GFX6-GFX9    Controls ENABLE_IEEE_MODE in
4552                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4553     ``.amdhsa_fp16_overflow``                                0                GFX9         Controls FP16_OVFL in
4554                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4555     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
4556                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4557     ``.amdhsa_exception_fp_denorm_src``                      0                GFX6-GFX9    Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
4558                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4559     ``.amdhsa_exception_fp_ieee_div_zero``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
4560                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4561     ``.amdhsa_exception_fp_ieee_overflow``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
4562                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4563     ``.amdhsa_exception_fp_ieee_underflow``                  0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
4564                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4565     ``.amdhsa_exception_fp_ieee_inexact``                    0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
4566                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4567     ``.amdhsa_exception_int_div_zero``                       0                GFX6-GFX9    Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
4568                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4569     ======================================================== ================ ============ ===================
4570
4571Example HSA Source Code (-mattr=+code-object-v3)
4572~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4573
4574Here is an example of a minimal assembly source file, defining one HSA kernel:
4575
4576.. code-block:: nasm
4577
4578  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
4579
4580  .text
4581  .globl hello_world
4582  .p2align 8
4583  .type hello_world,@function
4584  hello_world:
4585    s_load_dwordx2 s[0:1], s[0:1] 0x0
4586    v_mov_b32 v0, 3.14159
4587    s_waitcnt lgkmcnt(0)
4588    v_mov_b32 v1, s0
4589    v_mov_b32 v2, s1
4590    flat_store_dword v[1:2], v0
4591    s_endpgm
4592  .Lfunc_end0:
4593    .size   hello_world, .Lfunc_end0-hello_world
4594
4595  .rodata
4596  .p2align 6
4597  .amdhsa_kernel hello_world
4598    .amdhsa_user_sgpr_kernarg_segment_ptr 1
4599    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
4600    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
4601  .end_amdhsa_kernel
4602
4603
4604Additional Documentation
4605========================
4606
4607.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4608.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4609.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4610.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4611.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4612.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4613.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4614.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4615.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4616.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4617.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4618.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4619.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4620.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4621.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4622.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4623.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__
4624