=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================
  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as:

                    - AMD's ROCm runtime [AMD-ROCm]_ on Linux. See *AMD ROCm
                      Release Notes* [AMD-ROCm-Release-Notes]_ for supported
                      hardware and software.
                    - AMD's PAL runtime using the *amdhsa* loader on Windows.

     ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
                    runtime using the *amdpal* loader on Windows and Linux Pro.
     ``mesa3d``     Graphics shaders and compute kernels executed on the Mesa 3D
                    runtime.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
specific information.
  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ================= =========== ============== ======================
     Processor   Alternative     Target       dGPU/ Target            Target      OS Support     Example
                 Processor       Triple       APU   Features          Properties  *(see*         Products
                 Names           Architecture       Supported                     `amdgpu-os`_
                                                                                  *and
                                                                                  corresponding
                                                                                  runtime
                                                                                  release notes
                                                                                  for current
                                                                                  information
                                                                                  and level of
                                                                                  support)*
     =========== =============== ============ ===== ================= =========== ============== ======================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     ------------------------------------------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``r630``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rs880``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv670``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     ------------------------------------------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv730``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``rv770``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     ------------------------------------------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cypress``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``juniper``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``redwood``                 ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``sumo``                    ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     ------------------------------------------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``caicos``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``cayman``                  ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``turks``                   ``r600``     dGPU                    - Does not
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     ------------------------------------------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not  - AMD PAL
                                                                        support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not  - AMD PAL
                 - ``verde``                                            support
                                                                        generic
                                                                        address
                                                                        space
     ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not  - AMD PAL
                 - ``oland``                                            support
                                                                        generic
                                                                        address
                                                                        space
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     ------------------------------------------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                                 - AMD ROCm     - A6-7000
                                                                                  - AMD PAL      - A6 Pro-7050B
                                                                                                 - A8-7100
                                                                                                 - A8 Pro-7150B
                                                                                                 - A10-7300
                                                                                                 - A10 Pro-7350B
                                                                                                 - FX-7500
                                                                                                 - A8-7200P
                                                                                                 - A10-7400P
                                                                                                 - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                                - AMD ROCm     - FirePro W8100
                                                                                  - AMD PAL      - FirePro W9100
                                                                                                 - FirePro S9150
                                                                                                 - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU                                - AMD ROCm     - Radeon R9 290
                                                                                  - AMD PAL      - Radeon R9 290x
                                                                                                 - Radeon R390
                                                                                                 - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                                 - AMD PAL      - E1-2100
                 - ``mullins``                                                                   - E1-2200
                                                                                                 - E1-2500
                                                                                                 - E2-3000
                                                                                                 - E2-3800
                                                                                                 - A4-5000
                                                                                                 - A4-5100
                                                                                                 - A6-5200
                                                                                                 - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                                - AMD PAL      - Radeon HD 7790
                                                                                                 - Radeon HD 8770
                                                                                                 - R7 260
                                                                                                 - R7 260X
     ``gfx705``                  ``amdgcn``   APU                                 - AMD PAL      *TBA*

                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     ------------------------------------------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - AMD ROCm                 - A6-8500P
                                                                                  - AMD PAL      - Pro A6-8500B
                                                                                                 - A8-8600P
                                                                                                 - Pro A8-8600B
                                                                                                 - FX-8800P
                                                                                                 - Pro A12-8800B
                                                                                                 - A10-8700P
                                                                                                 - Pro A10-8700B
                                                                                                 - A10-8780P
                                                                                                 - A10-9600P
                                                                                                 - A10-9630P
                                                                                                 - A12-9700P
                                                                                                 - A12-9730P
                                                                                                 - FX-9800P
                                                                                                 - FX-9830P
                                                                                                 - E2-9010
                                                                                                 - A6-9210
                                                                                                 - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                                - AMD ROCm     - Radeon R9 285
                 - ``tonga``                                                      - AMD PAL      - Radeon R9 380
                                                                                                 - Radeon R9 385
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                - AMD ROCm     - Radeon R9 Nano
                                                                                  - AMD PAL      - Radeon R9 Fury
                                                                                                 - Radeon R9 FuryX
                                                                                                 - Radeon Pro Duo
                                                                                                 - FirePro S9300x2
                                                                                                 - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU                                - AMD ROCm     - Radeon RX 470
                                                                                  - AMD PAL      - Radeon RX 480
                                                                                                 - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU                                - AMD ROCm     - Radeon RX 460
                                                                                  - AMD PAL
     ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                                - AMD ROCm     - FirePro S7150
                                                                                  - AMD PAL      - FirePro S7100
                                                                                                 - FirePro W7100
                                                                                                 - Mobile FirePro
                                                                                                   M7170
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - AMD ROCm                 *TBA*
                                                                                  - AMD PAL
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     **GCN GFX9** [AMD-GCN-GFX9]_
     ------------------------------------------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack           - AMD ROCm                 - Radeon Vega
                                                                                  - AMD PAL        Frontier Edition
                                                                                                 - Radeon RX Vega 56
                                                                                                 - Radeon RX Vega 64
                                                                                                 - Radeon RX Vega 64
                                                                                                   Liquid
                                                                                                 - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - AMD ROCm                 - Ryzen 3 2200G
                                                                                  - AMD PAL      - Ryzen 5 2400G
     ``gfx904``                  ``amdgcn``   dGPU  - xnack           - AMD ROCm                 *TBA*
                                                                                  - AMD PAL
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - AMD ROCm                 - Radeon Instinct MI50
                                                    - xnack           - AMD PAL                  - Radeon Instinct MI60
                                                                                                 - Radeon VII
                                                                                                 - Radeon Pro VII
     ``gfx908``                  ``amdgcn``   dGPU  - sramecc         - AMD ROCm                 *TBA*
                                                    - xnack
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     ``gfx909``                  ``amdgcn``   APU   - xnack           - AMD PAL                  *TBA*

                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     ``gfx90c``                  ``amdgcn``   APU   - xnack           - AMD PAL                  - Ryzen 7 4700G
                                                                                                 - Ryzen 7 4700GE
                                                                                                 - Ryzen 5 4600G
                                                                                                 - Ryzen 5 4600GE
                                                                                                 - Ryzen 3 4300G
                                                                                                 - Ryzen 3 4300GE
                                                                                                 - Ryzen Pro 4000G
                                                                                                 - Ryzen 7 Pro 4700G
                                                                                                 - Ryzen 7 Pro 4750GE
                                                                                                 - Ryzen 5 Pro 4650G
                                                                                                 - Ryzen 5 Pro 4650GE
                                                                                                 - Ryzen 3 Pro 4350G
                                                                                                 - Ryzen 3 Pro 4350GE

     **GCN GFX10** [AMD-GCN-GFX10]_
     ------------------------------------------------------------------------------------------------------------------
     ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - AMD ROCm                 - Radeon RX 5700
                                                    - wavefrontsize64 - AMD PAL                  - Radeon RX 5700 XT
                                                    - xnack                                      - Radeon Pro 5600 XT
                                                                                                 - Radeon Pro 5600M
     ``gfx1011``                 ``amdgcn``   dGPU  - cumode          - AMD ROCm                 *TBA*
                                                    - wavefrontsize64 - AMD PAL
                                                    - xnack
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - AMD ROCm                 - Radeon RX 5500
                                                    - wavefrontsize64 - AMD PAL                  - Radeon RX 5500 XT
                                                    - xnack

     ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - AMD ROCm                 *TBA*
                                                    - wavefrontsize64 - AMD PAL
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - AMD ROCm                 *TBA*
                                                    - wavefrontsize64 - AMD PAL
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - AMD PAL                  *TBA*
                                                    - wavefrontsize64
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     ``gfx1033``                 ``amdgcn``   APU   - cumode          - AMD PAL                  *TBA*
                                                    - wavefrontsize64
                                                                                                 .. TODO::

                                                                                                    Add product
                                                                                                    names.

     =========== =============== ============ ===== ================= =========== ============== ======================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify the target feature as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value.
  ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-features-table

     =============== ============================ ==================================================
     Target Feature  Clang Option to Control      Description
     Name
     =============== ============================ ==================================================
     cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
                                                  when generating code for kernels. When disabled
                                                  native WGP wavefront execution mode is used,
                                                  when enabled CU wavefront execution mode is used
                                                  (see :ref:`amdgpu-amdhsa-memory-model`).

     sramecc         - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for SRAMECC.

                                                  If not specified, generate code that can be
                                                  loaded and executed in a process with either
                                                  setting of SRAMECC.

     wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
                                                  generating code for kernels. When disabled
                                                  native wavefront size 32 is used, when enabled
                                                  wavefront size 64 is used.

     xnack           - ``-mcpu``                  If specified, generate code that can only be
                     - ``--offload-arch``         loaded and executed in a process that has a
                                                  matching setting for XNACK replay.

                                                  If not specified, generate code that can be
                                                  loaded and executed in a process with either
                                                  setting of XNACK replay.

                                                  This is used for demand paging and page
                                                  migration. If XNACK replay is enabled in
                                                  the device, then if a page fault occurs
                                                  the code may execute incorrectly if the
                                                  ``xnack`` feature is not enabled. Executing
                                                  code that has the feature enabled on a
                                                  device that does not have XNACK replay
                                                  enabled will execute correctly but may
                                                  be less performant than code with the
                                                  feature disabled.
     =============== ============================ ==================================================

.. _amdgpu-target-id:

Target ID
---------

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetic order.

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
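The target ID rules in the Target ID section above (primary versus alternative processor names, each feature at most once, features in alphabetic order in the canonical form) can be illustrated with a small Python sketch. This is not the Clang Offload Bundler implementation; the function name is hypothetical and the alias table is a small illustrative subset of the alternative processor names in :ref:`amdgpu-processor-table`:

```python
# Illustrative sketch of target ID canonicalization; not the actual
# Clang Offload Bundler implementation.

# Small illustrative subset of alternative -> primary processor names.
ALTERNATIVE_NAMES = {
    "carrizo": "gfx801",
    "fiji": "gfx803",
    "hawaii": "gfx701",
    "tongapro": "gfx805",
}

def canonicalize_target_id(target_id: str) -> str:
    """Map an alternative processor name to its primary name and sort
    target features alphabetically, as the canonical form requires."""
    processor, *features = target_id.split(":")
    processor = ALTERNATIVE_NAMES.get(processor, processor)
    # Each target feature may appear at most once in a target ID.
    names = [feature.rstrip("+-") for feature in features]
    if len(names) != len(set(names)):
        raise ValueError("duplicate target feature in target ID")
    return ":".join([processor] + sorted(features))

print(canonicalize_target_id("gfx908:xnack-:sramecc+"))  # gfx908:sramecc+:xnack-
print(canonicalize_target_id("carrizo:xnack+"))          # gfx801:xnack+
```

Sorting the features makes comparison of two target IDs independent of the order in which the features were written, which is the point of the canonical form.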
.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

  .. table:: AMDGPU Address Spaces
     :name: amdgpu-address-spaces-table

     ================================= =============== =========== ================ ======= ============================
     ..                                                64-Bit Process Address Space
     --------------------------------- --------------- ----------- ---------------- ------------------------------------
     Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                       Space Number    Name        Name             Size
     ================================= =============== =========== ================ ======= ============================
     Generic                           0               flat        flat             64      0x0000000000000000
     Global                            1               global      global           64      0x0000000000000000
     Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
     Local                             3               group       LDS              32      0xFFFFFFFF
     Constant                          4               constant    *same as global* 64      0x0000000000000000
     Private                           5               private     scratch          32      0xFFFFFFFF
     Constant 32-bit                   6               *TODO*                               0x00000000
     Buffer Fat Pointer (experimental) 7               *TODO*
     ================================= =============== =========== ================ ======= ============================

**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.
  The generic address space uses the hardware flat address support for two fixed
  ranges of virtual addresses (the private and local apertures), that are
  outside the range of addressable global memory, to map from a flat address to
  a private or local address. This uses FLAT instructions that can take a flat
  address and access global, private (scratch), and group (LDS) memory depending
  on whether the address is within one of the aperture ranges.

  Flat access to scratch requires hardware aperture setup and setup in the
  kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
  access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
  setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).

  To convert between a private or group address space address (termed a segment
  address) and a flat address the base address of the corresponding aperture
  can be used. For GFX7-GFX8 these are available in the
  :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
  Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
  GFX9-GFX10 the aperture base addresses are directly available as inline
  constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
  In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
  aligned to 2^32 which makes it easier to convert from flat to segment or
  segment to flat.

  A global address space address has the same value when used as a flat address
  so no conversion is needed.

**Global and Constant**
  The global and constant address spaces both use global virtual addresses,
  which are the same virtual address space used by the CPU. However, some
  virtual addresses may only be accessible to the CPU, some only accessible
  by the GPU, and some by both.
  Using the constant address space indicates that the data will not change
  during the execution of the kernel. This allows scalar read instructions to
  be used. As the constant address space can only be modified on the host
  side, a generic pointer loaded from the constant address space is safe to
  assume to be a global pointer since only the device global memory is visible
  and managed on the host side. The vector and scalar L1 caches are invalidated
  of volatile data before each kernel dispatch execution to allow constant
  memory to change values between kernel dispatches.

**Region**
  The region address space uses the hardware Global Data Store (GDS). All
  wavefronts executing on the same device will access the same memory for any
  given region address. However, the same region address accessed by wavefronts
  executing on different devices will access different memory. It is higher
  performance than global memory. It is allocated by the runtime. The data
  store (DS) instructions can be used to access it.

**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
  automatically allocated when the hardware creates the wavefronts of a
  work-group, and freed when all the wavefronts of a work-group have
  terminated. All wavefronts belonging to the same work-group will access the
  same memory for any given local address. However, the same local address
  accessed by wavefronts belonging to different work-groups will access
  different memory. It is higher performance than global memory. The data store
  (DS) instructions can be used to access it.

**Private**
  The private address space uses the hardware scratch memory support which
  automatically allocates memory when it creates a wavefront and frees it when
  a wavefront terminates.
  The memory accessed by a lane of a wavefront for any
  given private address will be different from the memory accessed by another
  lane of the same or different wavefront for the same private address.

  If a kernel dispatch uses scratch, then the hardware allocates memory from a
  pool of backing memory allocated by the runtime for each wavefront. The lanes
  of the wavefront access this using dword (4 byte) interleaving. The mapping
  used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

  If each lane of a wavefront accesses the same private address, the
  interleaving results in adjacent dwords being accessed and hence requires
  fewer cache lines to be fetched.

  There are different ways that the wavefront scratch base address is
  determined by a wavefront (see
  :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

  Scratch memory can be accessed in an interleaved manner using buffer
  instructions with the scratch buffer descriptor and per wavefront scratch
  offset, by the scratch instructions, or by flat instructions. Multi-dword
  access is not supported except by flat and scratch instructions in
  GFX9-GFX10.

**Constant 32-bit**
  *TODO*

**Buffer Fat Pointer**
  The buffer fat pointer is an experimental address space that is currently
  unsupported in the backend. It exposes a non-integral pointer that is in
  the future intended to support the modelling of 128-bit buffer descriptors
  plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
  *pointer*), allowing normal LLVM load/store/atomic operations to be used to
  model the buffer descriptors used heavily in graphics workloads targeting
  the backend.
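The private address mapping described under **Private** above can be checked with a small Python sketch. This is a sketch only, assuming a wavefront size of 64; the scratch base address used is an arbitrary illustrative value:

```python
# Sketch of the private address to backing memory address mapping for
# dword (4 byte) interleaved scratch, assuming a wavefront size of 64.

WAVEFRONT_SIZE = 64

def backing_address(wavefront_scratch_base, wavefront_lane_id, private_address):
    return (wavefront_scratch_base
            + (private_address // 4) * WAVEFRONT_SIZE * 4
            + wavefront_lane_id * 4
            + private_address % 4)

# When every lane accesses the same private address, consecutive lanes
# touch adjacent dwords, so the accesses fall into few cache lines.
print([hex(backing_address(0x1000, lane, 8)) for lane in range(4)])
# ['0x1200', '0x1204', '0x1208', '0x120c']
```

Note that byte-granular accesses stay within a lane's dword: for the same lane, private addresses 8 and 9 map to adjacent bytes of one backing dword.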
.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section provides LLVM memory synchronization scopes supported by the AMDGPU
backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different from the OpenCL [OpenCL]_ memory model which does not have
scope inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ======================= ===================================================
     LLVM Sync Scope         Description
     ======================= ===================================================
     *none*                  The default: ``system``.

                             Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``.
                             - ``agent`` and executed by a thread on the same
                               agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``agent``               Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system`` or ``agent`` and executed by a thread
                               on the same agent.
                             - ``workgroup`` and executed by a thread in the
                               same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``workgroup``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent`` or ``workgroup`` and
                               executed by a thread in the same work-group.
                             - ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``wavefront``           Synchronizes with, and participates in modification
                             and seq_cst total orderings with, other operations
                             (except image operations) for all address spaces
                             (except private, or generic that accesses private)
                             provided the other operation's sync scope is:

                             - ``system``, ``agent``, ``workgroup`` or
                               ``wavefront`` and executed by a thread in the
                               same wavefront.

     ``singlethread``        Only synchronizes with and participates in
                             modification and seq_cst total orderings with,
                             other operations (except image operations) running
                             in the same thread for all address spaces (for
                             example, in signal handlers).

     ``one-as``              Same as ``system`` but only synchronizes with other
                             operations within the same address space.

     ``agent-one-as``        Same as ``agent`` but only synchronizes with other
                             operations within the same address space.

     ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
                             other operations within the same address space.

     ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
                             other operations within the same address space.

     ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
                             other operations within the same address space.
     ======================= ===================================================

LLVM IR Intrinsics
------------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

LLVM IR Attributes
------------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute
                                             [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_.
     "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                             mode register to be set on entry. Overrides the default
                                             for the calling convention.
     "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field
                                             of the mode register to be set on entry. Overrides the
                                             default for the calling convention.
     ======================================= ==========================================================

.. _amdgpu-elf-code-object:

ELF Code Object
===============

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

.. _amdgpu-elf-header:

Header
------

The AMDGPU backend uses the following ELF header:
  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
                                - ``ELFABIVERSION_AMDGPU_HSA_V3``
                                - ``ELFABIVERSION_AMDGPU_HSA_V4``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
                                :ref:`amdgpu-elf-header-e_flags-table-v3`,
                                and :ref:`amdgpu-elf-header-e_flags-table-v4`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
     ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
     ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    process address space applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
887
888 ``e_ident[EI_OSABI]``
889 One of the following AMDGPU target architecture specific OS ABIs
890 (see :ref:`amdgpu-os`):
891
892 * ``ELFOSABI_NONE`` for *unknown* OS.
893
894 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
895
896 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
897
898 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
899
900 ``e_ident[EI_ABIVERSION]``
901 The ABI version of the AMDGPU target architecture specific OS ABI to which the code
902 object conforms:
903
904 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
905 runtime ABI for code object V2. Specify using the Clang option
906 ``-mcode-object-version=2``.
907
908 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
909 runtime ABI for code object V3. Specify using the Clang option
910 ``-mcode-object-version=3``.
911
912 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
913 runtime ABI for code object V4. Specify using the Clang option
914 ``-mcode-object-version=4``. This is the default code object
915 version if not specified.
916
917 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
918 runtime ABI.
919
920 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
921 3D runtime ABI.
922
923 ``e_type``
924 Can be one of the following values:
925
926
927 ``ET_REL``
928 The type produced by the AMDGPU backend compiler as it is a relocatable code
929 object.
930
931 ``ET_DYN``
932 The type produced by the linker as it is a shared code object.
933
934 The AMD HSA runtime loader requires an ``ET_DYN`` code object.
935
936 ``e_machine``
937 The value ``EM_AMDGPU`` is used for the machine for all processors supported
938 by the ``r600`` and ``amdgcn`` architectures (see
939 :ref:`amdgpu-processor-table`).
The specific processor is specified in the
940 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
941 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
942 ``e_flags`` for code object V3 to V4 (see
943 :ref:`amdgpu-elf-header-e_flags-table-v3` and
944 :ref:`amdgpu-elf-header-e_flags-table-v4`).
945
946 ``e_entry``
947 The entry point is 0 as the entry points for individual kernels must be
948 selected in order to invoke them through AQL packets.
949
950 ``e_flags``
951 The AMDGPU backend uses the following ELF header flags:
952
953 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
954 :name: amdgpu-elf-header-e_flags-v2-table
955
956 ===================================== ===== =============================
957 Name Value Description
958 ===================================== ===== =============================
959 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
960 target feature is
961 enabled for all code
962 contained in the code object.
963 If the processor
964 does not support the
965 ``xnack`` target
966 feature then it must
967 be 0.
968 See
969 :ref:`amdgpu-target-features`.
970 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
971 handler is enabled for all
972 code contained in the code
973 object. If the processor
974 does not support a trap
975 handler then it must be 0.
976 See
977 :ref:`amdgpu-target-features`.
978 ===================================== ===== =============================
979
980 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
981 :name: amdgpu-elf-header-e_flags-table-v3
982
983 ================================= ===== =============================
984 Name Value Description
985 ================================= ===== =============================
986 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
987 mask for
988 ``EF_AMDGPU_MACH_xxx`` values
989 defined in
990 :ref:`amdgpu-ef-amdgpu-mach-table`.
991 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
992 target feature is
993 enabled for all code
994 contained in the code object.
995 If the processor
996 does not support the
997 ``xnack`` target
998 feature then it must
999 be 0.
1000 See
1001 :ref:`amdgpu-target-features`.
1002 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
1003 target feature is
1004 enabled for all code
1005 contained in the code object.
1006 If the processor
1007 does not support the
1008 ``sramecc`` target
1009 feature then it must
1010 be 0.
1011 See
1012 :ref:`amdgpu-target-features`.
1013 ================================= ===== =============================
1014
1015 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
1016 :name: amdgpu-elf-header-e_flags-table-v4
1017
1018 ============================================ ===== ===================================
1019 Name Value Description
1020 ============================================ ===== ===================================
1021 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1022 mask for
1023 ``EF_AMDGPU_MACH_xxx`` values
1024 defined in
1025 :ref:`amdgpu-ef-amdgpu-mach-table`.
1026 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1027 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1028 values.
1029 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1030 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1031 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1032 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1033 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1034 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1035 values.
1036 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1037 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
1038 ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1039 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1040 ============================================ ===== =================================== 1041 1042 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values 1043 :name: amdgpu-ef-amdgpu-mach-table 1044 1045 ==================================== ========== ============================= 1046 Name Value Description (see 1047 :ref:`amdgpu-processor-table`) 1048 ==================================== ========== ============================= 1049 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* 1050 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` 1051 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` 1052 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` 1053 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` 1054 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` 1055 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` 1056 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` 1057 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` 1058 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` 1059 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` 1060 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` 1061 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` 1062 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` 1063 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` 1064 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` 1065 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` 1066 *reserved* 0x011 - Reserved for ``r600`` 1067 0x01f architecture processors. 1068 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` 1069 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` 1070 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` 1071 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` 1072 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` 1073 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` 1074 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` 1075 *reserved* 0x027 Reserved. 
1076 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` 1077 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` 1078 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` 1079 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` 1080 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` 1081 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` 1082 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` 1083 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` 1084 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908`` 1085 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` 1086 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c`` 1087 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` 1088 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` 1089 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` 1090 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030`` 1091 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031`` 1092 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032`` 1093 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033`` 1094 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602`` 1095 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705`` 1096 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805`` 1097 ==================================== ========== ============================= 1098 1099Sections 1100-------- 1101 1102An AMDGPU target ELF code object has the standard ELF sections which include: 1103 1104 .. 
table:: AMDGPU ELF Sections
1105 :name: amdgpu-elf-sections-table
1106
1107 ================== ================ =================================
1108 Name Type Attributes
1109 ================== ================ =================================
1110 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1111 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1112 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1113 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1114 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1115 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1116 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1117 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1118 ``.note`` ``SHT_NOTE`` *none*
1119 ``.rela``\ *name* ``SHT_RELA`` *none*
1120 ``.rela.dyn`` ``SHT_RELA`` *none*
1121 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1122 ``.shstrtab`` ``SHT_STRTAB`` *none*
1123 ``.strtab`` ``SHT_STRTAB`` *none*
1124 ``.symtab`` ``SHT_SYMTAB`` *none*
1125 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1126 ================== ================ =================================
1127
1128 These sections have their standard meanings (see [ELF]_) and are only generated
1129 if needed.
1130
1131 ``.debug``\ *\**
1132 The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1133 information on the DWARF produced by the AMDGPU backend.
1134
1135 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1136 The standard sections used by a dynamic loader.
1137
1138 ``.note``
1139 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1140 backend.
1141
1142 ``.rela``\ *name*, ``.rela.dyn``
1143 For relocatable code objects, *name* is the name of the section to which the
1144 relocation records apply. For example, ``.rela.text`` is the section name for
1145 relocation records associated with the ``.text`` section.
1146
1147 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1148 records from each of the relocatable code object's ``.rela``\ *name* sections.
1149
1150 See :ref:`amdgpu-relocation-records` for the relocation records supported by
1151 the AMDGPU backend.
1152
1153 ``.text``
1154 The executable machine code for the kernels and functions they call. Generated
1155 as position independent code. See :ref:`amdgpu-code-conventions` for
1156 information on conventions used in the ISA generation.
1157
1158 .. _amdgpu-note-records:
1159
1160 Note Records
1161 ------------
1162
1163 The AMDGPU backend code object contains ELF note records in the ``.note``
1164 section. The set of generated notes and their semantics depend on the code
1165 object version; see :ref:`amdgpu-note-records-v2` and
1166 :ref:`amdgpu-note-records-v3-v4`.
1167
1168 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1169 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1170 byte aligned. In addition, minimal zero-byte padding must be generated to
1171 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1172 field of the ``.note`` section must be at least 4 to indicate at least 4 byte
1173 alignment.
1174
1175 .. _amdgpu-note-records-v2:
1176
1177 Code Object V2 Note Records
1178 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1179
1180 .. warning::
1181 Code object V2 is not the default code object version emitted by
1182 this version of LLVM.
1183
1184 The AMDGPU backend code object uses the following ELF note record in the
1185 ``.note`` section when compiling for code object V2.
1186
1187 The note record vendor field is "AMD".
1188
1189 Additional note records may be present, but any which are not documented here
1190 are deprecated and should not be used.
1191
1192 ..
table:: AMDGPU Code Object V2 ELF Note Records 1193 :name: amdgpu-elf-note-records-v2-table 1194 1195 ===== ===================================== ====================================== 1196 Name Type Description 1197 ===== ===================================== ====================================== 1198 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version. 1199 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL 1200 Finalizer and not the LLVM compiler. 1201 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version. 1202 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in 1203 YAML [YAML]_ textual format. 1204 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name. 1205 ===== ===================================== ====================================== 1206 1207.. 1208 1209 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values 1210 :name: amdgpu-elf-note-record-enumeration-values-v2-table 1211 1212 ===================================== ===== 1213 Name Value 1214 ===================================== ===== 1215 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1 1216 ``NT_AMD_HSA_HSAIL`` 2 1217 ``NT_AMD_HSA_ISA_VERSION`` 3 1218 *reserved* 4-9 1219 ``NT_AMD_HSA_METADATA`` 10 1220 ``NT_AMD_HSA_ISA_NAME`` 11 1221 ===================================== ===== 1222 1223``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1224 Specifies the code object version number. The description field has the 1225 following layout: 1226 1227 .. code:: 1228 1229 struct amdgpu_hsa_note_code_object_version_s { 1230 uint32_t major_version; 1231 uint32_t minor_version; 1232 }; 1233 1234 The ``major_version`` has a value less than or equal to 2. 1235 1236``NT_AMD_HSA_HSAIL`` 1237 Specifies the HSAIL properties used by the HSAIL Finalizer. The description 1238 field has the following layout: 1239 1240 .. 
code::
1241
1242 struct amdgpu_hsa_note_hsail_s {
1243 uint32_t hsail_major_version;
1244 uint32_t hsail_minor_version;
1245 uint8_t profile;
1246 uint8_t machine_model;
1247 uint8_t default_float_round;
1248 };
1249
1250 ``NT_AMD_HSA_ISA_VERSION``
1251 Specifies the target ISA version. The description field has the following layout:
1252
1253 .. code::
1254
1255 struct amdgpu_hsa_note_isa_s {
1256 uint16_t vendor_name_size;
1257 uint16_t architecture_name_size;
1258 uint32_t major;
1259 uint32_t minor;
1260 uint32_t stepping;
1261 char vendor_and_architecture_name[1];
1262 };
1263
1264 ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1265 vendor and architecture names respectively, including the NUL character.
1266
1267 ``vendor_and_architecture_name`` contains the NUL terminated string for the
1268 vendor, immediately followed by the NUL terminated string for the
1269 architecture.
1270
1271 This note record is used by the HSA runtime loader.
1272
1273 Code object V2 only supports a limited number of processors and has fixed
1274 settings for target features. See
1275 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1276 processors and the corresponding target ID. In the table the note record ISA
1277 name is a concatenation of the vendor name, architecture name, major, minor,
1278 and stepping separated by a ":".
1279
1280 The target ID column shows the processor name and fixed target features used
1281 by the LLVM compiler. The LLVM compiler does not generate a
1282 ``NT_AMD_HSA_HSAIL`` note record.
1283
1284 A code object generated by the Finalizer also uses code object V2 and always
1285 generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1286 ``sramecc`` target feature are as shown in
1287 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1288 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1289 bit.
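Given the ``amdgpu_hsa_note_isa_s`` layout above, the note record ISA name (vendor, architecture, major, minor, and stepping joined by ":") can be reconstructed mechanically from a raw ``desc`` blob. The following is a minimal illustrative sketch (the function name is invented; this is not part of the HSA runtime loader):

```python
import struct

def isa_name_from_desc(desc: bytes) -> str:
    """Rebuild "vendor:arch:major:minor:stepping" from the
    amdgpu_hsa_note_isa_s layout described above."""
    # <HHIII is 16 bytes: two uint16 sizes followed by three uint32 values.
    vendor_size, arch_size, major, minor, stepping = struct.unpack_from("<HHIII", desc)
    names = desc[16:]  # vendor_and_architecture_name starts after the header
    # The sizes include the terminating NUL characters, so strip them.
    vendor = names[:vendor_size - 1].decode()
    arch = names[vendor_size:vendor_size + arch_size - 1].decode()
    return f"{vendor}:{arch}:{major}:{minor}:{stepping}"
```

For example, a ``desc`` encoding vendor ``AMD``, architecture ``AMDGPU``, and version 8.0.2 yields ``AMD:AMDGPU:8:0:2``, which corresponds to target ID ``gfx802`` in :ref:`amdgpu-elf-note-record-supported_processors-v2-table`.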
1290
1291 ``NT_AMD_HSA_ISA_NAME``
1292 Specifies the target ISA name as a non-NUL terminated string.
1293
1294 This note record is not used by the HSA runtime loader.
1295
1296 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1297 V2's limited support of processors and fixed settings for target features.
1298
1299 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1300 from the string to the corresponding target ID. If the ``xnack`` target
1301 feature is supported and enabled, the string produced by the LLVM compiler
1302 may have ``+xnack`` appended. The Finalizer did not append it and
1303 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1304
1305 ``NT_AMD_HSA_METADATA``
1306 Specifies extensible metadata associated with the code objects executed on HSA
1307 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1308 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1309 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1310 metadata string.
1311
1312 ..
table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings 1313 :name: amdgpu-elf-note-record-supported_processors-v2-table 1314 1315 ==================== ========================== 1316 Note Record ISA Name Target ID 1317 ==================== ========================== 1318 ``AMD:AMDGPU:6:0:0`` ``gfx600`` 1319 ``AMD:AMDGPU:6:0:1`` ``gfx601`` 1320 ``AMD:AMDGPU:6:0:2`` ``gfx602`` 1321 ``AMD:AMDGPU:7:0:0`` ``gfx700`` 1322 ``AMD:AMDGPU:7:0:1`` ``gfx701`` 1323 ``AMD:AMDGPU:7:0:2`` ``gfx702`` 1324 ``AMD:AMDGPU:7:0:3`` ``gfx703`` 1325 ``AMD:AMDGPU:7:0:4`` ``gfx704`` 1326 ``AMD:AMDGPU:7:0:5`` ``gfx705`` 1327 ``AMD:AMDGPU:8:0:0`` ``gfx801:xnack-`` 1328 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+`` 1329 ``AMD:AMDGPU:8:0:2`` ``gfx802`` 1330 ``AMD:AMDGPU:8:0:3`` ``gfx803`` 1331 ``AMD:AMDGPU:8:0:5`` ``gfx805`` 1332 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+`` 1333 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-`` 1334 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+`` 1335 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-`` 1336 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+`` 1337 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-`` 1338 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+`` 1339 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-`` 1340 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+`` 1341 ==================== ========================== 1342 1343.. _amdgpu-note-records-v3-v4: 1344 1345Code Object V3 to V4 Note Records 1346~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1347 1348The AMDGPU backend code object uses the following ELF note record in the 1349``.note`` section when compiling for code object V3 to V4. 1350 1351The note record vendor field is "AMDGPU". 1352 1353Additional note records may be present, but any which are not documented here 1354are deprecated and should not be used. 1355 1356 .. 
table:: AMDGPU Code Object V3 to V4 ELF Note Records 1357 :name: amdgpu-elf-note-records-table-v3-v4 1358 1359 ======== ============================== ====================================== 1360 Name Type Description 1361 ======== ============================== ====================================== 1362 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ 1363 binary format. 1364 ======== ============================== ====================================== 1365 1366.. 1367 1368 .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values 1369 :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4 1370 1371 ============================== ===== 1372 Name Value 1373 ============================== ===== 1374 *reserved* 0-31 1375 ``NT_AMDGPU_METADATA`` 32 1376 ============================== ===== 1377 1378``NT_AMDGPU_METADATA`` 1379 Specifies extensible metadata associated with an AMDGPU code object. It is 1380 encoded as a map in the Message Pack [MsgPack]_ binary data format. See 1381 :ref:`amdgpu-amdhsa-code-object-metadata-v3` and 1382 :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the 1383 ``amdhsa`` OS. 1384 1385.. _amdgpu-symbols: 1386 1387Symbols 1388------- 1389 1390Symbols include the following: 1391 1392 .. 
table:: AMDGPU ELF Symbols
1393 :name: amdgpu-elf-symbols-table
1394
1395 ===================== ================== ================ ==================
1396 Name Type Section Description
1397 ===================== ================== ================ ==================
1398 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1399 - ``.rodata``
1400 - ``.bss``
1401 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1402 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1403 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
1404 ===================== ================== ================ ==================
1405
1406 Global variable
1407 Global variables both used and defined by the compilation unit.
1408
1409 If the symbol is defined in the compilation unit then it is allocated in the
1410 appropriate section according to whether it has initialized data or is
1411 read-only.
1412
1413 If the symbol is external then its section is ``STN_UNDEF`` and the loader
1414 will resolve relocations using the definition provided by another code object
1415 or explicitly defined by the runtime.
1416
1417 If the symbol resides in local/group memory (LDS) then its section is the
1418 special processor specific section name ``SHN_AMDGPU_LDS``, and the
1419 ``st_value`` field describes alignment requirements as it does for common
1420 symbols.
1421
1422 .. TODO::
1423
1424 Add description of linked shared object symbols. Seems undefined symbols
1425 are marked as STT_NOTYPE.
1426
1427 Kernel descriptor
1428 Every HSA kernel has an associated kernel descriptor. It is the address of the
1429 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1430 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1431 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1432
1433 Kernel entry point
1434 Every HSA kernel also has a symbol for its machine code entry point.
1435
1436 ..
_amdgpu-relocation-records:
1436
1437 Relocation Records
1438 ------------------
1439
1440 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1441 relocatable fields are:
1442
1443 ``word32``
1444 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1445 alignment. These values use the same byte order as other word values in the
1446 AMDGPU architecture.
1447
1448 ``word64``
1449 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1450 alignment. These values use the same byte order as other word values in the
1451 AMDGPU architecture.
1452
1453 The following notations are used for specifying relocation calculations:
1454
1455 **A**
1456 Represents the addend used to compute the value of the relocatable field.
1457
1458 **G**
1459 Represents the offset into the global offset table at which the relocation
1460 entry's symbol will reside during execution.
1461
1462 **GOT**
1463 Represents the address of the global offset table.
1464
1465 **P**
1466 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
1467 of the storage unit being relocated (computed using ``r_offset``).
1468
1469 **S**
1470 Represents the value of the symbol whose index resides in the relocation
1471 entry. Relocations not using this must specify a symbol index of
1472 ``STN_UNDEF``.
1473
1474 **B**
1475 Represents the base address of a loaded executable or shared object which is
1476 the difference between the ELF address and the actual load address.
1477 Relocations using this are only valid in executable or shared objects.
1478
1479 The following relocation types are supported:
1480
1481 ..
table:: AMDGPU ELF Relocation Records
1482 :name: amdgpu-elf-relocation-records-table
1483
1484 ========================== ======= ===== ========== ==============================
1485 Relocation Type Kind Value Field Calculation
1486 ========================== ======= ===== ========== ==============================
1487 ``R_AMDGPU_NONE`` 0 *none* *none*
1488 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
1489 Dynamic
1490 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
1491 Dynamic
1492 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
1493 Dynamic
1494 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
1495 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
1496 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
1497 Dynamic
1498 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
1499 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1500 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
1501 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1502 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
1503 *reserved* 12
1504 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
1505 ========================== ======= ===== ========== ==============================
1506
1507 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1508 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1509
1510 There is no current OS loader support for 32-bit programs and so
1511 ``R_AMDGPU_ABS32`` is not used.
1512
1513 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1514
1515 Loaded Code Object Path Uniform Resource Identifier (URI)
1516 ---------------------------------------------------------
1517
1518 The AMD GPU code object loader represents the path of the ELF shared object
1519 from which the code object was loaded as a textual Uniform Resource Identifier (URI).
1520 Note that the code object is the in-memory loaded and relocated form of the ELF
1521 shared object. Multiple code objects may be loaded at different memory
1522 addresses in the same process from the same ELF shared object.
1523
1524 The loaded code object path URI syntax is defined by the following BNF syntax:
1525
1526 .. code::
1527
1528 code_object_uri ::== file_uri | memory_uri
1529 file_uri ::== "file://" file_path [ range_specifier ]
1530 memory_uri ::== "memory://" process_id range_specifier
1531 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1532 file_path ::== URI_ENCODED_OS_FILE_PATH
1533 process_id ::== DECIMAL_NUMBER
1534 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1535
1536 **number**
1537 Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1538 and octal values by "0".
1539
1540 **file_path**
1541 Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1542 every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
1543 encoded as two uppercase hexadecimal digits preceded by "%". Directories in
1544 the path are separated by "/".
1545
1546 **offset**
1547 Is a 0-based byte offset to the start of the code object. For a file URI, it
1548 is from the start of the file specified by the ``file_path``, and if omitted
1549 defaults to 0. For a memory URI, it is the memory address and is required.
1550
1551 **size**
1552 Is the number of bytes in the code object. For a file URI, if omitted it
1553 defaults to the size of the file. It is required for a memory URI.
1554
1555 **process_id**
1556 Is the identity of the process owning the memory. For Linux it is the C
1557 unsigned integral decimal literal for the process ID (PID).
1558
1559 For example:
1560
1561 .. code::
1562
1563 file:///dir1/dir2/file1
1564 file:///dir3/dir4/file2#offset=0x2000&size=3000
1565 memory://1234#offset=0x20000&size=3000
1566
1567 ..
_amdgpu-dwarf-debug-information: 1568 1569DWARF Debug Information 1570======================= 1571 1572.. warning:: 1573 1574 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that 1575 is not currently fully implemented and is subject to change. 1576 1577AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see 1578:ref:`amdgpu-elf-code-object`) which contain information that maps the code 1579object executable code and data to the source language constructs. It can be 1580used by tools such as debuggers and profilers. It uses features defined in 1581:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in 1582DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension. 1583 1584This section defines the AMDGPU target architecture specific DWARF mappings. 1585 1586.. _amdgpu-dwarf-register-identifier: 1587 1588Register Identifier 1589------------------- 1590 1591This section defines the AMDGPU target architecture register numbers used in 1592DWARF operation expressions (see DWARF Version 5 section 2.5 and 1593:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information 1594instructions (see DWARF Version 5 section 6.4 and 1595:ref:`amdgpu-dwarf-call-frame-information`). 1596 1597A single code object can contain code for kernels that have different wavefront 1598sizes. The vector registers and some scalar registers are based on the wavefront 1599size. AMDGPU defines distinct DWARF registers for each wavefront size. This 1600simplifies the consumer of the DWARF so that each register has a fixed size, 1601rather than being dynamic according to the wavefront size mode. Similarly, 1602distinct DWARF registers are defined for those registers that vary in size 1603according to the process address size. This allows a consumer to treat a 1604specific AMDGPU processor as a single architecture regardless of how it is 1605configured at run time. 
The compiler explicitly specifies the DWARF registers 1606that match the mode in which the code it is generating will be executed. 1607 1608DWARF registers are encoded as numbers, which are mapped to architecture 1609registers. The mapping for AMDGPU is defined in 1610:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same 1611mapping. 1612 1613.. table:: AMDGPU DWARF Register Mapping 1614 :name: amdgpu-dwarf-register-mapping-table 1615 1616 ============== ================= ======== ================================== 1617 DWARF Register AMDGPU Register Bit Size Description 1618 ============== ================= ======== ================================== 1619 0 PC_32 32 Program Counter (PC) when 1620 executing in a 32-bit process 1621 address space. Used in the CFI to 1622 describe the PC of the calling 1623 frame. 1624 1 EXEC_MASK_32 32 Execution Mask Register when 1625 executing in wavefront 32 mode. 1626 2-15 *Reserved* *Reserved for highly accessed 1627 registers using DWARF shortcut.* 1628 16 PC_64 64 Program Counter (PC) when 1629 executing in a 64-bit process 1630 address space. Used in the CFI to 1631 describe the PC of the calling 1632 frame. 1633 17 EXEC_MASK_64 64 Execution Mask Register when 1634 executing in wavefront 64 mode. 1635 18-31 *Reserved* *Reserved for highly accessed 1636 registers using DWARF shortcut.* 1637 32-95 SGPR0-SGPR63 32 Scalar General Purpose 1638 Registers. 1639 96-127 *Reserved* *Reserved for frequently accessed 1640 registers using DWARF 1-byte ULEB.* 1641 128 STATUS 32 Status Register. 1642 129-511 *Reserved* *Reserved for future Scalar 1643 Architectural Registers.* 1644 512 VCC_32 32 Vector Condition Code Register 1645 when executing in wavefront 32 1646 mode. 1647 513-1023 *Reserved* *Reserved for future Vector 1648 Architectural Registers when 1649 executing in wavefront 32 mode.* 1650 768 VCC_64 64 Vector Condition Code Register 1651 when executing in wavefront 64 1652 mode. 
1653 769-1023 *Reserved* *Reserved for future Vector 1654 Architectural Registers when 1655 executing in wavefront 64 mode.* 1656 1024-1087 *Reserved* *Reserved for padding.* 1657 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers. 1658 1130-1535 *Reserved* *Reserved for future Scalar 1659 General Purpose Registers.* 1660 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers 1661 when executing in wavefront 32 1662 mode. 1663 1792-2047 *Reserved* *Reserved for future Vector 1664 General Purpose Registers when 1665 executing in wavefront 32 mode.* 1666 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers 1667 when executing in wavefront 32 1668 mode. 1669 2304-2559 *Reserved* *Reserved for future Vector 1670 Accumulation Registers when 1671 executing in wavefront 32 mode.* 1672 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers 1673 when executing in wavefront 64 1674 mode. 1675 2816-3071 *Reserved* *Reserved for future Vector 1676 General Purpose Registers when 1677 executing in wavefront 64 mode.* 1678 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers 1679 when executing in wavefront 64 1680 mode. 1681 3328-3583 *Reserved* *Reserved for future Vector 1682 Accumulation Registers when 1683 executing in wavefront 64 mode.* 1684 ============== ================= ======== ================================== 1685 1686The vector registers are represented as the full size for the wavefront. They 1687are organized as consecutive dwords (32-bits), one per lane, with the dword at 1688the least significant bit position corresponding to lane 0 and so forth. DWARF 1689location expressions involving the ``DW_OP_LLVM_offset`` and 1690``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector 1691register corresponding to the lane that is executing the current thread of 1692execution in languages that are implemented using a SIMD or SIMT execution 1693model. 
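Because every register occupies a fixed number range, the mapping in :ref:`amdgpu-dwarf-register-mapping-table` can be applied with a simple range lookup. The following is an illustrative sketch only (the function name and data layout are invented for this example):

```python
# Fixed register numbers and ranges taken from the AMDGPU DWARF Register
# Mapping table above.
_FIXED = {0: "PC_32", 1: "EXEC_MASK_32", 16: "PC_64", 17: "EXEC_MASK_64",
          128: "STATUS", 512: "VCC_32", 768: "VCC_64"}
_RANGES = [
    (32, 95, "SGPR", 0),       # SGPR0-SGPR63
    (1088, 1129, "SGPR", 64),  # SGPR64-SGPR105
    (1536, 1791, "VGPR", 0),   # VGPR0-VGPR255, wavefront 32 mode
    (2048, 2303, "AGPR", 0),   # AGPR0-AGPR255, wavefront 32 mode
    (2560, 2815, "VGPR", 0),   # VGPR0-VGPR255, wavefront 64 mode
    (3072, 3327, "AGPR", 0),   # AGPR0-AGPR255, wavefront 64 mode
]

def amdgpu_reg_name(dwarf_reg: int) -> str:
    """Map a DWARF register number to its AMDGPU register name."""
    if dwarf_reg in _FIXED:
        return _FIXED[dwarf_reg]
    for lo, hi, prefix, base in _RANGES:
        if lo <= dwarf_reg <= hi:
            return f"{prefix}{base + dwarf_reg - lo}"
    return "reserved"
```

Note that the wavefront 32 and wavefront 64 ranges map to the same architectural register names; only the DWARF register bit size differs between the two modes.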

If the wavefront size is 32 lanes then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 mode. The register definitions corresponding
to the wavefront mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is generated
to execute in a 64-bit process address space, then the 64-bit process address
space register definitions are used. The ``amdgcn`` target only supports the
64-bit process address space.

.. _amdgpu-dwarf-address-class-identifier:

Address Class Identifier
------------------------

The DWARF address class represents the source language memory space. See DWARF
Version 5 section 2.12 which is updated by the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address class mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-class-mapping-table`.

.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

The DWARF address class values defined in the *DWARF Extensions For
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.

In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
available for the AMD extension for access to the hardware GDS memory, which is
scratchpad memory allocated per device.

For AMDGPU, if no ``DW_AT_address_class`` attribute is present, then the default
address class of ``DW_ADDR_none`` is used.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address size
and NULL value.

.. _amdgpu-dwarf-address-space-identifier:

Address Space Identifier
------------------------

DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.

The DWARF address space mapping used for AMDGPU is defined in
:ref:`amdgpu-dwarf-address-space-mapping-table`.

.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                         AMDGPU                             Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
   *Reserved*                              0x04
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
   ======================================= ===== ======= ======== ================= =======================

See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
including address size and NULL value.

The ``DW_ASPACE_none`` address space is the default target architecture address
space used in DWARF operations that do not specify an address space. It
therefore has to map to the global address space so that the ``DW_OP_addr*`` and
related operations can refer to addresses in the program code.

The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
specify the flat address space. If the address corresponds to an address in the
local address space, then it corresponds to the wavefront that is executing the
focused thread of execution.
If the address corresponds to an address in the
private address space, then it corresponds to the lane that is executing the
focused thread of execution for languages that are implemented using a SIMD or
SIMT execution model.

.. note::

  CUDA-like languages such as HIP that do not have address spaces in the
  language type system, but do allow variables to be allocated in different
  address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
  address space in the DWARF expression operations as the default address space
  is the global address space.

The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
specify the local address space corresponding to the wavefront that is executing
the focused thread of execution.

The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
to specify the private address space corresponding to the lane that is executing
the focused thread of execution for languages that are implemented using a SIMD
or SIMT execution model.

The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
to specify the unswizzled private address space corresponding to the wavefront
that is executing the focused thread of execution. The wavefront view of private
memory is the per wavefront unswizzled backing memory layout defined in
:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
location for the backing memory of the wavefront (namely the address is not
offset by ``wavefront-scratch-base``).
The following formula can be used to
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
``DW_ASPACE_AMDGPU_private_wave`` address.

.. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, onto which a source language maps its
threads of execution. The DWARF lane identifier is pushed by
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
section :ref:`amdgpu-dwarf-operation-expressions`.
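The lane identifier is exactly the ``wavefront-lane-id`` term in the private
address conversion formula of the previous section. That conversion can be
sketched in Python (illustrative only; the function name is invented):

```python
def private_lane_to_wave(private_address_lane: int,
                         wavefront_lane_id: int,
                         wavefront_size: int) -> int:
    """Convert a DW_ASPACE_AMDGPU_private_lane address to a
    DW_ASPACE_AMDGPU_private_wave address, following the formula in the
    Address Space Identifier section (sketch, not an LLVM API)."""
    return ((private_address_lane // 4) * wavefront_size * 4
            + wavefront_lane_id * 4
            + (private_address_lane % 4))

# For a dword-aligned lane address and lane 0, the formula collapses to
# private_address_lane * wavefront_size, matching the simplified form.
assert private_lane_to_wave(8, 0, 64) == 8 * 64
```

This models the swizzled layout: consecutive dwords of one lane's private
memory are spread ``wavefront-size * 4`` bytes apart in the unswizzled backing
memory.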

For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.

Operation Expressions
---------------------

DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
:ref:`amdgpu-dwarf-operation-expressions`.

DWARF location descriptions describe how to access storage which includes memory
and registers. When accessing storage on AMDGPU, bytes are ordered with least
significant bytes first, and bits are ordered within bytes with least
significant bits first.

For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
unwinding vector registers that are spilled under the execution mask to memory:
the zero-single location description is the vector register, and the one-single
location description is the spilled memory location description. The
``DW_OP_LLVM_form_aspace_address`` operation is used to specify the address
space of the memory location description.

In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.

Debugger Information Entry Attributes
-------------------------------------

This section describes how certain debugger information entry attributes are
used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated
by *DWARF Extensions For Heterogeneous Debugging* section
:ref:`amdgpu-dwarf-debugging-information-entry-attributes`.

.. _amdgpu-dwarf-dw-at-llvm-lane-pc:

``DW_AT_LLVM_lane_pc``
~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
location of the separate lanes of a SIMT thread.

If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
``DW_AT_LLVM_lane_pc``.

The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested ``IF/THEN/ELSE`` structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.

.. code::
  :number-lines:

  SUBPROGRAM X
  BEGIN
    a;
    IF (c1) THEN
      b;
      IF (c2) THEN
        c;
      ELSE
        d;
      ENDIF
      e;
    ELSE
      f;
    ENDIF
    g;
  END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of
the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE``
region the ``EXEC`` mask is restored to the value it had at the beginning of the
region. This is shown below. Other approaches are possible, but the basic
concept is the same.

.. code::
  :number-lines:

  $lex_start:
    a;
    %1 = EXEC
    %2 = c1
  $lex_1_start:
    EXEC = %1 & %2
  $lex_1_then:
    b;
    %3 = EXEC
    %4 = c2
  $lex_1_1_start:
    EXEC = %3 & %4
  $lex_1_1_then:
    c;
    EXEC = ~EXEC & %3
  $lex_1_1_else:
    d;
    EXEC = %3
  $lex_1_1_end:
    e;
    EXEC = ~EXEC & %1
  $lex_1_else:
    f;
    EXEC = %1
  $lex_1_end:
    g;
  $lex_end:

To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
pseudo instruction can be used to annotate the linearized control flow. This can
be done by defining an artificial variable for the lane PC. The DWARF location
list expression created for it is used as the value of the
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.

A DWARF procedure is defined for each well nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
conceptually inherits the value of the immediately enclosing region and modifies
it according to the semantics of the region.

For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
the region for the ``THEN`` region since it is executed first.
For the ``ELSE``
region the divergent program location is at the end of the ``IF/THEN/ELSE``
region since the ``THEN`` region has completed.

The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region's DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the result
by inserting the current program location for each lane that the ``EXEC`` mask
indicates is active.

By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the DWARF
operation expressions.

The following provides an example using pseudo LLVM MIR.

.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    b;
    %3 = EXEC;
    DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
    %4 = c2;
  $lex_1_1_start:
    EXEC = %3 & %4;
  $lex_1_1_then:
    DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_then";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_start;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    c;
    EXEC = ~EXEC & %3;
  $lex_1_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_addrx &lex_1_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    d;
    EXEC = %3;
  $lex_1_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_call_ref %__active_lane_pc;
    ];
    e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
    DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc_1_else";
      DW_AT_location = DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_addrx &lex_1_end;
        DW_OP_stack_value;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_call_ref %__lex_1_save_exec;
        DW_OP_deref_type 64, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_else;
      DW_OP_call_ref %__active_lane_pc;
    ];
    f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:

The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
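The per-lane selection that ``DW_OP_LLVM_select_bit_piece`` performs in these
expressions (choosing, for each lane, between an inherited divergent lane PC
and the current PC according to a mask) can be modeled with a small Python
sketch. This is an illustration of the concept only, not the DWARF definition;
all names are invented, and a tiny 4-lane wavefront is used for readability:

```python
def select_bit_piece(mask: int, zero_vec, one_vec):
    """Model a per-lane select: lane i takes one_vec[i] if bit i of mask
    is set, else zero_vec[i] (conceptual sketch of
    DW_OP_LLVM_select_bit_piece over lane-sized pieces)."""
    return [one_vec[i] if (mask >> i) & 1 else zero_vec[i]
            for i in range(len(zero_vec))]

WAVEFRONT_SIZE = 4  # small for illustration; AMDGPU uses 32 or 64

# Divergent lane PCs inherited from the enclosing region; None models the
# undefined location description for lanes never active in the subprogram.
divergent_pc = [None] * WAVEFRONT_SIZE

# The active-lane procedure overwrites the lanes EXEC marks active with the
# current program location.
exec_mask = 0b0101          # lanes 0 and 2 are active
current_pc = 0x1000
lane_pc = select_bit_piece(exec_mask, divergent_pc,
                           [current_pc] * WAVEFRONT_SIZE)
assert lane_pc == [0x1000, None, 0x1000, None]
```

Nesting works the same way: each region's procedure applies this select with
its saved entry mask, so inactive lanes keep the value computed by the
enclosing region.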

An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
``IF/THEN/ELSE`` regions.

The DWARF procedures can use the active lane artificial variable described in
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
``EXEC`` mask in order to support whole or quad wavefront mode.

.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:

``DW_AT_LLVM_active_lane``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
update it to enable the necessary lanes, perform the operations, and then
restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

``DW_AT_LLVM_augmentation``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
debugger information entry has the following value for the augmentation string:

::

  [amdgpu:v0.0]

The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit.
The version number
conforms to [SEMVER]_.

Call Frame Information
----------------------

DWARF Call Frame Information (CFI) describes how a consumer can virtually
*unwind* call frames in a running process or core dump. See DWARF Version 5
section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.

For AMDGPU, the Common Information Entry (CIE) fields have the following values:

1. ``augmentation`` string contains the following null-terminated UTF-8 string:

   ::

     [amd:v0.0]

   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE or the FDEs that use it. The version number
   conforms to [SEMVER]_.

2. ``address_size`` for the ``Global`` address space is defined in
   :ref:`amdgpu-dwarf-address-space-identifier`.

3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.

4. ``code_alignment_factor`` is 4 bytes.

   .. TODO::

     Add to :ref:`amdgpu-processor-table` table.

5. ``data_alignment_factor`` is 4 bytes.

   .. TODO::

     Add to :ref:`amdgpu-processor-table` table.

6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
   for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.

7. ``initial_instructions``: Since a subprogram X with fewer registers can be
   called from subprogram Y that has more allocated, X will not change any of
   the extra registers as it cannot access them. Therefore, the default rule
   for all columns is ``same value``.

For AMDGPU the register number follows the numbering defined in
:ref:`amdgpu-dwarf-register-identifier`.

For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.
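The CIE ``augmentation`` string above and the other ``[vendor:vX.Y]``
augmentation strings in this document share one shape, so a consumer can parse
them uniformly. A hedged Python sketch (the helper name is invented; a real
consumer would first strip the terminating null or padding bytes):

```python
import re

def parse_augmentation(s: str):
    """Parse an augmentation string of the form '[vendor:vX.Y]' into
    (vendor, major, minor). Raises ValueError on malformed input."""
    m = re.fullmatch(r"\[([a-z]+):v(\d+)\.(\d+)\]", s)
    if m is None:
        raise ValueError(f"malformed augmentation string: {s!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))

assert parse_augmentation("[amd:v0.0]") == ("amd", 0, 0)
assert parse_augmentation("[amdgpu:v0.0]") == ("amdgpu", 0, 0)
```

Since the version numbers conform to [SEMVER]_, a consumer can accept any minor
version within a major version it understands.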

Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This is different from the DWARF Version 5 definition that requires the
    first 4 characters to be the vendor ID. But this is consistent with the
    other augmentation strings and does allow multiple vendor contributions.
    However, backwards compatibility may be more desirable.

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table:

``address_size`` (ubyte)

  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.

.. TODO::

  Should the ``isa`` state machine register be used to indicate if the code is
  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

 .. table:: AMDGPU Clang Debug Options
    :name: amdgpu-clang-debug-options-table

    ==================== ==================================================
    Debug Flag           Description
    ==================== ==================================================
    -g[no-]embed-source  Enable/disable embedding source text in DWARF
                         debug sections. Useful for environments where
                         source cannot be written to disk, such as
                         when performing online compilation.
    ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
  the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
:ref:`amdgpu-amdhsa-code-object-metadata-v4`.

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries, for
example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included, for
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V2 is not the default code object version emitted by this version
  of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

  Is the string null terminated? It probably should not be if YAML allows it to
  contain null characters; otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call.

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of the printf function
                                           call minus 1.

                                         ``S[i]`` (where i = 0, 1, ..., N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call.

                                         FormatString
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the
                                                mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.
     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================

..

  .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string                   Kernel argument name.
     "TypeName"        string                   Kernel argument type name.
     "Size"            integer        Required  Kernel argument size in bytes.
     "Align"           integer        Required  Kernel argument alignment in
                                                bytes. Must be a power of two.
     "ValueKind"       string         Required  Kernel argument kind that
                                                specifies how to set up the
                                                corresponding argument.
                                                Values include:

                                                "ByValue"
                                                  The argument is copied
                                                  directly into the kernarg.

                                                "GlobalBuffer"
                                                  A global address space pointer
                                                  to the buffer data is passed
                                                  in the kernarg.

                                                "DynamicSharedPointer"
                                                  A group address space pointer
                                                  to dynamically allocated LDS
                                                  is passed in the kernarg.

                                                "Sampler"
                                                  A global address space
                                                  pointer to a S# is passed in
                                                  the kernarg.

                                                "Image"
                                                  A global address space
                                                  pointer to a T# is passed in
                                                  the kernarg.

                                                "Pipe"
                                                  A global address space pointer
                                                  to an OpenCL pipe is passed in
                                                  the kernarg.

                                                "Queue"
                                                  A global address space pointer
                                                  to an OpenCL device enqueue
                                                  queue is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetX"
                                                  The OpenCL grid dispatch
                                                  global offset for the X
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetY"
                                                  The OpenCL grid dispatch
                                                  global offset for the Y
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenGlobalOffsetZ"
                                                  The OpenCL grid dispatch
                                                  global offset for the Z
                                                  dimension is passed in the
                                                  kernarg.

                                                "HiddenNone"
                                                  An argument that is not used
                                                  by the kernel. Space needs to
                                                  be left for it, but it does
                                                  not need to be set up.

                                                "HiddenPrintfBuffer"
                                                  A global address space pointer
                                                  to the runtime printf buffer
                                                  is passed in the kernarg.

                                                "HiddenHostcallBuffer"
                                                  A global address space pointer
                                                  to the runtime hostcall buffer
                                                  is passed in the kernarg.

                                                "HiddenDefaultQueue"
                                                  A global address space pointer
                                                  to the OpenCL device enqueue
                                                  queue that should be used by
                                                  the kernel by default is
                                                  passed in the kernarg.

                                                "HiddenCompletionAction"
                                                  A global address space pointer
                                                  to help link enqueued kernels
                                                  into the ancestor tree for
                                                  determining when the parent
                                                  kernel has finished.

                                                "HiddenMultiGridSyncArg"
                                                  A global address space pointer
                                                  for multi-grid synchronization
                                                  is passed in the kernarg.

     "ValueType"       string                   Unused and deprecated. This
                                                should no longer be emitted,
                                                but is accepted for
                                                compatibility.
     "PointeeAlign"    integer                  Alignment in bytes of pointee
                                                type for pointer type kernel
                                                argument. Must be a power
                                                of 2. Only present if
                                                "ValueKind" is
                                                "DynamicSharedPointer".
     "AddrSpaceQual"   string                   Kernel argument address space
                                                qualifier. Only present if
                                                "ValueKind" is "GlobalBuffer" or
                                                "DynamicSharedPointer". Values
                                                are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO::

                                                  Is GlobalBuffer only Global
                                                  or Constant? Is
                                                  DynamicSharedPointer always
                                                  Local? Can HCC allow Generic?
                                                  How can Private or Region
                                                  ever happen?

     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO::

                                                  Does this apply to
                                                  GlobalBuffer?

     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".
     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".
     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO::

                                                  Can GlobalBuffer be pipe
                                                  qualified?

     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX10. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPRs added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX10.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V3 is not the default code object version emitted by this version
  of LLVM.

Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-v4`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*.", where
``vendor-name`` can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call.

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of the printf function
                                                  call minus 1.

                                                ``S[i]`` (where i = 0, 1, ..., N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call.

                                                FormatString
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================

..

  .. table:: AMDHSA Code Object V3 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3

     =================================== ============== ========= ================================
     String Key                          Value Type     Required? Description
     =================================== ============== ========= ================================
     ".name"                             string         Required  Source name of the kernel.
     ".symbol"                           string         Required  Name of the kernel
                                                                  descriptor ELF symbol.
     ".language"                         string                   Source language of the kernel.
                                                                  Values include:

                                                                  - "OpenCL C"
                                                                  - "OpenCL C++"
                                                                  - "HCC"
                                                                  - "HIP"
                                                                  - "OpenMP"
                                                                  - "Assembler"

     ".language_version"                 sequence of              - The first integer is the major
                                         2 integers                 version.
                                                                  - The second integer is the
                                                                    minor version.
     ".args"                             sequence of              Sequence of maps of the
                                         map                      kernel arguments. See
                                                                  :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
                                                                  for the definition of the keys
                                                                  included in that map.
     ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
                                         3 integers               must be >=1 and the dispatch
                                                                  work-group size X, Y, Z must
                                                                  correspond to the specified
                                                                  values. Defaults to 0, 0, 0.

                                                                  Corresponds to the OpenCL
                                                                  ``reqd_work_group_size``
                                                                  attribute.
     ".workgroup_size_hint"              sequence of              The dispatch work-group size
                                         3 integers               X, Y, Z is likely to be the
                                                                  specified values.

                                                                  Corresponds to the OpenCL
                                                                  ``work_group_size_hint``
                                                                  attribute.
     ".vec_type_hint"                    string                   The name of a scalar or vector
                                                                  type.

                                                                  Corresponds to the OpenCL
                                                                  ``vec_type_hint`` attribute.
     ".device_enqueue_symbol"            string                   The external symbol name
                                                                  associated with a kernel.
                                                                  OpenCL runtime allocates a
                                                                  global buffer for the symbol
                                                                  and saves the kernel's address
                                                                  to it, which is used for
                                                                  device side enqueueing. Only
                                                                  available for device side
                                                                  enqueued kernels.
     ".kernarg_segment_size"             integer        Required  The size in bytes of
                                                                  the kernarg segment
                                                                  that holds the values
                                                                  of the arguments to
                                                                  the kernel.
     ".group_segment_fixed_size"         integer        Required  The amount of group
                                                                  segment memory
                                                                  required by a
                                                                  work-group in
                                                                  bytes. This does not
                                                                  include any
                                                                  dynamically allocated
                                                                  group segment memory
                                                                  that may be added
                                                                  when the kernel is
                                                                  dispatched.
     ".private_segment_fixed_size"       integer        Required  The amount of fixed
                                                                  private address space
                                                                  memory required for a
                                                                  work-item in
                                                                  bytes. If the kernel
                                                                  uses a dynamic call
                                                                  stack then additional
                                                                  space must be added
                                                                  to this value for the
                                                                  call stack.
     ".kernarg_segment_align"            integer        Required  The maximum byte
                                                                  alignment of
                                                                  arguments in the
                                                                  kernarg segment. Must
                                                                  be a power of 2.
     ".wavefront_size"                   integer        Required  Wavefront size. Must
                                                                  be a power of 2.
     ".sgpr_count"                       integer        Required  Number of scalar
                                                                  registers required by a
                                                                  wavefront for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly. This
                                                                  includes the special
                                                                  SGPRs for VCC, Flat
                                                                  Scratch (GFX7-GFX9)
                                                                  and XNACK (for
                                                                  GFX8-GFX9). It does
                                                                  not include the 16
                                                                  SGPRs added if a trap
                                                                  handler is
                                                                  enabled. It is not
                                                                  rounded up to the
                                                                  allocation
                                                                  granularity.
     ".vgpr_count"                       integer        Required  Number of vector
                                                                  registers required by
                                                                  each work-item for
                                                                  GFX6-GFX9. A register
                                                                  is required if it is
                                                                  used explicitly, or
                                                                  if a higher numbered
                                                                  register is used
                                                                  explicitly.
     ".max_flat_workgroup_size"          integer        Required  Maximum flat
                                                                  work-group size
                                                                  supported by the
                                                                  kernel in work-items.
                                                                  Must be >=1 and
                                                                  consistent with
                                                                  ReqdWorkGroupSize if
                                                                  not 0, 0, 0.
     ".sgpr_spill_count"                 integer                  Number of stores from
                                                                  a scalar register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     ".vgpr_spill_count"                 integer                  Number of stores from
                                                                  a vector register to
                                                                  a register allocator
                                                                  created spill
                                                                  location.
     =================================== ============== ========= ================================

..

  .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

     ====================== ============== ========= ================================
     String Key             Value Type     Required? Description
     ====================== ============== ========= ================================
     ".name"                string                   Kernel argument name.
     ".type_name"           string                   Kernel argument type name.
     ".size"                integer        Required  Kernel argument size in bytes.
     ".offset"              integer        Required  Kernel argument offset in
                                                     bytes. The offset must be a
                                                     multiple of the alignment
                                                     required by the argument.
     ".value_kind"          string         Required  Kernel argument kind that
                                                     specifies how to set up the
                                                     corresponding argument.
                                                     Values include:

                                                     "by_value"
                                                       The argument is copied
                                                       directly into the kernarg.

                                                     "global_buffer"
                                                       A global address space
                                                       pointer to the buffer data
                                                       is passed in the kernarg.

                                                     "dynamic_shared_pointer"
                                                       A group address space
                                                       pointer to dynamically
                                                       allocated LDS is passed in
                                                       the kernarg.

                                                     "sampler"
                                                       A global address space
                                                       pointer to a S# is passed
                                                       in the kernarg.

                                                     "image"
                                                       A global address space
                                                       pointer to a T# is passed
                                                       in the kernarg.

                                                     "pipe"
                                                       A global address space
                                                       pointer to an OpenCL pipe
                                                       is passed in the kernarg.

                                                     "queue"
                                                       A global address space
                                                       pointer to an OpenCL
                                                       device enqueue queue is
                                                       passed in the kernarg.

                                                     "hidden_global_offset_x"
                                                       The OpenCL grid dispatch
                                                       global offset for the X
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_y"
                                                       The OpenCL grid dispatch
                                                       global offset for the Y
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_global_offset_z"
                                                       The OpenCL grid dispatch
                                                       global offset for the Z
                                                       dimension is passed in the
                                                       kernarg.

                                                     "hidden_none"
                                                       An argument that is not
                                                       used by the kernel. Space
                                                       needs to be left for it,
                                                       but it does not need to be
                                                       set up.

                                                     "hidden_printf_buffer"
                                                       A global address space
                                                       pointer to the runtime
                                                       printf buffer is passed in
                                                       the kernarg.

                                                     "hidden_hostcall_buffer"
                                                       A global address space
                                                       pointer to the runtime
                                                       hostcall buffer is passed
                                                       in the kernarg.

                                                     "hidden_default_queue"
                                                       A global address space
                                                       pointer to the OpenCL
                                                       device enqueue queue that
                                                       should be used by the
                                                       kernel by default is
                                                       passed in the kernarg.

                                                     "hidden_completion_action"
                                                       A global address space
                                                       pointer to help link
                                                       enqueued kernels into the
                                                       ancestor tree for
                                                       determining when the
                                                       parent kernel has
                                                       finished.

                                                     "hidden_multigrid_sync_arg"
                                                       A global address space
                                                       pointer for multi-grid
                                                       synchronization is passed
                                                       in the kernarg.

     ".value_type"          string                   Unused and deprecated. This
                                                     should no longer be emitted,
                                                     but is accepted for
                                                     compatibility.
     ".pointee_align"       integer                  Alignment in bytes of
                                                     pointee type for pointer
                                                     type kernel argument. Must
                                                     be a power of 2. Only
                                                     present if ".value_kind" is
                                                     "dynamic_shared_pointer".
     ".address_space"       string                   Kernel argument address
                                                     space qualifier. Only
                                                     present if ".value_kind" is
                                                     "global_buffer" or
                                                     "dynamic_shared_pointer".
                                                     Values are:

                                                     - "private"
                                                     - "global"
                                                     - "constant"
                                                     - "local"
                                                     - "generic"
                                                     - "region"

                                                     .. TODO::

                                                       Is "global_buffer" only
                                                       "global" or "constant"?
                                                       Is
                                                       "dynamic_shared_pointer"
                                                       always "local"? Can HCC
                                                       allow "generic"? How can
                                                       "private" or "region"
                                                       ever happen?

     ".access"              string                   Kernel argument access
                                                     qualifier. Only present if
                                                     ".value_kind" is "image" or
                                                     "pipe". Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

                                                     .. TODO::

                                                       Does this apply to
                                                       "global_buffer"?

     ".actual_access"       string                   The actual memory accesses
                                                     performed by the kernel on
                                                     the kernel argument. Only
                                                     present if ".value_kind" is
                                                     "global_buffer", "image", or
                                                     "pipe". This may be more
                                                     restrictive than indicated
                                                     by ".access" to reflect what
                                                     the kernel actually does. If
                                                     not present then the runtime
                                                     must assume what is implied
                                                     by ".access" and
                                                     ".is_const". Values are:

                                                     - "read_only"
                                                     - "write_only"
                                                     - "read_write"

     ".is_const"            boolean                  Indicates if the kernel
                                                     argument is const qualified.
                                                     Only present if
                                                     ".value_kind" is
                                                     "global_buffer".
     ".is_restrict"         boolean                  Indicates if the kernel
                                                     argument is restrict
                                                     qualified. Only present if
                                                     ".value_kind" is
                                                     "global_buffer".
     ".is_volatile"         boolean                  Indicates if the kernel
                                                     argument is volatile
                                                     qualified. Only present if
                                                     ".value_kind" is
                                                     "global_buffer".
     ".is_pipe"             boolean                  Indicates if the kernel
                                                     argument is pipe qualified.
                                                     Only present if
                                                     ".value_kind" is "pipe".

                                                     .. TODO::

                                                       Can "global_buffer" be
                                                       pipe qualified?

     ====================== ============== ========= ================================

.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the
                                                syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================

..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was
   loaded by an HSA compatible runtime on the kernel agent with which the AQL
   queue is associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS), which is
automatically allocated when the hardware creates work-groups of wavefronts and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using a buffer instruction with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.

The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch),
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address, the base address of
the corresponding aperture can be used. For GFX7-GFX8 these are available in
the :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained
with the Queue Ptr SGPR (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For GFX9-GFX10 the
aperture base addresses are directly available as inline constant registers
``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit address
mode the aperture sizes are 2^32 bytes and the bases are aligned to 2^32, which
makes it easy to convert between flat and segment addresses.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both
the CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and is subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP such as those managing the allocation of scratch
memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the Kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.

  .. table:: Code Object V3 Kernel Descriptor
     :name: amdgpu-amdhsa-kernel-descriptor-v3-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes.
                                                     Additional space may need
                                                     to be added to this value
                                                     if the call stack has
                                                     non-inlined function
                                                     calls.
     95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
                                                     memory pointed to by the
                                                     AQL dispatch packet. The
                                                     kernarg memory is used to
                                                     pass arguments to the
                                                     kernel.

                                                     * If the kernarg pointer
                                                       in the dispatch packet
                                                       is NULL then there are
                                                       no kernel arguments.
                                                     * If the kernarg pointer
                                                       in the dispatch packet
                                                       is not NULL and this
                                                       value is 0 then the
                                                       kernarg memory size is
                                                       unspecified.
                                                     * If the kernarg pointer
                                                       in the dispatch packet
                                                       is not NULL and this
                                                       value is not 0 then the
                                                       value specifies the
                                                       kernarg memory size in
                                                       bytes. It is
                                                       recommended to provide
                                                       a value as it may be
                                                       used by CP to optimize
                                                       making the kernarg
                                                       memory visible to the
                                                       kernel code.

     127:96  4 bytes                                 Reserved, must be 0.
     191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     351:192 20                                      Reserved, must be 0.
             bytes
     383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Compute Shader (CS)
                                                       program settings used
                                                       by CP to set up
                                                       ``COMPUTE_PGM_RSRC3``
                                                       configuration
                                                       register. See
                                                       :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
     415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
                     _BUFFER                         SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and match the value in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
     450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
     451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
     452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
     453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
     454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
                     _SIZE
     457:455 3 bits                                  Reserved, must be 0.
     458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute in
                                                         wavefront size 64
                                                         mode.
                                                       - If 1 execute in
                                                         native wavefront
                                                         size 32 mode.
     463:459 5 bits                                  Reserved, must be 0.
     464     1 bit   RESERVED_464                    Deprecated, must be 0.
     467:465 3 bits                                  Reserved, must be 0.
     468     1 bit   RESERVED_468                    Deprecated, must be 0.
     471:469 3 bits                                  Reserved, must be 0.
     511:472 5 bytes                                 Reserved, must be 0.
     512     **Total size 64 bytes.**
     ======= ====================================================================

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
                                                     blocks used by each
                                                     work-item; granularity is
                                                     device specific:

                                                     GFX6-GFX9
                                                       - vgprs_used 0..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 64)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 4) - 1)
                                                     GFX10 (wavefront size 32)
                                                       - max_vgpr 1..256
                                                       - max(0, ceil(vgprs_used / 8) - 1)

                                                     Where vgprs_used is
                                                     defined as the highest
                                                     VGPR number explicitly
                                                     referenced plus one.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_vgpr`
                                                     nested directive (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
                                                     blocks used by a
                                                     wavefront; granularity is
                                                     device specific:

                                                     GFX6-GFX8
                                                       - sgprs_used 0..112
                                                       - max(0, ceil(sgprs_used / 8) - 1)
                                                     GFX9
                                                       - sgprs_used 0..112
                                                       - 2 * max(0, ceil(sgprs_used / 16) - 1)
                                                     GFX10
                                                       Reserved, must be 0.
                                                       (128 SGPRs always
                                                       allocated.)

                                                     Where sgprs_used is
                                                     defined as the highest
                                                     SGPR number explicitly
                                                     referenced plus one, plus
                                                     a target specific number
                                                     of additional special
                                                     SGPRs for VCC,
                                                     FLAT_SCRATCH (GFX7+) and
                                                     XNACK_MASK (GFX8+), and
                                                     any additional target
                                                     specific limitations. It
                                                     does not include the 16
                                                     SGPRs added if a trap
                                                     handler is enabled.

                                                     The target specific
                                                     limitations and special
                                                     SGPR layout are defined
                                                     in the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     `.amdhsa_kernel` directive
                                                     by the
                                                     `.amdhsa_next_free_sgpr`
                                                     and `.amdhsa_reserve_*`
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified
                                                     priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts
                                                     execution with the
                                                     specified rounding mode
                                                     for single (32-bit)
                                                     precision floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined
                                                     in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts
                                                     execution with the
                                                     specified rounding mode
                                                     for half/double (16 and
                                                     64-bit) precision
                                                     floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined
                                                     in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts
                                                     execution with the
                                                     specified denorm mode for
                                                     single (32-bit) precision
                                                     floating point
                                                     operations.

                                                     Floating point denorm
                                                     mode values are defined
                                                     in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts
                                                     execution with the
                                                     specified denorm mode for
                                                     half/double (16 and
                                                     64-bit) precision
                                                     floating point
                                                     operations.

                                                     Floating point denorm
                                                     mode values are defined
                                                     in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts
                                                     execution with DX10 clamp
                                                     mode enabled. Used by the
                                                     vector ALU to force DX10
                                                     style treatment of NaNs
                                                     (when set, clamp NaN to
                                                     zero, otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts
                                                     execution with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10
                                                     and max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group
                                                     allowed to execute on a
                                                     compute unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9-GFX10
                                                       Wavefront starts
                                                       execution with the
                                                       specified fp16
                                                       overflow mode.

                                                       - If 0, fp16 overflow
                                                         generates +/-INF
                                                         values.
                                                       - If 1, fp16 overflow
                                                         that is the result
                                                         of a +/-INF input
                                                         value or divide by 0
                                                         produces a +/-INF,
                                                         otherwise clamps the
                                                         computed overflow to
                                                         +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     28:27   2 bits                                  Reserved, must be 0.
     29      1 bit   WGP_MODE                        GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute
                                                         work-groups in CU
                                                         wavefront execution
                                                         mode.
                                                       - If 1 execute
                                                         work-groups in WGP
                                                         wavefront execution
                                                         mode.

                                                     See :ref:`amdgpu-amdhsa-memory-model`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.WGP_MODE``.
     30      1 bit   MEM_ORDERED                     GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       Controls the behavior
                                                       of the s_waitcnt's
                                                       vmcnt and vscnt
                                                       counters.

                                                       - If 0 vmcnt reports
                                                         completion of load
                                                         and atomic with
                                                         return out of order
                                                         with sample
                                                         instructions, and
                                                         the vscnt reports
                                                         the completion of
                                                         store and atomic
                                                         without return in
                                                         order.
                                                       - If 1 vmcnt reports
                                                         completion of load,
                                                         atomic with return
                                                         and sample
                                                         instructions in
                                                         order, and the vscnt
                                                         reports the
                                                         completion of store
                                                         and atomic without
                                                         return in order.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
     31      1 bit   FWD_PROGRESS                    GFX6-GFX9
                                                       Reserved, must be 0.
                                                     GFX10
                                                       - If 0 execute SIMD
                                                         wavefronts using
                                                         oldest first policy.
                                                       - If 1 execute SIMD
                                                         wavefronts to ensure
                                                         wavefronts will make
                                                         some forward
                                                         progress.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
     32      **Total size 4 bytes**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
                     _WAVEFRONT_OFFSET               SGPR wavefront scratch
                                                     offset system register
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number
                                                     must match the number of
                                                     user data registers
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed
                                                     a trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the
                                                     X dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the
                                                     Y dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the
                                                     Z dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers
                                                     used for the work-item
                                                     ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts
                                                     execution with address
                                                     watch exceptions enabled
                                                     which are generated when
                                                     L1 has witnessed a thread
                                                     access an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts
                                                     execution with memory
                                                     violation exceptions
                                                     enabled which are
                                                     generated when a memory
                                                     violation has occurred
                                                     for this wavefront from
                                                     L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated
                                                     group segment memory. CP
                                                     writes directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for
                                                     each work-group.
                                                     Granularity is device
                                                     specific:

                                                     GFX6:
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX10:
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts
                     _INVALID_OPERATION              execution with specified
                                                     exceptions enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32
                                                     instruction only)
     31      1 bit                                   Reserved, must be 0.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc3 for GFX10
     :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs
                                                     for wavefront size 64.
                                                     Granularity 8. Value
                                                     0-120.
                                                     compute_pgm_rsrc1.vgprs +
                                                     shared_vgpr_cnt cannot
                                                     exceed 64.
     31:4    28 bits                                 Reserved, must be 0.
     32      **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited
by the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually setup in the kernel descriptor using the ``enable_sgpr_*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`).
The register numbers
used for enabled registers are dense starting at SGPR0: the first enabled
register is SGPR0, the next enabled register is SGPR1 etc.; disabled registers
do not have an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      V# that can be used, together
                (enable_sgpr_private              with Scratch Wavefront Offset
                _segment_buffer)                  as an offset, to access the
                                                  private memory space using a
                                                  segment address.

                                                  CP uses the value provided
                                                  by the runtime.
     then       Dispatch Ptr               2      64-bit address of AQL
                (enable_sgpr_dispatch_ptr)        dispatch packet for kernel
                                                  dispatch actually executing.
     then       Queue Ptr                  2      64-bit address of amd_queue_t
                (enable_sgpr_queue_ptr)           object for AQL queue on
                                                  which the dispatch packet
                                                  was queued.
     then       Kernarg Segment Ptr        2      64-bit address of Kernarg
                (enable_sgpr_kernarg              segment. This is directly
                _segment_ptr)                     copied from the
                                                  kernarg_address in the
                                                  kernel dispatch packet.

                                                  Having CP load it once
                                                  avoids loading it at the
                                                  beginning of every
                                                  wavefront.
     then       Dispatch Id                2      64-bit Dispatch ID of the
                (enable_sgpr_dispatch_id)         dispatch packet being
                                                  executed.
     then       Flat Scratch Init          2      This is 2 SGPRs:
                (enable_sgpr_flat_scratch
                _init)                            GFX6
                                                    Not supported.
                                                  GFX7-GFX8
                                                    The first SGPR is a 32-bit
                                                    byte offset from
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    to per SPI base of memory
                                                    for scratch for the queue
                                                    executing the kernel
                                                    dispatch. CP obtains this
                                                    from the runtime. (The
                                                    Scratch Segment Buffer
                                                    base address is
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    plus this offset.) The
                                                    value of Scratch Wavefront
                                                    Offset must be added to
                                                    this offset by the kernel
                                                    machine code, right
                                                    shifted by 8, and moved to
                                                    the FLAT_SCRATCH_HI SGPR
                                                    register. FLAT_SCRATCH_HI
                                                    corresponds to SGPRn-4 on
                                                    GFX7, and SGPRn-6 on GFX8
                                                    (where SGPRn is the
                                                    highest numbered SGPR
                                                    allocated to the
                                                    wavefront).
                                                    FLAT_SCRATCH_HI is
                                                    multiplied by 256 (as it
                                                    is in units of 256 bytes)
                                                    and added to
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    to calculate the per
                                                    wavefront FLAT SCRATCH
                                                    BASE in flat memory
                                                    instructions that access
                                                    the scratch aperture.

                                                    The second SGPR is the
                                                    32-bit byte size of a
                                                    single work-item's scratch
                                                    memory usage. CP obtains
                                                    this from the runtime, and
                                                    it is always a multiple of
                                                    DWORD. CP checks that the
                                                    value in the kernel
                                                    dispatch packet Private
                                                    Segment Byte Size is not
                                                    larger and requests the
                                                    runtime to increase the
                                                    queue's scratch size if
                                                    necessary. The kernel code
                                                    must move it to
                                                    FLAT_SCRATCH_LO which is
                                                    SGPRn-3 on GFX7 and
                                                    SGPRn-5 on GFX8.
                                                    FLAT_SCRATCH_LO is used as
                                                    the FLAT SCRATCH SIZE in
                                                    flat memory instructions.
Having CP load
                                                    it once avoids loading it
                                                    at the beginning of every
                                                    wavefront.
                                                  GFX9-GFX10
                                                    This is the 64-bit base
                                                    address of the per SPI
                                                    scratch backing memory
                                                    managed by SPI for the
                                                    queue executing the kernel
                                                    dispatch. CP obtains this
                                                    from the runtime (and
                                                    divides it if there are
                                                    multiple Shader Arrays
                                                    each with its own SPI).
                                                    The value of Scratch
                                                    Wavefront Offset must be
                                                    added by the kernel
                                                    machine code and the
                                                    result moved to the
                                                    FLAT_SCRATCH SGPR which is
                                                    SGPRn-6 and SGPRn-5. It is
                                                    used as the FLAT SCRATCH
                                                    BASE in flat memory
                                                    instructions.
     then       Private Segment Size       1      The 32-bit byte size of a
                (enable_sgpr_private              single work-item's scratch
                _segment_size)                    memory allocation. This is
                                                  the value from the kernel
                                                  dispatch packet Private
                                                  Segment Byte Size rounded
                                                  up by CP to a multiple of
                                                  DWORD.

                                                  Having CP load it once
                                                  avoids loading it at the
                                                  beginning of every
                                                  wavefront.

                                                  This is not used for
                                                  GFX7-GFX8 since it is the
                                                  same value as the second
                                                  SGPR of Flat Scratch Init.
                                                  However, it may be needed
                                                  for GFX9-GFX10 which
                                                  changes the meaning of the
                                                  Flat Scratch Init value.
     then       Grid Work-Group Count X    1      32-bit count of the number
                (enable_sgpr_grid                 of work-groups in the X
                _workgroup_count_X)               dimension for the grid
                                                  being executed. Computed
                                                  from the fields in the
                                                  kernel dispatch packet as
                                                  ((grid_size.x +
                                                  workgroup_size.x - 1) /
                                                  workgroup_size.x).
     then       Grid Work-Group Count Y    1      32-bit count of the number
                (enable_sgpr_grid                 of work-groups in the Y
                _workgroup_count_Y &&             dimension for the grid
                less than 16 previous             being executed. Computed
                SGPRs)                            from the fields in the
                                                  kernel dispatch packet as
                                                  ((grid_size.y +
                                                  workgroup_size.y - 1) /
                                                  workgroup_size.y).

                                                  Only initialized if <16
                                                  previous SGPRs initialized.
     then       Grid Work-Group Count Z    1      32-bit count of the number
                (enable_sgpr_grid                 of work-groups in the Z
                _workgroup_count_Z &&             dimension for the grid
                less than 16 previous             being executed. Computed
                SGPRs)                            from the fields in the
                                                  kernel dispatch packet as
                                                  ((grid_size.z +
                                                  workgroup_size.z - 1) /
                                                  workgroup_size.z).

                                                  Only initialized if <16
                                                  previous SGPRs initialized.
     then       Work-Group Id X            1      32-bit work-group id in X
                (enable_sgpr_workgroup_id         dimension of grid for
                _X)                               wavefront.
     then       Work-Group Id Y            1      32-bit work-group id in Y
                (enable_sgpr_workgroup_id         dimension of grid for
                _Y)                               wavefront.
     then       Work-Group Id Z            1      32-bit work-group id in Z
                (enable_sgpr_workgroup_id         dimension of grid for
                _Z)                               wavefront.
     then       Work-Group Info            1      {first_wavefront, 14'b0000,
                (enable_sgpr_workgroup            ordered_append_term[10:0],
                _info)                            threadgroup_size_in_wavefronts[5:0]}
     then       Scratch Wavefront Offset   1      32-bit byte offset from base
                (enable_sgpr_private              of scratch base of queue
                _segment_wavefront_offset)        executing the kernel
                                                  dispatch. Must be used as an
                                                  offset with Private segment
                                                  address when using Scratch
                                                  Segment Buffer. It must be
                                                  used to set up FLAT SCRATCH
                                                  for flat addressing (see
                                                  :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
     ========== ========================== ====== ==============================

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually setup in the kernel descriptor using the ``enable_vgpr*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at VGPR0: the first enabled
register is VGPR0, the next enabled register is VGPR1 etc.; disabled registers
do not have a VGPR number.
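The dense register numbering can be illustrated with a short sketch (the set
of enabled registers below is hypothetical; the order and SGPR counts follow
the SGPR Register Set Up Order table above, truncated to the User SGPRs):

```python
# Ordered (name, sgpr_count) pairs from the SGPR Register Set Up Order table.
SGPR_SETUP_ORDER = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
]

def assign_sgprs(enabled):
    """Assign dense SGPR numbers starting at SGPR0; a disabled register
    gets no SGPR number at all."""
    layout, next_sgpr = {}, 0
    for name, count in SGPR_SETUP_ORDER:
        if name in enabled:
            layout[name] = next_sgpr
            next_sgpr += count
    return layout

# With only Dispatch Ptr and Kernarg Segment Ptr enabled, Dispatch Ptr
# occupies SGPR0-1 and Kernarg Segment Ptr immediately follows at SGPR2-3.
layout = assign_sgprs({"dispatch_ptr", "kernarg_segment_ptr"})
```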

VGPR register initial state is defined in
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.

  .. table:: VGPR Register Set Up Order
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     VGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     VGPRs
     ========== ========================== ====== ==============================
     First      Work-Item Id X             1      32-bit work-item id in X
                (Always initialized)              dimension of work-group for
                                                  wavefront lane.
     then       Work-Item Id Y             1      32-bit work-item id in Y
                (enable_vgpr_workitem_id          dimension of work-group for
                > 0)                              wavefront lane.
     then       Work-Item Id Z             1      32-bit work-item id in Z
                (enable_vgpr_workitem_id          dimension of work-group for
                > 1)                              wavefront lane.
     ========== ========================== ====== ==============================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is
   why its value cannot be included with the flat scratch init value which is
   per queue.
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).

The Flat Scratch register pair is adjacent SGPRs, so they can be moved as a
64-bit value to the hardware required SGPRn-3 and SGPRn-4 respectively.

The global segment can be accessed either using buffer instructions (GFX6,
which has V# 64-bit address support), flat instructions (GFX7-GFX10), or
global instructions (GFX9-GFX10).
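A numeric sketch of the GFX7 flat scratch setup described in the SGPR table
above (all input values here are hypothetical, and register roles differ
between GFX7-GFX8 and GFX9-GFX10; this only models the documented arithmetic):

```python
def gfx7_flat_scratch(spi_scratch_offset, scratch_wavefront_offset,
                      per_workitem_size):
    # FLAT_SCRATCH_HI holds the wavefront's byte offset from
    # SH_HIDDEN_PRIVATE_BASE_VIMID in units of 256 bytes, so the summed
    # byte offset is shifted right by 8 before the move.
    flat_scratch_hi = (spi_scratch_offset + scratch_wavefront_offset) >> 8
    # FLAT_SCRATCH_LO holds the byte size of a single work-item's scratch
    # memory usage (always a multiple of a dword).
    flat_scratch_lo = per_workitem_size
    return flat_scratch_hi, flat_scratch_lo

hi, lo = gfx7_flat_scratch(0x40000, 0x2000, 64)
# Hardware later multiplies hi by 256 and adds SH_HIDDEN_PRIVATE_BASE_VIMID
# to recover the per wavefront FLAT SCRATCH BASE.
```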

If buffer operations are used, then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC
  for APU and NC for dGPU).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

The compiler performs initialization in the kernel prolog depending on the
target and information about things such as stack usage in the kernel and
called functions. Some of this initialization requires the compiler to request
certain User and System SGPRs be present in the
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
:ref:`amdgpu-amdhsa-kernel-descriptor`.

.. _amdgpu-amdhsa-kernel-prolog-cfi:

CFI
+++

1. The CFI return address is undefined.

2. The CFI CFA is defined using an expression which evaluates to a location
   description that comprises one memory location description for the
   ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.

.. _amdgpu-amdhsa-kernel-prolog-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. The total LDS size
  is available in the dispatch packet. For M0, it is also possible to use the
  maximum possible LDS value for the given target (0x7FFF for GFX6 and 0xFFFF
  for GFX7-GFX8).
GFX9-GFX10
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.

..
_amdgpu-amdhsa-kernel-prolog-stack-pointer:

Stack Pointer
+++++++++++++

If the kernel has function calls, it must set up the ABI stack pointer
described in
:ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.

.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:

Frame Pointer
+++++++++++++

If the kernel needs a frame pointer for the reasons defined in
``SIFrameLowering``, then SGPR33 is used and is always set to ``0`` in the
kernel prolog. If a frame pointer is not required, then all uses of the frame
pointer are replaced with immediate ``0`` offsets.

.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:

Flat Scratch
++++++++++++

If the kernel or any function it calls may use flat operations to access
scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI, which are in SGPRn-4/SGPRn-3).
Initialization uses the Flat Scratch Init and Scratch Wavefront Offset SGPR
registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

GFX6
  Flat scratch is not supported.

GFX7-GFX8

  1. The low word of Flat Scratch Init is the 32-bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address. The
     prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_LO is in units of
     256 bytes, the offset must be right shifted by 8 before moving into
     FLAT_SCRATCH_LO.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
This is directly loaded from the kernel
     dispatch packet Private Segment Byte Size and rounded up to a multiple of
     DWORD. Having CP load it once avoids loading it at the beginning of every
     wavefront. The prolog must move it to FLAT_SCRATCH_HI for use as FLAT
     SCRATCH SIZE.

GFX9-GFX10
  The Flat Scratch Init is the 64-bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch. The
  prolog must add the value of Scratch Wavefront Offset and move the result to
  the FLAT_SCRATCH pair for use as the flat scratch base in flat memory
  instructions.

.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:

Private Segment Buffer
++++++++++++++++++++++

A set of four SGPRs beginning at a four-aligned SGPR index are always selected
to serve as the scratch V# for the kernel as follows:

  - If it is known during instruction selection that there is stack usage,
    SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
    optimizations are disabled (``-O0``), if stack objects already exist (for
    locals, etc.), or if there are any function calls.

  - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
    are reserved for the tentative scratch V#. These will be used if it is
    determined that spilling is needed.

    - If no use is made of the tentative scratch V#, then it is unreserved,
      and the register count is determined ignoring it.
    - If use is made of the tentative scratch V#, then its register numbers
      are shifted to the first four-aligned SGPR index after the highest one
      allocated by the register allocator, and all uses are updated. The
      register count includes them in the shifted location.
    - In either case, if the processor has the SGPR allocation bug, the
      tentative allocation is not shifted or unreserved in order to ensure
      the register count is higher, to work around the bug.

  .. note::

    This approach of using a tentative scratch V# and shifting the register
    numbers if used avoids having to perform register allocation a second
    time if the tentative V# is eliminated. This is more efficient and
    avoids the problem that the second register allocation may perform
    spilling, which would fail as there would no longer be a scratch V#.

When the kernel prolog code is being emitted, it is known whether the scratch
V# described above is actually used. If it is, the prolog code must set it up
by copying the Private Segment Buffer to the scratch V# registers and then
adding the Private Segment Wavefront Offset to the queue base address in the
V#. The result is a V# with a base address pointing to the beginning of the
wavefront scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU
machine code (see :ref:`memmodel`).

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute. The ``s_waitcnt`` and cache
management instructions such as ``buffer_wbinvl1_vol`` are defined with
respect to other memory instructions executed by the same thread.
This allows
them to be moved earlier or later, which can allow them to be combined with
other instances of the same instruction, or hoisted/sunk out of loops to
improve performance. Only the instructions related to the memory model are
given; additional ``s_waitcnt`` instructions are required to ensure registers
are defined before being used. These may be able to be combined with the
memory model ``s_waitcnt`` instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both the global
    and local address spaces, and seq_cst instructions, join the relations.
    Since the LLVM ``fence`` instruction does not allow an address space to be
    specified, the OpenCL fence has to conservatively assume both the local
    and global address spaces were specified. However, optimizations can often
    be done to eliminate the additional ``s_waitcnt`` instructions when there
    are no intervening memory instructions which access the corresponding
    address space. The code sequences in the table indicate what can be
    omitted for the OpenCL memory model. The target triple environment is used
    to determine if the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10).
Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores, atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space, which is treated
as non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

The memory order also adds the single thread optimization constraints defined
in table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can be
                    moved before the acquire.
4627 - If a fence then same as load atomic, plus no preceding 4628 associated fence-paired-atomic can be moved after the fence. 4629 release - If a store atomic/atomicrmw then no preceding load/load 4630 atomic/store/store atomic/atomicrmw/fence instruction can be 4631 moved after the release. 4632 - If a fence then same as store atomic, plus no following 4633 associated fence-paired-atomic can be moved before the 4634 fence. 4635 acq_rel Same constraints as both acquire and release. 4636 seq_cst - If a load atomic then same constraints as acquire, plus no 4637 preceding sequentially consistent load atomic/store 4638 atomic/atomicrmw/fence instruction can be moved after the 4639 seq_cst. 4640 - If a store atomic then the same constraints as release, plus 4641 no following sequentially consistent load atomic/store 4642 atomic/atomicrmw/fence instruction can be moved before the 4643 seq_cst. 4644 - If an atomicrmw/fence then same constraints as acq_rel. 4645 ============ ============================================================== 4646 4647The code sequences used to implement the memory model are defined in the 4648following sections: 4649 4650* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9` 4651* :ref:`amdgpu-amdhsa-memory-model-gfx10` 4652 4653.. _amdgpu-amdhsa-memory-model-gfx6-gfx9: 4654 4655Memory Model GFX6-GFX9 4656++++++++++++++++++++++ 4657 4658For GFX6-GFX9: 4659 4660* Each agent has multiple shader arrays (SA). 4661* Each SA has multiple compute units (CU). 4662* Each CU has multiple SIMDs that execute wavefronts. 4663* The wavefronts for a single work-group are executed in the same CU but may be 4664 executed by different SIMDs. 4665* Each CU has a single LDS memory shared by the wavefronts of the work-groups 4666 executing on it. 4667* All LDS operations of a CU are performed as wavefront wide operations in a 4668 global order and involve no caching. Completion is reported to a wavefront in 4669 execution order. 
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations
  and vector memory operations between wavefronts of a work-group, but not
  between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation
  order if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence
  between the lanes of a single wavefront, or for coherence between wavefronts
  in the same work-group. A ``buffer_wbinvl1_vol`` is required for coherence
  between wavefronts executing in different work-groups as they may be
  executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all
  wavefronts on a group of CUs. The scalar and vector L1 caches are not
  coherent. However, scalar operations are used in a restricted way so do not
  impact the memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel.
Therefore, the vector and 4698 scalar memory operations performed by wavefronts executing in different 4699 work-groups (which may be executing on different CUs) of an agent can be 4700 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to 4701 ensure synchronization between vector memory operations of different CUs. It 4702 ensures a previous vector memory operation has completed before executing a 4703 subsequent vector memory or LDS operation and so can be used to meet the 4704 requirements of acquire and release. 4705* The L2 cache can be kept coherent with other agents on some targets, or ranges 4706 of virtual addresses can be set up to bypass it to ensure system coherence. 4707 4708Scalar memory operations are only used to access memory that is proven to not 4709change during the execution of the kernel dispatch. This includes constant 4710address space and global address space for program scope ``const`` variables. 4711Therefore, the kernel machine code does not have to maintain the scalar cache to 4712ensure it is coherent with the vector caches. The scalar and vector caches are 4713invalidated between kernel dispatches by CP since constant address space data 4714may change between kernel dispatch executions. See 4715:ref:`amdgpu-amdhsa-memory-spaces`. 4716 4717The one exception is if scalar writes are used to spill SGPR registers. In this 4718case the AMDGPU backend ensures the memory location used to spill is never 4719accessed by vector memory operations at the same time. If scalar writes are used 4720then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function 4721return since the locations may be used for vector memory instructions by a 4722future wavefront that uses the same scratch area, or a function call that 4723creates a frame at the same address, respectively. There is no need for a 4724``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. 
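As a rough illustration of how the cache hierarchy above drives code
generation, the following Python sketch maps a load atomic acquire on the
global address space at each synchronization scope to the GFX6-GFX9 sequence
from the code-sequences table below (the helper name is an assumption for
illustration; the table remains the authoritative definition):

```python
def gfx6_gfx9_load_acquire(scope: str) -> list[str]:
    if scope in ("singlethread", "wavefront"):
        # Lanes of a wavefront need no extra synchronization.
        return ["buffer/global/flat_load"]
    if scope == "workgroup":
        # Wavefronts of a work-group share the CU's vector L1 cache.
        return ["buffer/global_load"]
    if scope in ("agent", "system"):
        # Bypass L1 (glc=1), wait for the load to complete, then
        # invalidate L1 so following loads do not see stale global data.
        return ["buffer/global_load glc=1",
                "s_waitcnt vmcnt(0)",
                "buffer_wbinvl1_vol"]
    raise ValueError(f"unknown scope: {scope}")

print(gfx6_gfx9_load_acquire("agent"))
```

For the local and generic address spaces the table adds ``s_waitcnt
lgkmcnt(0)`` handling that is not modeled here.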
4725 4726For kernarg backing memory: 4727 4728* CP invalidates the L1 cache at the start of each kernel dispatch. 4729* On dGPU the kernarg backing memory is allocated in host memory accessed as 4730 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also 4731 causes it to be treated as non-volatile and so is not invalidated by 4732 ``*_vol``. 4733* On APU the kernarg backing memory it is accessed as MTYPE CC (cache coherent) 4734 and so the L2 cache will be coherent with the CPU and other agents. 4735 4736Scratch backing memory (which is used for the private address space) is accessed 4737with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is 4738only accessed by a single thread, and is always write-before-read, there is 4739never a need to invalidate these entries from the L1 cache. Hence all cache 4740invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. 4741 4742The code sequences used to implement the memory model for GFX6-GFX9 are defined 4743in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. 4744 4745 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 4746 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table 4747 4748 ============ ============ ============== ========== ================================ 4749 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 4750 Ordering Sync Scope Address GFX6-9 4751 Space 4752 ============ ============ ============== ========== ================================ 4753 **Non-Atomic** 4754 ------------------------------------------------------------------------------------ 4755 load *none* *none* - global - !volatile & !nontemporal 4756 - generic 4757 - private 1. buffer/global/flat_load 4758 - constant 4759 - volatile & !nontemporal 4760 4761 1. buffer/global/flat_load 4762 glc=1 4763 4764 - nontemporal 4765 4766 1. buffer/global/flat_load 4767 glc=1 slc=1 4768 4769 load *none* *none* - local 1. 
ds_load 4770 store *none* *none* - global - !nontemporal 4771 - generic 4772 - private 1. buffer/global/flat_store 4773 - constant 4774 - nontemporal 4775 4776 1. buffer/global/flat_store 4777 glc=1 slc=1 4778 4779 store *none* *none* - local 1. ds_store 4780 **Unordered Atomic** 4781 ------------------------------------------------------------------------------------ 4782 load atomic unordered *any* *any* *Same as non-atomic*. 4783 store atomic unordered *any* *any* *Same as non-atomic*. 4784 atomicrmw unordered *any* *any* *Same as monotonic atomic*. 4785 **Monotonic Atomic** 4786 ------------------------------------------------------------------------------------ 4787 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load 4788 - wavefront - local 4789 - workgroup - generic 4790 load atomic monotonic - agent - global 1. buffer/global/flat_load 4791 - system - generic glc=1 4792 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 4793 - wavefront - generic 4794 - workgroup 4795 - agent 4796 - system 4797 store atomic monotonic - singlethread - local 1. ds_store 4798 - wavefront 4799 - workgroup 4800 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 4801 - wavefront - generic 4802 - workgroup 4803 - agent 4804 - system 4805 atomicrmw monotonic - singlethread - local 1. ds_atomic 4806 - wavefront 4807 - workgroup 4808 **Acquire Atomic** 4809 ------------------------------------------------------------------------------------ 4810 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 4811 - wavefront - local 4812 - generic 4813 load atomic acquire - workgroup - global 1. buffer/global_load 4814 load atomic acquire - workgroup - local 1. ds/flat_load 4815 - generic 2. s_waitcnt lgkmcnt(0) 4816 4817 - If OpenCL, omit. 4818 - Must happen before 4819 any following 4820 global/generic 4821 load/load 4822 atomic/store/store 4823 atomic/atomicrmw. 
4824 - Ensures any 4825 following global 4826 data read is no 4827 older than a local load 4828 atomic value being 4829 acquired. 4830 4831 load atomic acquire - agent - global 1. buffer/global_load 4832 - system glc=1 4833 2. s_waitcnt vmcnt(0) 4834 4835 - Must happen before 4836 following 4837 buffer_wbinvl1_vol. 4838 - Ensures the load 4839 has completed 4840 before invalidating 4841 the cache. 4842 4843 3. buffer_wbinvl1_vol 4844 4845 - Must happen before 4846 any following 4847 global/generic 4848 load/load 4849 atomic/atomicrmw. 4850 - Ensures that 4851 following 4852 loads will not see 4853 stale global data. 4854 4855 load atomic acquire - agent - generic 1. flat_load glc=1 4856 - system 2. s_waitcnt vmcnt(0) & 4857 lgkmcnt(0) 4858 4859 - If OpenCL omit 4860 lgkmcnt(0). 4861 - Must happen before 4862 following 4863 buffer_wbinvl1_vol. 4864 - Ensures the flat_load 4865 has completed 4866 before invalidating 4867 the cache. 4868 4869 3. buffer_wbinvl1_vol 4870 4871 - Must happen before 4872 any following 4873 global/generic 4874 load/load 4875 atomic/atomicrmw. 4876 - Ensures that 4877 following loads 4878 will not see stale 4879 global data. 4880 4881 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 4882 - wavefront - local 4883 - generic 4884 atomicrmw acquire - workgroup - global 1. buffer/global_atomic 4885 atomicrmw acquire - workgroup - local 1. ds/flat_atomic 4886 - generic 2. s_waitcnt lgkmcnt(0) 4887 4888 - If OpenCL, omit. 4889 - Must happen before 4890 any following 4891 global/generic 4892 load/load 4893 atomic/store/store 4894 atomic/atomicrmw. 4895 - Ensures any 4896 following global 4897 data read is no 4898 older than a local 4899 atomicrmw value 4900 being acquired. 4901 4902 atomicrmw acquire - agent - global 1. buffer/global_atomic 4903 - system 2. s_waitcnt vmcnt(0) 4904 4905 - Must happen before 4906 following 4907 buffer_wbinvl1_vol. 
4908 - Ensures the 4909 atomicrmw has 4910 completed before 4911 invalidating the 4912 cache. 4913 4914 3. buffer_wbinvl1_vol 4915 4916 - Must happen before 4917 any following 4918 global/generic 4919 load/load 4920 atomic/atomicrmw. 4921 - Ensures that 4922 following loads 4923 will not see stale 4924 global data. 4925 4926 atomicrmw acquire - agent - generic 1. flat_atomic 4927 - system 2. s_waitcnt vmcnt(0) & 4928 lgkmcnt(0) 4929 4930 - If OpenCL, omit 4931 lgkmcnt(0). 4932 - Must happen before 4933 following 4934 buffer_wbinvl1_vol. 4935 - Ensures the 4936 atomicrmw has 4937 completed before 4938 invalidating the 4939 cache. 4940 4941 3. buffer_wbinvl1_vol 4942 4943 - Must happen before 4944 any following 4945 global/generic 4946 load/load 4947 atomic/atomicrmw. 4948 - Ensures that 4949 following loads 4950 will not see stale 4951 global data. 4952 4953 fence acquire - singlethread *none* *none* 4954 - wavefront 4955 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 4956 4957 - If OpenCL and 4958 address space is 4959 not generic, omit. 4960 - However, since LLVM 4961 currently has no 4962 address space on 4963 the fence need to 4964 conservatively 4965 always generate. If 4966 fence had an 4967 address space then 4968 set to address 4969 space of OpenCL 4970 fence flag, or to 4971 generic if both 4972 local and global 4973 flags are 4974 specified. 4975 - Must happen after 4976 any preceding 4977 local/generic load 4978 atomic/atomicrmw 4979 with an equal or 4980 wider sync scope 4981 and memory ordering 4982 stronger than 4983 unordered (this is 4984 termed the 4985 fence-paired-atomic). 4986 - Must happen before 4987 any following 4988 global/generic 4989 load/load 4990 atomic/store/store 4991 atomic/atomicrmw. 4992 - Ensures any 4993 following global 4994 data read is no 4995 older than the 4996 value read by the 4997 fence-paired-atomic. 4998 4999 fence acquire - agent *none* 1. 
s_waitcnt lgkmcnt(0) & 5000 - system vmcnt(0) 5001 5002 - If OpenCL and 5003 address space is 5004 not generic, omit 5005 lgkmcnt(0). 5006 - However, since LLVM 5007 currently has no 5008 address space on 5009 the fence need to 5010 conservatively 5011 always generate 5012 (see comment for 5013 previous fence). 5014 - Could be split into 5015 separate s_waitcnt 5016 vmcnt(0) and 5017 s_waitcnt 5018 lgkmcnt(0) to allow 5019 them to be 5020 independently moved 5021 according to the 5022 following rules. 5023 - s_waitcnt vmcnt(0) 5024 must happen after 5025 any preceding 5026 global/generic load 5027 atomic/atomicrmw 5028 with an equal or 5029 wider sync scope 5030 and memory ordering 5031 stronger than 5032 unordered (this is 5033 termed the 5034 fence-paired-atomic). 5035 - s_waitcnt lgkmcnt(0) 5036 must happen after 5037 any preceding 5038 local/generic load 5039 atomic/atomicrmw 5040 with an equal or 5041 wider sync scope 5042 and memory ordering 5043 stronger than 5044 unordered (this is 5045 termed the 5046 fence-paired-atomic). 5047 - Must happen before 5048 the following 5049 buffer_wbinvl1_vol. 5050 - Ensures that the 5051 fence-paired atomic 5052 has completed 5053 before invalidating 5054 the 5055 cache. Therefore 5056 any following 5057 locations read must 5058 be no older than 5059 the value read by 5060 the 5061 fence-paired-atomic. 5062 5063 2. buffer_wbinvl1_vol 5064 5065 - Must happen before any 5066 following global/generic 5067 load/load 5068 atomic/store/store 5069 atomic/atomicrmw. 5070 - Ensures that 5071 following loads 5072 will not see stale 5073 global data. 5074 5075 **Release Atomic** 5076 ------------------------------------------------------------------------------------ 5077 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 5078 - wavefront - local 5079 - generic 5080 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5081 - generic 5082 - If OpenCL, omit. 
5083 - Must happen after 5084 any preceding 5085 local/generic 5086 load/store/load 5087 atomic/store 5088 atomic/atomicrmw. 5089 - Must happen before 5090 the following 5091 store. 5092 - Ensures that all 5093 memory operations 5094 to local have 5095 completed before 5096 performing the 5097 store that is being 5098 released. 5099 5100 2. buffer/global/flat_store 5101 store atomic release - workgroup - local 1. ds_store 5102 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 5103 - system - generic vmcnt(0) 5104 5105 - If OpenCL and 5106 address space is 5107 not generic, omit 5108 lgkmcnt(0). 5109 - Could be split into 5110 separate s_waitcnt 5111 vmcnt(0) and 5112 s_waitcnt 5113 lgkmcnt(0) to allow 5114 them to be 5115 independently moved 5116 according to the 5117 following rules. 5118 - s_waitcnt vmcnt(0) 5119 must happen after 5120 any preceding 5121 global/generic 5122 load/store/load 5123 atomic/store 5124 atomic/atomicrmw. 5125 - s_waitcnt lgkmcnt(0) 5126 must happen after 5127 any preceding 5128 local/generic 5129 load/store/load 5130 atomic/store 5131 atomic/atomicrmw. 5132 - Must happen before 5133 the following 5134 store. 5135 - Ensures that all 5136 memory operations 5137 to memory have 5138 completed before 5139 performing the 5140 store that is being 5141 released. 5142 5143 2. buffer/global/flat_store 5144 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 5145 - wavefront - local 5146 - generic 5147 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 5148 - generic 5149 - If OpenCL, omit. 5150 - Must happen after 5151 any preceding 5152 local/generic 5153 load/store/load 5154 atomic/store 5155 atomic/atomicrmw. 5156 - Must happen before 5157 the following 5158 atomicrmw. 5159 - Ensures that all 5160 memory operations 5161 to local have 5162 completed before 5163 performing the 5164 atomicrmw that is 5165 being released. 5166 5167 2. 
buffer/global/flat_atomic 5168 atomicrmw release - workgroup - local 1. ds_atomic 5169 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 5170 - system - generic vmcnt(0) 5171 5172 - If OpenCL, omit 5173 lgkmcnt(0). 5174 - Could be split into 5175 separate s_waitcnt 5176 vmcnt(0) and 5177 s_waitcnt 5178 lgkmcnt(0) to allow 5179 them to be 5180 independently moved 5181 according to the 5182 following rules. 5183 - s_waitcnt vmcnt(0) 5184 must happen after 5185 any preceding 5186 global/generic 5187 load/store/load 5188 atomic/store 5189 atomic/atomicrmw. 5190 - s_waitcnt lgkmcnt(0) 5191 must happen after 5192 any preceding 5193 local/generic 5194 load/store/load 5195 atomic/store 5196 atomic/atomicrmw. 5197 - Must happen before 5198 the following 5199 atomicrmw. 5200 - Ensures that all 5201 memory operations 5202 to global and local 5203 have completed 5204 before performing 5205 the atomicrmw that 5206 is being released. 5207 5208 2. buffer/global/flat_atomic 5209 fence release - singlethread *none* *none* 5210 - wavefront 5211 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 5212 5213 - If OpenCL and 5214 address space is 5215 not generic, omit. 5216 - However, since LLVM 5217 currently has no 5218 address space on 5219 the fence need to 5220 conservatively 5221 always generate. If 5222 fence had an 5223 address space then 5224 set to address 5225 space of OpenCL 5226 fence flag, or to 5227 generic if both 5228 local and global 5229 flags are 5230 specified. 5231 - Must happen after 5232 any preceding 5233 local/generic 5234 load/load 5235 atomic/store/store 5236 atomic/atomicrmw. 5237 - Must happen before 5238 any following store 5239 atomic/atomicrmw 5240 with an equal or 5241 wider sync scope 5242 and memory ordering 5243 stronger than 5244 unordered (this is 5245 termed the 5246 fence-paired-atomic). 
5247 - Ensures that all 5248 memory operations 5249 to local have 5250 completed before 5251 performing the 5252 following 5253 fence-paired-atomic. 5254 5255 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 5256 - system vmcnt(0) 5257 5258 - If OpenCL and 5259 address space is 5260 not generic, omit 5261 lgkmcnt(0). 5262 - If OpenCL and 5263 address space is 5264 local, omit 5265 vmcnt(0). 5266 - However, since LLVM 5267 currently has no 5268 address space on 5269 the fence need to 5270 conservatively 5271 always generate. If 5272 fence had an 5273 address space then 5274 set to address 5275 space of OpenCL 5276 fence flag, or to 5277 generic if both 5278 local and global 5279 flags are 5280 specified. 5281 - Could be split into 5282 separate s_waitcnt 5283 vmcnt(0) and 5284 s_waitcnt 5285 lgkmcnt(0) to allow 5286 them to be 5287 independently moved 5288 according to the 5289 following rules. 5290 - s_waitcnt vmcnt(0) 5291 must happen after 5292 any preceding 5293 global/generic 5294 load/store/load 5295 atomic/store 5296 atomic/atomicrmw. 5297 - s_waitcnt lgkmcnt(0) 5298 must happen after 5299 any preceding 5300 local/generic 5301 load/store/load 5302 atomic/store 5303 atomic/atomicrmw. 5304 - Must happen before 5305 any following store 5306 atomic/atomicrmw 5307 with an equal or 5308 wider sync scope 5309 and memory ordering 5310 stronger than 5311 unordered (this is 5312 termed the 5313 fence-paired-atomic). 5314 - Ensures that all 5315 memory operations 5316 have 5317 completed before 5318 performing the 5319 following 5320 fence-paired-atomic. 5321 5322 **Acquire-Release Atomic** 5323 ------------------------------------------------------------------------------------ 5324 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 5325 - wavefront - local 5326 - generic 5327 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 5328 5329 - If OpenCL, omit. 
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                         2. buffer/global_atomic

     atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than
                                                              the local load atomic value
                                                              being acquired.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than a
                                                              local load atomic value
                                                              being acquired.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw has
                                                              completed before
                                                              invalidating the cache.

                                                         4. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following atomicrmw.
                                                            - Ensures that all memory
                                                              operations to global have
                                                              completed before performing
                                                              the atomicrmw that is being
                                                              released.

                                                         2. flat_atomic
                                                         3. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures the atomicrmw has
                                                              completed before
                                                              invalidating the cache.

                                                         4. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     fence        acq_rel      - singlethread *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit.
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, it needs
                                                              to conservatively always be
                                                              generated (see comment for
                                                              previous fence).
                                                            - Must happen after any
                                                              preceding local/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that all memory
                                                              operations to local have
                                                              completed before performing
                                                              any following global memory
                                                              operations.
                                                            - Ensures that the preceding
                                                              local/generic load
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed before
                                                              following global memory
                                                              operations. This satisfies
                                                              the requirements of acquire.
                                                            - Ensures that all previous
                                                              memory operations have
                                                              completed before a following
                                                              local/generic store
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of release.

     fence        acq_rel      - agent        *none*     1.
                               - system                     s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit
                                                              lgkmcnt(0).
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, it needs
                                                              to conservatively always be
                                                              generated (see comment for
                                                              previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic
                                                              load/store/load atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before the
                                                              following
                                                              buffer_wbinvl1_vol.
                                                            - Ensures that the preceding
                                                              global/local/generic load
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed before
                                                              invalidating the cache. This
                                                              satisfies the requirements
                                                              of acquire.
                                                            - Ensures that all previous
                                                              memory operations have
                                                              completed before a following
                                                              global/local/generic store
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of release.

                                                         2. buffer_wbinvl1_vol

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data. This satisfies the
                                                              requirements of acquire.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
                                              - generic
                                                            - Must happen after preceding
                                                              local/generic load
                                                              atomic/store
                                                              atomic/atomicrmw with memory
                                                              ordering of seq_cst and with
                                                              equal or wider sync scope.
                                                              (Note that seq_cst fences
                                                              have their own s_waitcnt
                                                              lgkmcnt(0) and so do not
                                                              need to be considered.)
                                                            - Ensures any preceding
                                                              sequential consistent local
                                                              memory instructions have
                                                              completed before executing
                                                              this sequentially consistent
                                                              instruction. This prevents
                                                              reordering a seq_cst store
                                                              followed by a seq_cst load.
                                                              (Note that seq_cst is
                                                              stronger than
                                                              acquire/release as the
                                                              reordering of load acquire
                                                              followed by a store release
                                                              is prevented by the
                                                              s_waitcnt of the release,
                                                              but there is nothing
                                                              preventing a store release
                                                              followed by load acquire
                                                              from completing out of
                                                              order. The s_waitcnt could
                                                              be placed after seq_store or
                                                              before the seq_load. We
                                                              choose the load to make the
                                                              s_waitcnt be as late as
                                                              possible so that the store
                                                              may have already completed.)

                                                         2. *Following instructions same
                                                            as corresponding load atomic
                                                            acquire, except must generate
                                                            all instructions even for
                                                            OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*

     load atomic  seq_cst      - agent        - global   1.
                               - system       - generic     s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - Could be split into
                                                              separate s_waitcnt vmcnt(0)
                                                              and s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw with memory
                                                              ordering of seq_cst and with
                                                              equal or wider sync scope.
                                                              (Note that seq_cst fences
                                                              have their own s_waitcnt
                                                              lgkmcnt(0) and so do not
                                                              need to be considered.)
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after preceding
                                                              global/generic load
                                                              atomic/store
                                                              atomic/atomicrmw with memory
                                                              ordering of seq_cst and with
                                                              equal or wider sync scope.
                                                              (Note that seq_cst fences
                                                              have their own s_waitcnt
                                                              vmcnt(0) and so do not need
                                                              to be considered.)
                                                            - Ensures any preceding
                                                              sequential consistent global
                                                              memory instructions have
                                                              completed before executing
                                                              this sequentially consistent
                                                              instruction. This prevents
                                                              reordering a seq_cst store
                                                              followed by a seq_cst load.
                                                              (Note that seq_cst is
                                                              stronger than
                                                              acquire/release as the
                                                              reordering of load acquire
                                                              followed by a store release
                                                              is prevented by the
                                                              s_waitcnt of the release,
                                                              but there is nothing
                                                              preventing a store release
                                                              followed by load acquire
                                                              from completing out of
                                                              order. The s_waitcnt could
                                                              be placed after seq_store or
                                                              before the seq_load. We
                                                              choose the load to make the
                                                              s_waitcnt be as late as
                                                              possible so that the store
                                                              may have already completed.)

                                                         2.
                                                            *Following instructions same
                                                            as corresponding load atomic
                                                            acquire, except must generate
                                                            all instructions even for
                                                            OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ================================

.. _amdgpu-amdhsa-memory-model-gfx10:

Memory Model GFX10
++++++++++++++++++

For GFX10:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP.
  Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations are reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group.
  However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent).
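The scalar versus vector cache split described above can be illustrated with a
small LLVM IR sketch. This is a hypothetical example, not the output of any
particular compiler version; the exact instruction selection varies by target
and optimization level. A uniform value in the constant address space (4) may be
selected to scalar loads that go through the scalar L0 cache, while the store to
the global address space (1) is a vector memory operation that goes through the
vector L0/L1/L2 path:

.. code-block:: llvm

  ; Hypothetical kernel: %c points at constant (non-changing) data, so the
  ; backend may use scalar loads for it; the store to %out uses the vector
  ; memory path. The two cache hierarchies are not coherent, which is safe
  ; only because the scalar path is restricted to memory proven not to
  ; change during the dispatch.
  define amdgpu_kernel void @scalar_vs_vector(i32 addrspace(1)* %out,
                                              i32 addrspace(4)* %c) {
    %v = load i32, i32 addrspace(4)* %c
    store i32 %v, i32 addrspace(1)* %out
    ret void
  }

Because the scalar and vector caches are only reconciled by CP between
dispatches, the backend never uses scalar accesses for memory that vector
operations may write during the same dispatch, as described above.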
Since the private address space is only accessed by a single thread, and is
always write-before-read, there is never a need to invalidate these entries from
the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and the vscnt reports the completion of
store and atomic without return in order. See ``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0 which in turn ensures L1 accesses are
  ordered and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.

The code sequences used to implement the memory model for GFX10 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.

  .. table:: AMDHSA Memory Model Code Sequences GFX10
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table

     ============ ============ ============== ========== ================================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address    GFX10
                                              Space
     ============ ============ ============== ========== ================================
     **Non-Atomic**
     ------------------------------------------------------------------------------------
     load         *none*       *none*         - global   - !volatile & !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_load
                                              - constant
                                                         - volatile & !nontemporal

                                                         1. buffer/global/flat_load
                                                            glc=1 dlc=1

                                                         - nontemporal

                                                         1. buffer/global/flat_load
                                                            slc=1

     load         *none*       *none*         - local    1. ds_load
     store        *none*       *none*         - global   - !nontemporal
                                              - generic
                                              - private  1. buffer/global/flat_store
                                              - constant
                                                         - nontemporal

                                                         1. buffer/global/flat_store
                                                            slc=1

     store        *none*       *none*         - local    1. ds_store
     **Unordered Atomic**
     ------------------------------------------------------------------------------------
     load atomic  unordered    *any*          *any*      *Same as non-atomic*.
     store atomic unordered    *any*          *any*      *Same as non-atomic*.
     atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
     **Monotonic Atomic**
     ------------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
     load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
                                              - generic     glc=1

                                                            - If CU wavefront execution
                                                              mode, omit glc=1.

     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1.
                               - system       - generic     buffer/global/flat_load
                                                            glc=1 dlc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
     **Acquire Atomic**
     ------------------------------------------------------------------------------------
     load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
                               - wavefront    - local
                                              - generic
     load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1

                                                            - If CU wavefront execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt vmcnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Must happen before the
                                                              following buffer_gl0_inv and
                                                              before any following
                                                              global/generic load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     load atomic  acquire      - workgroup    - local    1. ds_load
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before the
                                                              following buffer_gl0_inv and
                                                              before any following
                                                              global/generic load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than
                                                              the local load atomic value
                                                              being acquired.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - If OpenCL, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     load atomic  acquire      - workgroup    - generic  1.
                                                            flat_load glc=1

                                                            - If CU wavefront execution
                                                              mode, omit glc=1.

                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before the
                                                              following buffer_gl0_inv and
                                                              any following global/generic
                                                              load/load atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures any following global
                                                              data read is no older than a
                                                              local load atomic value
                                                              being acquired.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     load atomic  acquire      - agent        - global   1. buffer/global_load
                               - system                     glc=1 dlc=1
                                                         2. s_waitcnt vmcnt(0)

                                                            - Must happen before
                                                              following buffer_gl*_inv.
                                                            - Ensures the load has
                                                              completed before
                                                              invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
                               - system                  2. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen before
                                                              following buffer_gl*_inv.
                                                            - Ensures the flat_load has
                                                              completed before
                                                              invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
                                                         2.
                                                            s_waitcnt vm/vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before the
                                                              following buffer_gl0_inv and
                                                              before any following
                                                              global/generic load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     atomicrmw    acquire      - workgroup    - local    1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before the
                                                              following buffer_gl0_inv.
                                                            - Ensures any following global
                                                              data read is no older than
                                                              the local atomicrmw value
                                                              being acquired.

                                                         3. buffer_gl0_inv

                                                            - If OpenCL, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vm/vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vm/vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before the
                                                              following buffer_gl0_inv.
                                                            - Ensures any following global
                                                              data read is no older than a
                                                              local atomicrmw value being
                                                              acquired.

                                                         3. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
                               - system                  2. s_waitcnt vm/vscnt(0)

                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              following buffer_gl*_inv.
                                                            - Ensures the atomicrmw has
                                                              completed before
                                                              invalidating the caches.

                                                         3.
                                                            buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     atomicrmw    acquire      - agent        - generic  1. flat_atomic
                               - system                  2. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              following buffer_gl*_inv.
                                                            - Ensures the atomicrmw has
                                                              completed before
                                                              invalidating the caches.

                                                         3. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load atomic/atomicrmw.
                                                            - Ensures that following loads
                                                              will not see stale global
                                                              data.

     fence        acquire      - singlethread *none*     *none*
                               - wavefront
     fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL and address space
                                                              is not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and address space
                                                              is local, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, it needs
                                                              to conservatively always be
                                                              generated. If the fence had
                                                              an address space, then it
                                                              would be set to the address
                                                              space of the OpenCL fence
                                                              flag, or to generic if both
                                                              local and global flags are
                                                              specified.
                                                            - Could be split into
                                                              separate s_waitcnt vmcnt(0),
                                                              s_waitcnt vscnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0) must
                                                              happen after any preceding
                                                              global/generic load atomic/
                                                              atomicrmw-with-return-value
                                                              with an equal or wider sync
                                                              scope and memory ordering
                                                              stronger than unordered
                                                              (this is termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt vscnt(0) must
                                                              happen after any preceding
                                                              global/generic
                                                              atomicrmw-no-return-value
                                                              with an equal or wider sync
                                                              scope and memory ordering
                                                              stronger than unordered
                                                              (this is termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0) must
                                                              happen after any preceding
                                                              local/generic load
                                                              atomic/atomicrmw with an
                                                              equal or wider sync scope
                                                              and memory ordering stronger
                                                              than unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before the
                                                              following buffer_gl0_inv.
                                                            - Ensures that the
                                                              fence-paired atomic has
                                                              completed before
                                                              invalidating the cache.
                                                              Therefore any following
                                                              locations read must be no
                                                              older than the value read by
                                                              the fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that following loads
                                                              will not see stale data.

     fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                     vmcnt(0) & vscnt(0)

                                                            - If OpenCL and address space
                                                              is not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and address space
                                                              is local, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - However, since LLVM
                                                              currently has no address
                                                              space on the fence, it needs
                                                              to conservatively always be
                                                              generated (see comment for
                                                              previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt vmcnt(0),
                                                              s_waitcnt vscnt(0) and
                                                              s_waitcnt lgkmcnt(0) to
                                                              allow them to be
                                                              independently moved
                                                              according to the following
                                                              rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load
                                                              atomic/
                                                              atomicrmw-with-return-value
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              atomicrmw-no-return-value
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Must happen before
                                                              the following
                                                              buffer_gl*_inv.
                                                            - Ensures that the
                                                              fence-paired atomic
                                                              has completed
                                                              before invalidating
                                                              the
                                                              caches. Therefore
                                                              any following
                                                              locations read must
                                                              be no older than
                                                              the value read by
                                                              the
                                                              fence-paired-atomic.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before any
                                                              following global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     **Release Atomic**
     ------------------------------------------------------------------------------------
     store atomic release      - singlethread  - global   1. buffer/global/ds/flat_store
                               - wavefront    - local
                                              - generic
     store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store
                                                              atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         2. buffer/global/flat_store
     store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              vscnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              global memory
                                                              operations have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         2. ds_store
     store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0) & vscnt(0)

                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt vscnt(0)
                                                              and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         2. buffer/global/flat_store
     atomicrmw    release      - singlethread  - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store
                                                              atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global/flat_atomic
     atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              vscnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              global memory
                                                              operations have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         2. ds_atomic
     atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system       - generic     vmcnt(0) & vscnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to global and local
                                                              have completed
                                                              before performing
                                                              the atomicrmw that
                                                              is being released.

                                                         2. buffer/global/flat_atomic
     fence        release      - singlethread  *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate. If
                                                              fence had an
                                                              address space then
                                                              set to address
                                                              space of OpenCL
                                                              fence flag, or to
                                                              generic if both
                                                              local and global
                                                              flags are
                                                              specified.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store atomic/
                                                              atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0) & vscnt(0)

                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate. If
                                                              fence had an
                                                              address space then
                                                              set to address
                                                              space of OpenCL
                                                              fence flag, or to
                                                              generic if both
                                                              local and global
                                                              flags are
                                                              specified.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              any following store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              fence-paired-atomic).
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              following
                                                              fence-paired-atomic.

     **Acquire-Release Atomic**
     ------------------------------------------------------------------------------------
     atomicrmw    acq_rel      - singlethread  - global   1. buffer/global/ds/flat_atomic
                               - wavefront    - local
                                              - generic
     atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0), and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store
                                                              atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              the following
                                                              buffer_gl0_inv.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the
                                                              atomicrmw value
                                                              being acquired.

                                                         4. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
                                                              vscnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - Must happen before
                                                              the following
                                                              store.
                                                            - Ensures that all
                                                              global memory
                                                              operations have
                                                              completed before
                                                              performing the
                                                              store that is being
                                                              released.

                                                         2. ds_atomic
                                                         3. s_waitcnt lgkmcnt(0)

                                                            - If OpenCL, omit.
                                                            - Must happen before
                                                              the following
                                                              buffer_gl0_inv.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the local load
                                                              atomic value being
                                                              acquired.

                                                         4. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - If OpenCL, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store
                                                              atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before
                                                              the following
                                                              buffer_gl0_inv.
                                                            - Ensures any
                                                              following global
                                                              data read is no
                                                              older than the load
                                                              atomic value being
                                                              acquired.

                                                         4. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0) & vscnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              to global have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. buffer/global_atomic
                                                         3. s_waitcnt vm/vscnt(0)

                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              following
                                                              buffer_gl*_inv.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         4. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0) & vscnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0), and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing the
                                                              atomicrmw that is
                                                              being released.

                                                         2. flat_atomic
                                                         3. s_waitcnt vm/vscnt(0) &
                                                            lgkmcnt(0)

                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
                                                            - Must happen before
                                                              following
                                                              buffer_gl*_inv.
                                                            - Ensures the
                                                              atomicrmw has
                                                              completed before
                                                              invalidating the
                                                              caches.

                                                         4. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data.

     fence        acq_rel      - singlethread  *none*     *none*
                               - wavefront
     fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However,
                                                              since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate
                                                              (see comment for
                                                              previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store atomic/
                                                              atomicrmw.
                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that all
                                                              memory operations
                                                              have
                                                              completed before
                                                              performing any
                                                              following global
                                                              memory operations.
                                                            - Ensures that the
                                                              preceding
                                                              local/generic load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed
                                                              before following
                                                              global memory
                                                              operations. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.
                                                            - Ensures that all
                                                              previous memory
                                                              operations have
                                                              completed before a
                                                              following
                                                              local/generic store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of
                                                              release.
                                                            - Must happen before
                                                              the following
                                                              buffer_gl0_inv.
                                                            - Ensures that the
                                                              acquire-fence-paired
                                                              atomic has completed
                                                              before invalidating
                                                              the
                                                              cache. Therefore
                                                              any following
                                                              locations read must
                                                              be no older than
                                                              the value read by
                                                              the
                                                              acquire-fence-paired-atomic.

                                                         2. buffer_gl0_inv

                                                            - If CU wavefront execution
                                                              mode, omit.
                                                            - Ensures that
                                                              following
                                                              loads will not see
                                                              stale data.

     fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
                               - system                      vmcnt(0) & vscnt(0)

                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
                                                              lgkmcnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              local, omit
                                                              vmcnt(0) and vscnt(0).
                                                            - However, since LLVM
                                                              currently has no
                                                              address space on
                                                              the fence need to
                                                              conservatively
                                                              always generate
                                                              (see comment for
                                                              previous fence).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0) and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
                                                            - s_waitcnt vmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              load/load
                                                              atomic/
                                                              atomicrmw-with-return-value.
                                                            - s_waitcnt vscnt(0)
                                                              must happen after
                                                              any preceding
                                                              global/generic
                                                              store/store atomic/
                                                              atomicrmw-no-return-value.
                                                            - s_waitcnt lgkmcnt(0)
                                                              must happen after
                                                              any preceding
                                                              local/generic
                                                              load/store/load
                                                              atomic/store
                                                              atomic/atomicrmw.
                                                            - Must happen before
                                                              the following
                                                              buffer_gl*_inv.
                                                            - Ensures that the
                                                              preceding
                                                              global/local/generic
                                                              load
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              acquire-fence-paired-atomic)
                                                              has completed
                                                              before invalidating
                                                              the caches. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.
                                                            - Ensures that all
                                                              previous memory
                                                              operations have
                                                              completed before a
                                                              following
                                                              global/local/generic
                                                              store
                                                              atomic/atomicrmw
                                                              with an equal or
                                                              wider sync scope
                                                              and memory ordering
                                                              stronger than
                                                              unordered (this is
                                                              termed the
                                                              release-fence-paired-atomic).
                                                              This satisfies the
                                                              requirements of
                                                              release.

                                                         2. buffer_gl0_inv;
                                                            buffer_gl1_inv

                                                            - Must happen before
                                                              any following
                                                              global/generic
                                                              load/load
                                                              atomic/store/store
                                                              atomic/atomicrmw.
                                                            - Ensures that
                                                              following loads
                                                              will not see stale
                                                              global data. This
                                                              satisfies the
                                                              requirements of
                                                              acquire.

     **Sequential Consistent Atomic**
     ------------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread  - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                              - generic     vmcnt(0) & vscnt(0)

                                                            - If CU wavefront execution
                                                              mode, omit vmcnt(0) and
                                                              vscnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
                                                              vscnt(0), and s_waitcnt
                                                              lgkmcnt(0) to allow
                                                              them to be
                                                              independently moved
                                                              according to the
                                                              following rules.
7610 - s_waitcnt lgkmcnt(0) must 7611 happen after 7612 preceding 7613 local/generic load 7614 atomic/store 7615 atomic/atomicrmw 7616 with memory 7617 ordering of seq_cst 7618 and with equal or 7619 wider sync scope. 7620 (Note that seq_cst 7621 fences have their 7622 own s_waitcnt 7623 lgkmcnt(0) and so do 7624 not need to be 7625 considered.) 7626 - s_waitcnt vmcnt(0) 7627 must happen after 7628 preceding 7629 global/generic load 7630 atomic/ 7631 atomicrmw-with-return-value 7632 with memory 7633 ordering of seq_cst 7634 and with equal or 7635 wider sync scope. 7636 (Note that seq_cst 7637 fences have their 7638 own s_waitcnt 7639 vmcnt(0) and so do 7640 not need to be 7641 considered.) 7642 - s_waitcnt vscnt(0) 7643 Must happen after 7644 preceding 7645 global/generic store 7646 atomic/ 7647 atomicrmw-no-return-value 7648 with memory 7649 ordering of seq_cst 7650 and with equal or 7651 wider sync scope. 7652 (Note that seq_cst 7653 fences have their 7654 own s_waitcnt 7655 vscnt(0) and so do 7656 not need to be 7657 considered.) 7658 - Ensures any 7659 preceding 7660 sequential 7661 consistent global/local 7662 memory instructions 7663 have completed 7664 before executing 7665 this sequentially 7666 consistent 7667 instruction. This 7668 prevents reordering 7669 a seq_cst store 7670 followed by a 7671 seq_cst load. (Note 7672 that seq_cst is 7673 stronger than 7674 acquire/release as 7675 the reordering of 7676 load acquire 7677 followed by a store 7678 release is 7679 prevented by the 7680 s_waitcnt of 7681 the release, but 7682 there is nothing 7683 preventing a store 7684 release followed by 7685 load acquire from 7686 completing out of 7687 order. The s_waitcnt 7688 could be placed after 7689 seq_store or before 7690 the seq_load. We 7691 choose the load to 7692 make the s_waitcnt be 7693 as late as possible 7694 so that the store 7695 may have already 7696 completed.) 7697 7698 2. 
*Following 7699 instructions same as 7700 corresponding load 7701 atomic acquire, 7702 except must generated 7703 all instructions even 7704 for OpenCL.* 7705 load atomic seq_cst - workgroup - local 7706 7707 1. s_waitcnt vmcnt(0) & vscnt(0) 7708 7709 - If CU wavefront execution 7710 mode, omit. 7711 - Could be split into 7712 separate s_waitcnt 7713 vmcnt(0) and s_waitcnt 7714 vscnt(0) to allow 7715 them to be 7716 independently moved 7717 according to the 7718 following rules. 7719 - s_waitcnt vmcnt(0) 7720 Must happen after 7721 preceding 7722 global/generic load 7723 atomic/ 7724 atomicrmw-with-return-value 7725 with memory 7726 ordering of seq_cst 7727 and with equal or 7728 wider sync scope. 7729 (Note that seq_cst 7730 fences have their 7731 own s_waitcnt 7732 vmcnt(0) and so do 7733 not need to be 7734 considered.) 7735 - s_waitcnt vscnt(0) 7736 Must happen after 7737 preceding 7738 global/generic store 7739 atomic/ 7740 atomicrmw-no-return-value 7741 with memory 7742 ordering of seq_cst 7743 and with equal or 7744 wider sync scope. 7745 (Note that seq_cst 7746 fences have their 7747 own s_waitcnt 7748 vscnt(0) and so do 7749 not need to be 7750 considered.) 7751 - Ensures any 7752 preceding 7753 sequential 7754 consistent global 7755 memory instructions 7756 have completed 7757 before executing 7758 this sequentially 7759 consistent 7760 instruction. This 7761 prevents reordering 7762 a seq_cst store 7763 followed by a 7764 seq_cst load. (Note 7765 that seq_cst is 7766 stronger than 7767 acquire/release as 7768 the reordering of 7769 load acquire 7770 followed by a store 7771 release is 7772 prevented by the 7773 s_waitcnt of 7774 the release, but 7775 there is nothing 7776 preventing a store 7777 release followed by 7778 load acquire from 7779 completing out of 7780 order. The s_waitcnt 7781 could be placed after 7782 seq_store or before 7783 the seq_load. 
We 7784 choose the load to 7785 make the s_waitcnt be 7786 as late as possible 7787 so that the store 7788 may have already 7789 completed.) 7790 7791 2. *Following 7792 instructions same as 7793 corresponding load 7794 atomic acquire, 7795 except must generated 7796 all instructions even 7797 for OpenCL.* 7798 7799 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 7800 - system - generic vmcnt(0) & vscnt(0) 7801 7802 - Could be split into 7803 separate s_waitcnt 7804 vmcnt(0), s_waitcnt 7805 vscnt(0) and s_waitcnt 7806 lgkmcnt(0) to allow 7807 them to be 7808 independently moved 7809 according to the 7810 following rules. 7811 - s_waitcnt lgkmcnt(0) 7812 must happen after 7813 preceding 7814 local load 7815 atomic/store 7816 atomic/atomicrmw 7817 with memory 7818 ordering of seq_cst 7819 and with equal or 7820 wider sync scope. 7821 (Note that seq_cst 7822 fences have their 7823 own s_waitcnt 7824 lgkmcnt(0) and so do 7825 not need to be 7826 considered.) 7827 - s_waitcnt vmcnt(0) 7828 must happen after 7829 preceding 7830 global/generic load 7831 atomic/ 7832 atomicrmw-with-return-value 7833 with memory 7834 ordering of seq_cst 7835 and with equal or 7836 wider sync scope. 7837 (Note that seq_cst 7838 fences have their 7839 own s_waitcnt 7840 vmcnt(0) and so do 7841 not need to be 7842 considered.) 7843 - s_waitcnt vscnt(0) 7844 Must happen after 7845 preceding 7846 global/generic store 7847 atomic/ 7848 atomicrmw-no-return-value 7849 with memory 7850 ordering of seq_cst 7851 and with equal or 7852 wider sync scope. 7853 (Note that seq_cst 7854 fences have their 7855 own s_waitcnt 7856 vscnt(0) and so do 7857 not need to be 7858 considered.) 7859 - Ensures any 7860 preceding 7861 sequential 7862 consistent global 7863 memory instructions 7864 have completed 7865 before executing 7866 this sequentially 7867 consistent 7868 instruction. This 7869 prevents reordering 7870 a seq_cst store 7871 followed by a 7872 seq_cst load. 
(Note 7873 that seq_cst is 7874 stronger than 7875 acquire/release as 7876 the reordering of 7877 load acquire 7878 followed by a store 7879 release is 7880 prevented by the 7881 s_waitcnt of 7882 the release, but 7883 there is nothing 7884 preventing a store 7885 release followed by 7886 load acquire from 7887 completing out of 7888 order. The s_waitcnt 7889 could be placed after 7890 seq_store or before 7891 the seq_load. We 7892 choose the load to 7893 make the s_waitcnt be 7894 as late as possible 7895 so that the store 7896 may have already 7897 completed.) 7898 7899 2. *Following 7900 instructions same as 7901 corresponding load 7902 atomic acquire, 7903 except must generated 7904 all instructions even 7905 for OpenCL.* 7906 store atomic seq_cst - singlethread - global *Same as corresponding 7907 - wavefront - local store atomic release, 7908 - workgroup - generic except must generated 7909 - agent all instructions even 7910 - system for OpenCL.* 7911 atomicrmw seq_cst - singlethread - global *Same as corresponding 7912 - wavefront - local atomicrmw acq_rel, 7913 - workgroup - generic except must generated 7914 - agent all instructions even 7915 - system for OpenCL.* 7916 fence seq_cst - singlethread *none* *Same as corresponding 7917 - wavefront fence acq_rel, 7918 - workgroup except must generated 7919 - agent all instructions even 7920 - system for OpenCL.* 7921 ============ ============ ============== ========== ================================ 7922 7923Trap Handler ABI 7924~~~~~~~~~~~~~~~~ 7925 7926For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible 7927runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that 7928supports the ``s_trap`` instruction. For usage see: 7929 7930- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table` 7931- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table` 7932- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table` 7933 7934 .. 
  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                         ``queue_ptr``   intrinsic (not implemented).
                                         ``VGPR0``:
                                         ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled then
                                                           this behaves as a no-operation. The
                                                           trap handler is entered and
                                                           immediately returns to continue
                                                           execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..
  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                         breakpoints. Causes wave to be halted
                                                         with the PC at the trap instruction.
                                                         The debugger is responsible for
                                                         resuming the wave, including the
                                                         instruction that the breakpoint
                                                         overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr``   the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled then
                                                           this behaves as a no-operation. The
                                                           trap handler is entered and
                                                           immediately returns to continue
                                                           execution of the wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..
  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-table

     =================== =============== =============== ============== =======================================
     Usage               Code Sequence   GFX6-8 Inputs   GFX9-10 Inputs Description
     =================== =============== =============== ============== =======================================
     reserved            ``s_trap 0x00``                                Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          *none*         Reserved for debugger to use for
                                                                        breakpoints. Causes wave to be halted
                                                                        with the PC at the trap instruction.
                                                                        The debugger is responsible for
                                                                        resuming the wave, including the
                                                                        instruction that the breakpoint
                                                                        overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    *none*         Causes wave to be halted with the PC at
                                         ``queue_ptr``                  the trap instruction. The associated
                                                                        queue is signalled to put it into the
                                                                        error state. When the queue is put in
                                                                        the error state, the waves executing
                                                                        dispatches on the queue will be
                                                                        terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          *none*         - If the debugger is not enabled then
                                                                          this behaves as a no-operation. The
                                                                          trap handler is entered and
                                                                          immediately returns to continue
                                                                          execution of the wavefront.
                                                                        - If the debugger is enabled, causes
                                                                          the debug trap to be reported by the
                                                                          debugger and the wavefront is put in
                                                                          the halt state with the PC at the
                                                                          instruction. The debugger must
                                                                          increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                Reserved.
     reserved            ``s_trap 0x05``                                Reserved.
     reserved            ``s_trap 0x06``                                Reserved.
     reserved            ``s_trap 0x07``                                Reserved.
     reserved            ``s_trap 0x08``                                Reserved.
     reserved            ``s_trap 0xfe``                                Reserved.
     reserved            ``s_trap 0xff``                                Reserved.
     =================== =============== =============== ============== =======================================
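As a compact cross-check of the trap-handler tables, the Code Object V3
dispatch can be modeled as a lookup from ``s_trap`` immediate to documented
usage. This is only an illustrative sketch: ``classify_trap_v3`` and the
dictionary are hypothetical helpers, not part of any LLVM or runtime API; the
immediates and their meanings come from the table above.

```python
# Illustrative model of the Code Object V3 trap-handler table.
# Only the s_trap immediates and their documented usages come from the table;
# the helper itself is hypothetical.
TRAP_USAGE_V3 = {
    0x00: "reserved (hardware)",
    0x01: "debugger breakpoint",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
}

def classify_trap_v3(imm: int) -> str:
    """Map an s_trap immediate to its documented usage; all others are reserved."""
    return TRAP_USAGE_V3.get(imm, "reserved")
```

For example, ``classify_trap_v3(0x02)`` reports the immediate that
``llvm.trap`` lowers to, while any unlisted immediate is reserved.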
.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

   This section is currently incomplete and has inaccuracies. It is a work in
   progress that will be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed as
        by-value struct?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack, which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:
1. SGPR0-3 contain a V# with the following properties (see
   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

   * Base address pointing to the beginning of the wavefront scratch backing
     memory.
   * Swizzled with dword element size and stride of wavefront size elements.

2. The FLAT_SCRATCH register pair is set up. See
   :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3. GFX6-8: The M0 register is set to the size of LDS in bytes. See
   :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 return address (RA). The code address that the function must
   return to when it completes. The value is undefined if the function is *no
   return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
   offset relative to the beginning of the wavefront scratch backing memory.

   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner.

   The unswizzled SP value can be converted into the swizzled SP value by:

     | swizzled SP = unswizzled SP / wavefront size

   This may be used to obtain the private address space address of stack
   objects and to convert this address to a flat address by adding the flat
   scratch aperture base address.

   The swizzled SP value is always 4 byte aligned for the ``r600``
   architecture and 16 byte aligned for the ``amdgcn`` architecture.

   .. note::

      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
      OpenCL language, which has the largest base type defined as 16 bytes.

   On entry, the swizzled SP value is the address of the first function
   argument passed on the stack.
   Other stack-passed arguments are positive
   offsets from the entry swizzled SP value.

   The function may use positive offsets beyond the last stack-passed argument
   for stack allocated local variables and register spill slots. If necessary,
   the function may align these to an alignment greater than 16 bytes. After
   these, the function may dynamically allocate space for such things as
   runtime sized ``alloca`` local allocations.

   If the function calls another function, it will place any stack allocated
   arguments after the last local allocation and adjust SGPR32 to the address
   after the last local allocation.

9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is
    available to the function.

On exit from a function:

1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
   described below. Any registers used are considered clobbered registers.
2. The following registers are preserved and have the same value as on entry:

   * FLAT_SCRATCH
   * EXEC
   * GFX6-8: M0
   * All SGPR registers except the clobbered registers of SGPR4-31.
   * VGPR40-47
     VGPR56-63
     VGPR72-79
     VGPR88-95
     VGPR104-111
     VGPR120-127
     VGPR136-143
     VGPR152-159
     VGPR168-175
     VGPR184-191
     VGPR200-207
     VGPR216-223
     VGPR232-239
     VGPR248-255

     *Except for the argument registers, the clobbered and preserved VGPRs are
     intermixed at regular intervals in order to get better occupancy.*

   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.

3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.

.. TODO::

   - On gfx908 are all ACC registers clobbered?

   - How are function results returned? The address of structured types is
     passed by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location
to make a copy of the struct value and passes the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference.
The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

   Is this definition correct? Or is ``sret`` only used if passing in
   registers, and pass as non-decomposed struct as stack argument? Or something
   else? Is the memory location in the caller stack frame, or a stack memory
   argument and so no address is passed as the caller can directly write to the
   argument stack location? But then the stack location is still live after
   return. If an argument stack location, is it the first stack argument or the
   last one?

Lambda argument types are treated as struct types with an implementation
defined set of fields.

.. TODO::

   Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

   Is recursion or external functions supported?

1. Work-Item ID (1 VGPR)

   The X, Y and Z work-item IDs are packed into a single VGPR with the
   following layout. Only fields actually used by the function are set. The
   other bits are undefined.

   The values come from the initial kernel execution state.
   See :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

   .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2. Dispatch Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3. Queue Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4. Kernarg Segment Ptr (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5. Dispatch Id (2 SGPRs)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6. Work-Group ID X (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7. Work-Group ID Y (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8. Work-Group ID Z (1 SGPR)

   The value comes from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9. Implicit Argument Ptr (2 SGPRs)

   The value is computed by adding an offset to the Kernarg Segment Ptr to get
   the global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:
.. note::

   There are likely some errors and omissions in the following description that
   need correction.

   .. TODO::

      Check the Clang source code to decipher how function arguments and return
      results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

     How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

   So, a struct which can pass some fields as decomposed register arguments
   will pass the rest as decomposed stack elements? But an argument that will
   not start in registers will not be decomposed and will be passed as a
   non-decomposed stack value?

The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:

1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP, it is an
   unswizzled scratch address. It is only needed if runtime sized ``alloca``
   are used, or for the reasons defined in ``SIFrameLowering``.
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
   to access the incoming stack arguments in the function. The BP is needed
   only when the function requires runtime stack alignment.

3. Allocating SGPR arguments on the stack is not supported.
4. No CFI is currently generated. See
   :ref:`amdgpu-dwarf-call-frame-information`.

   .. note::

      CFI will be generated that defines the CFA as the unswizzled address
      relative to the wave scratch base in the unswizzled private address space
      of the lowest address stack allocated local variable.

      ``DW_AT_frame_base`` will be defined as the swizzled address in the
      swizzled private address space by dividing the CFA by the wavefront size
      (since the CFA is always at least dword aligned, which matches the
      scratch swizzle element size).

      If no dynamic stack alignment was performed, the stack allocated
      arguments are accessed as negative offsets relative to
      ``DW_AT_frame_base``, and the local variables and register spill slots
      are accessed as positive offsets relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill if
   necessary. These are copied back to physical registers at call sites. The
   net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

      The CFI will reflect the changed calculation needed to compute the CFA
      from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for
   an emergency spill slot. Buffer instructions are used for stack accesses
   and not the ``flat_scratch`` instruction.

   .. TODO::

      Explain when the emergency spill slot is used.
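The swizzled/unswizzled SP arithmetic used throughout the convention above can
be sketched as follows. The helper names are hypothetical and wave64 is
assumed for the default; only the two conversion rules (divide by the
wavefront size, and add the flat scratch aperture base) come from the text.

```python
WAVEFRONT_SIZE = 64  # assumed wave64 for this sketch

def swizzled_sp(unswizzled_sp: int, wavefront_size: int = WAVEFRONT_SIZE) -> int:
    # swizzled SP = unswizzled SP / wavefront size, per the convention above
    return unswizzled_sp // wavefront_size

def private_to_flat(private_addr: int, flat_scratch_aperture_base: int) -> int:
    # A private (scratch) address becomes a flat address by adding the
    # flat scratch aperture base address.
    return flat_scratch_aperture_base + private_addr
```

For example, an unswizzled SP of 1024 bytes corresponds to a swizzled SP of 16
for a wave64 function, and that private address can then be rebased into the
flat address space for use with flat instructions.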
.. TODO::

   Possible broken issues:

   - Stack arguments must be aligned to the required alignment.
   - Stack is aligned to max(16, max formal argument alignment).
   - Direct argument < 64 bits should check register budget.
   - Register budget calculation should respect ``inreg`` for SGPR.
   - SGPR overflow is not handled.
   - struct with 1 member unpeeling is not checking size of member.
   - ``sret`` is after ``this`` pointer.
   - Caller is not implementing stack realignment: need an extra pointer.
   - Should say AMDGPU passes FP rather than SP.
   - Should CFI define CFA as address of locals or arguments. Difference is
     apparent when have implemented dynamic alignment.
   - If the ``SCRATCH`` instruction could allow negative offsets, then can
     make FP be the highest address of the stack frame and use negative
     offsets for locals. Would allow SP to be the same as FP and could support
     signal-handler-like operation as now have a real SP for the top of the
     stack.
   - How is ``sret`` passed on the stack? In the argument stack area? Can it
     overlay arguments?

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
from the application/runtime to each invocation of a hardware shader. These
parameters include both generic, application-controlled parameters called
*user data* as well as system-generated parameters that are a product of the
draw or dispatch execution.

User Data
~~~~~~~~~

Each hardware stage has a set of 32-bit *user data registers* which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way most
arguments are passed from the application/runtime to a hardware shader.
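A minimal model of this flow, with hypothetical names: the only behavior taken
from the text above is that values written to user data registers from a
command buffer surface in the corresponding consecutive SGPRs of every wave
launched by the next dispatch or draw.

```python
# Illustrative sketch only: models user data registers being copied into
# consecutive SGPRs at wave launch. Real hardware behavior for unwritten
# registers is not specified here; this sketch reads them as 0.
def sgprs_at_launch(user_data_writes: dict, num_user_sgprs: int) -> list:
    """Return the initial SGPR values: SGPR[i] receives user data register i."""
    return [user_data_writes.get(i, 0) for i in range(num_user_sgprs)]
```

For example, command-buffer writes to user data registers 0 and 2 show up in
SGPR0 and SGPR2 of each wave of the subsequent dispatch.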
Compute User Data
~~~~~~~~~~~~~~~~~

Compute shader user data mappings are simpler than graphics shaders and have a
fixed mapping.

Note that there are always 10 available *user data entries* in registers -
entries beyond that limit must be fetched from memory (via the spill table
pointer) by the shader.

  .. table:: PAL Compute Shader User Data Registers
     :name: pal-compute-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     1             Per-Shader Internal Table (32-bit pointer)
     2 - 11        Application-Controlled User Data (10 32-bit values)
     12            Spill Table (32-bit pointer)
     13 - 14       Thread Group Count (64-bit pointer)
     15            GDS Range
     ============= ================================

Graphics User Data
~~~~~~~~~~~~~~~~~~

Graphics pipelines support a much more flexible user data mapping:

  .. table:: PAL Graphics Shader User Data Registers
     :name: pal-graphics-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     +             Per-Shader Internal Table (32-bit pointer)
     + 1-15        Application Controlled User Data
                   (1-15 Contiguous 32-bit Values in Registers)
     +             Spill Table (32-bit pointer)
     +             Draw Index (First Stage Only)
     +             Vertex Offset (First Stage Only)
     +             Instance Offset (First Stage Only)
     ============= ================================

  The placement of the global internal table remains fixed in the first *user
  data SGPR register*.
  Otherwise all parameters are optional, and can be mapped
  to any desired *user data SGPR register*, with the following restrictions:

  * Draw Index, Vertex Offset, and Instance Offset can only be used by the
    first active hardware stage in a graphics pipeline (i.e. where the API
    vertex shader runs).

  * Application-controlled user data must be mapped into a contiguous range of
    user data registers.

  * The application-controlled user data range supports compaction remapping,
    so only *entries* that are actually consumed by the shader must be
    assigned to corresponding *registers*. Note that in order to support an
    efficient runtime implementation, the remapping must pack *registers* in
    the same order as *entries*, with unused *entries* removed.

.. _pal_global_internal_table:

Global Internal Table
~~~~~~~~~~~~~~~~~~~~~

The global internal table is a table of *shader resource descriptors* (SRDs)
that define how certain engine-wide, runtime-managed resources should be
accessed from a shader. The majority of these resources have HW-defined
formats, and it is up to the compiler to write/read data as required by the
target hardware.

The following table illustrates the required format:
  .. table:: PAL Global Internal Table
     :name: pal-git-table

     ============= ================================
     Offset        Description
     ============= ================================
     0-3           Graphics Scratch SRD
     4-7           Compute Scratch SRD
     8-11          ES/GS Ring Output SRD
     12-15         ES/GS Ring Input SRD
     16-19         GS/VS Ring Output #0
     20-23         GS/VS Ring Output #1
     24-27         GS/VS Ring Output #2
     28-31         GS/VS Ring Output #3
     32-35         GS/VS Ring Input SRD
     36-39         Tessellation Factor Buffer SRD
     40-43         Off-Chip LDS Buffer SRD
     44-47         Off-Chip Param Cache Buffer SRD
     48-51         Sample Position Buffer SRD
     52            vaRange::ShadowDescriptorTable High Bits
     ============= ================================

  The pointer to the global internal table passed to the shader as user data
  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
  the top 32 bits of the pipeline, so the shader may use the program
  counter's top 32 bits.

.. _pal_call-convention:

Call Convention
~~~~~~~~~~~~~~~

For graphics use cases, the calling convention is ``amdgpu_gfx``.

.. note::

   ``amdgpu_gfx`` function calls are currently in development and are
   subject to major changes.

This calling convention shares most properties with calling non-kernel
functions (see
:ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`).
Differences are:

- Currently there are none; differences will be listed here.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:
  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     7        8    8         Pointer argument used for multi-grid
                             synchronization.
     ======== ==== ========= ===========================================
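Because each implicit argument in the table above is 8 bytes with 8-byte
alignment, the offsets of the appended arguments follow directly from the size
of the explicit kernarg segment. A sketch, assuming the argument names and the
helper are illustrative placeholders (only the sizes, alignment, and ordering
come from the table):

```python
IMPLICIT_ARGS = [  # order follows the table above; names are hypothetical
    "global_offset_x", "global_offset_y", "global_offset_z",
    "printf_buffer", "virtual_queue", "aql_wrap", "multigrid_sync",
]

def implicit_arg_offsets(explicit_kernarg_size: int) -> dict:
    """Byte offsets of the appended implicit arguments in the kernarg segment."""
    base = (explicit_kernarg_size + 7) & ~7  # round up to 8-byte alignment
    return {name: base + 8 * i for i, name in enumerate(IMPLICIT_ARGS)}
```

For example, a kernel whose explicit arguments occupy 12 bytes would see the
implicit block start at offset 16, with each subsequent argument 8 bytes
later.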
.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX10.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:

  | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
    <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``

:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.

The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.

Links to detailed instruction syntax descriptions may be found in the
following table. Note that features under development are not included in this
description.
  =================================== =======================================
  Core ISA                            ISA Extensions
  =================================== =======================================
  :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
  :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                      :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                      :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                      :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                      :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

  :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                      :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
  =================================== =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
[AMD-GCN-GFX10]_.

Operands
~~~~~~~~

A detailed description of operands may be found
:doc:`here<AMDGPUOperandSyntax>`.

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++
code-block:: nasm 8752 8753 flat_load_dword v1, v[3:4] 8754 flat_store_dwordx3 v[3:4], v[5:7] 8755 flat_atomic_swap v1, v[3:4], v5 glc 8756 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc 8757 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc 8758 8759For full list of supported instructions, refer to "FLAT instructions" in ISA 8760Manual. 8761 8762MUBUF 8763+++++ 8764 8765.. code-block:: nasm 8766 8767 buffer_load_dword v1, off, s[4:7], s1 8768 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe 8769 buffer_store_format_xy v[1:2], off, s[4:7], s1 8770 buffer_wbinvl1 8771 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc 8772 8773For full list of supported instructions, refer to "MUBUF Instructions" in ISA 8774Manual. 8775 8776SMRD/SMEM 8777+++++++++ 8778 8779.. code-block:: nasm 8780 8781 s_load_dword s1, s[2:3], 0xfc 8782 s_load_dwordx8 s[8:15], s[2:3], s4 8783 s_load_dwordx16 s[88:103], s[2:3], s4 8784 s_dcache_inv_vol 8785 s_memtime s[4:5] 8786 8787For full list of supported instructions, refer to "Scalar Memory Operations" in 8788ISA Manual. 8789 8790SOP1 8791++++ 8792 8793.. code-block:: nasm 8794 8795 s_mov_b32 s1, s2 8796 s_mov_b64 s[0:1], 0x80000000 8797 s_cmov_b32 s1, 200 8798 s_wqm_b64 s[2:3], s[4:5] 8799 s_bcnt0_i32_b64 s1, s[2:3] 8800 s_swappc_b64 s[2:3], s[4:5] 8801 s_cbranch_join s[4:5] 8802 8803For full list of supported instructions, refer to "SOP1 Instructions" in ISA 8804Manual. 8805 8806SOP2 8807++++ 8808 8809.. code-block:: nasm 8810 8811 s_add_u32 s1, s2, s3 8812 s_and_b64 s[2:3], s[4:5], s[6:7] 8813 s_cselect_b32 s1, s2, s3 8814 s_andn2_b32 s2, s4, s6 8815 s_lshr_b64 s[2:3], s[4:5], s6 8816 s_ashr_i32 s2, s4, s6 8817 s_bfm_b64 s[2:3], s4, s6 8818 s_bfe_i64 s[2:3], s[4:5], s6 8819 s_cbranch_g_fork s[4:5], s[6:7] 8820 8821For full list of supported instructions, refer to "SOP2 Instructions" in ISA 8822Manual. 8823 8824SOPC 8825++++ 8826 8827.. 
code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For full list of supported instructions, refer to "SOPC Instructions" in ISA
Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For full list of supported instructions, refer to "SOPP Instructions" in ISA
Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on its
operands. To force a specific encoding, one can add a suffix to the opcode of
the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

..
code-block:: nasm 8893 8894 v_mov_b32 v0, v0 quad_perm:[0,2,1,1] 8895 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 8896 v_mov_b32 v0, v0 wave_shl:1 8897 v_mov_b32 v0, v0 row_mirror 8898 v_mov_b32 v0, v0 row_bcast:31 8899 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 8900 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 8901 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 8902 8903VOP_SDWA examples: 8904 8905.. code-block:: nasm 8906 8907 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD 8908 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD 8909 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 8910 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 8911 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 8912 8913For full list of supported instructions, refer to "Vector ALU instructions". 8914 8915.. _amdgpu-amdhsa-assembler-predefined-symbols-v2: 8916 8917Code Object V2 Predefined Symbols 8918~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 8919 8920.. warning:: 8921 Code object V2 is not the default code object version emitted by 8922 this version of LLVM. 8923 8924The AMDGPU assembler defines and updates some symbols automatically. These 8925symbols do not affect code generation. 8926 8927.option.machine_version_major 8928+++++++++++++++++++++++++++++ 8929 8930Set to the GFX major generation number of the target being assembled for. For 8931example, when assembling for a "GFX9" target this will be set to the integer 8932value "9". The possible GFX major generation numbers are presented in 8933:ref:`amdgpu-processors`. 8934 8935.option.machine_version_minor 8936+++++++++++++++++++++++++++++ 8937 8938Set to the GFX minor generation number of the target being assembled for. For 8939example, when assembling for a "GFX810" target this will be set to the integer 8940value "1". 
The possible GFX minor generation numbers are presented in
:ref:`amdgpu-processors`.

.option.machine_version_stepping
++++++++++++++++++++++++++++++++

Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a "GFX704" target this will be set to the
integer value "4". The possible GFX stepping generation numbers are presented
in :ref:`amdgpu-processors`.

.kernel.vgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus
one.

.kernel.sgpr_count
++++++++++++++++++

Set to zero each time a
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.

.. _amdgpu-amdhsa-assembler-directives-v2:

Code Object V2 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, it can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.
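For example, to request a version 1.0 HSA code object (the version values here
are illustrative; use the version appropriate for your runtime):

.. code-block:: nasm

  .hsa_code_object_version 1,0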

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified, a default value will be used.
The default value for all keys is 0, with the following exceptions:

- *amd_kernel_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5.
9029 Note that wavefront size is specified as a power of two, so a value of **n** 9030 means a size of 2^ **n**. 9031- *call_convention* defaults to -1. 9032- *kernarg_segment_alignment*, *group_segment_alignment*, and 9033 *private_segment_alignment* default to 4. Note that alignments are specified 9034 as a power of 2, so a value of **n** means an alignment of 2^ **n**. 9035- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for 9036 GFX10 onwards. 9037- *enable_mem_ordered* defaults to 1 for GFX10 onwards. 9038 9039The *.amd_kernel_code_t* directive must be placed immediately after the 9040function label and before any instructions. 9041 9042For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, 9043comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. 9044 9045.. _amdgpu-amdhsa-assembler-example-v2: 9046 9047Code Object V2 Example Source Code 9048~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9049 9050.. warning:: 9051 Code Object V2 is not the default code object version emitted by 9052 this version of LLVM. 9053 9054Here is an example of a minimal assembly source file, defining one HSA kernel: 9055 9056.. code:: 9057 :number-lines: 9058 9059 .hsa_code_object_version 1,0 9060 .hsa_code_object_isa 9061 9062 .hsatext 9063 .globl hello_world 9064 .p2align 8 9065 .amdgpu_hsa_kernel hello_world 9066 9067 hello_world: 9068 9069 .amd_kernel_code_t 9070 enable_sgpr_kernarg_segment_ptr = 1 9071 is_ptr64 = 1 9072 compute_pgm_rsrc1_vgprs = 0 9073 compute_pgm_rsrc1_sgprs = 0 9074 compute_pgm_rsrc2_user_sgpr = 2 9075 compute_pgm_rsrc1_wgp_mode = 0 9076 compute_pgm_rsrc1_mem_ordered = 0 9077 compute_pgm_rsrc1_fwd_progress = 1 9078 .end_amd_kernel_code_t 9079 9080 s_load_dwordx2 s[0:1], s[0:1] 0x0 9081 v_mov_b32 v0, 3.14159 9082 s_waitcnt lgkmcnt(0) 9083 v_mov_b32 v1, s0 9084 v_mov_b32 v2, s1 9085 flat_store_dword v[1:2], v0 9086 s_endpgm 9087 .Lfunc_end0: 9088 .size hello_world, .Lfunc_end0-hello_world 9089 9090.. 
_amdgpu-amdhsa-assembler-predefined-symbols-v3-v4: 9091 9092Code Object V3 to V4 Predefined Symbols 9093~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9094 9095The AMDGPU assembler defines and updates some symbols automatically. These 9096symbols do not affect code generation. 9097 9098.amdgcn.gfx_generation_number 9099+++++++++++++++++++++++++++++ 9100 9101Set to the GFX major generation number of the target being assembled for. For 9102example, when assembling for a "GFX9" target this will be set to the integer 9103value "9". The possible GFX major generation numbers are presented in 9104:ref:`amdgpu-processors`. 9105 9106.amdgcn.gfx_generation_minor 9107++++++++++++++++++++++++++++ 9108 9109Set to the GFX minor generation number of the target being assembled for. For 9110example, when assembling for a "GFX810" target this will be set to the integer 9111value "1". The possible GFX minor generation numbers are presented in 9112:ref:`amdgpu-processors`. 9113 9114.amdgcn.gfx_generation_stepping 9115+++++++++++++++++++++++++++++++ 9116 9117Set to the GFX stepping generation number of the target being assembled for. 9118For example, when assembling for a "GFX704" target this will be set to the 9119integer value "4". The possible GFX stepping generation numbers are presented 9120in :ref:`amdgpu-processors`. 9121 9122.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr: 9123 9124.amdgcn.next_free_vgpr 9125++++++++++++++++++++++ 9126 9127Set to zero before assembly begins. At each instruction, if the current value 9128of this symbol is less than or equal to the maximum VGPR number explicitly 9129referenced within that instruction then the symbol value is updated to equal 9130that VGPR number plus one. 9131 9132May be used to set the `.amdhsa_next_free_vgpr` directive in 9133:ref:`amdhsa-kernel-directives-table`. 9134 9135May be set at any time, e.g. manually set to zero at the start of each kernel. 9136 9137.. 
_amdgpu-amdhsa-assembler-symbol-next_free_sgpr:

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.. _amdgpu-amdhsa-assembler-directives-v3-v4:

Code Object V3 to V4 Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.amdgcn_target <target-triple> "-" <target-id>
++++++++++++++++++++++++++++++++++++++++++++++

Optional directive which declares the ``<target-triple>-<target-id>`` supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as ``-triple``, ``-mcpu``, and
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.
9179 9180Marks the beginning of a list of directives used to generate the bytes of a 9181kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. 9182Directives which may appear in this list are described in 9183:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must 9184be valid for the target being assembled for, and cannot be repeated. Directives 9185support the range of values specified by the field they reference in 9186:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is 9187assumed to have its default value, unless it is marked as "Required", in which 9188case it is an error to omit the directive. This list of directives is 9189terminated by an ``.end_amdhsa_kernel`` directive. 9190 9191 .. table:: AMDHSA Kernel Assembler Directives 9192 :name: amdhsa-kernel-directives-table 9193 9194 ======================================================== =================== ============ =================== 9195 Directive Default Supported On Description 9196 ======================================================== =================== ============ =================== 9197 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in 9198 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9199 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in 9200 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9201 ``.amdhsa_kernarg_size`` 0 GFX6-GFX10 Controls KERNARG_SIZE in 9202 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9203 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in 9204 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9205 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in 9206 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 
9207 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in 9208 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9209 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in 9210 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9211 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in 9212 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9213 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in 9214 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9215 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in 9216 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9217 ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in 9218 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9219 Specific 9220 (wavefrontsize64) 9221 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in 9222 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9223 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in 9224 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9225 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in 9226 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9227 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in 9228 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9229 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in 9230 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9231 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in 9232 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 
9233 Possible values are defined in 9234 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. 9235 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one. 9236 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in 9237 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9238 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one. 9239 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 9240 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9241 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR. 9242 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 9243 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9244 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access 9245 scratch memory. Used to calculate 9246 GRANULATED_WAVEFRONT_SGPR_COUNT in 9247 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9248 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. 9249 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in 9250 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9251 (xnack) 9252 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in 9253 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9254 Possible values are defined in 9255 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 9256 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in 9257 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9258 Possible values are defined in 9259 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. 9260 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in 9261 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 
9262 Possible values are defined in 9263 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 9264 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in 9265 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9266 Possible values are defined in 9267 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 9268 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in 9269 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9270 ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in 9271 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9272 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in 9273 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9274 ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in 9275 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`. 9276 Specific 9277 (cumode) 9278 ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in 9279 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9280 ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in 9281 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 9282 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in 9283 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9284 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in 9285 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9286 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in 9287 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9288 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in 9289 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 
9290 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in 9291 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9292 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in 9293 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9294 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in 9295 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 9296 ======================================================== =================== ============ =================== 9297 9298.amdgpu_metadata 9299++++++++++++++++ 9300 9301Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` 9302note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`). 9303 9304The contents must be in the [YAML]_ markup format, with the same structure and 9305semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or 9306:ref:`amdgpu-amdhsa-code-object-metadata-v4`. 9307 9308This directive is terminated by an ``.end_amdgpu_metadata`` directive. 9309 9310.. _amdgpu-amdhsa-assembler-example-v3-v4: 9311 9312Code Object V3 to V4 Example Source Code 9313~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9314 9315Here is an example of a minimal assembly source file, defining one HSA kernel: 9316 9317.. 
code:: 9318 :number-lines: 9319 9320 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 9321 9322 .text 9323 .globl hello_world 9324 .p2align 8 9325 .type hello_world,@function 9326 hello_world: 9327 s_load_dwordx2 s[0:1], s[0:1] 0x0 9328 v_mov_b32 v0, 3.14159 9329 s_waitcnt lgkmcnt(0) 9330 v_mov_b32 v1, s0 9331 v_mov_b32 v2, s1 9332 flat_store_dword v[1:2], v0 9333 s_endpgm 9334 .Lfunc_end0: 9335 .size hello_world, .Lfunc_end0-hello_world 9336 9337 .rodata 9338 .p2align 6 9339 .amdhsa_kernel hello_world 9340 .amdhsa_user_sgpr_kernarg_segment_ptr 1 9341 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 9342 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 9343 .end_amdhsa_kernel 9344 9345 .amdgpu_metadata 9346 --- 9347 amdhsa.version: 9348 - 1 9349 - 0 9350 amdhsa.kernels: 9351 - .name: hello_world 9352 .symbol: hello_world.kd 9353 .kernarg_segment_size: 48 9354 .group_segment_fixed_size: 0 9355 .private_segment_fixed_size: 0 9356 .kernarg_segment_align: 4 9357 .wavefront_size: 64 9358 .sgpr_count: 2 9359 .vgpr_count: 3 9360 .max_flat_workgroup_size: 256 9361 ... 9362 .end_amdgpu_metadata 9363 9364If an assembly source file contains multiple kernels and/or functions, the 9365:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and 9366:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using 9367the ``.set <symbol>, <expression>`` directive. For example, in the case of two 9368kernels, where ``function1`` is only called from ``kernel1`` it is sufficient 9369to group the function with the kernel that calls it and reset the symbols 9370between the two connected components: 9371 9372.. code:: 9373 :number-lines: 9374 9375 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional 9376 9377 // gpr tracking symbols are implicitly set to zero 9378 9379 .text 9380 .globl kern0 9381 .p2align 8 9382 .type kern0,@function 9383 kern0: 9384 // ... 
9385 s_endpgm 9386 .Lkern0_end: 9387 .size kern0, .Lkern0_end-kern0 9388 9389 .rodata 9390 .p2align 6 9391 .amdhsa_kernel kern0 9392 // ... 9393 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 9394 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 9395 .end_amdhsa_kernel 9396 9397 // reset symbols to begin tracking usage in func1 and kern1 9398 .set .amdgcn.next_free_vgpr, 0 9399 .set .amdgcn.next_free_sgpr, 0 9400 9401 .text 9402 .hidden func1 9403 .global func1 9404 .p2align 2 9405 .type func1,@function 9406 func1: 9407 // ... 9408 s_setpc_b64 s[30:31] 9409 .Lfunc1_end: 9410 .size func1, .Lfunc1_end-func1 9411 9412 .globl kern1 9413 .p2align 8 9414 .type kern1,@function 9415 kern1: 9416 // ... 9417 s_getpc_b64 s[4:5] 9418 s_add_u32 s4, s4, func1@rel32@lo+4 9419 s_addc_u32 s5, s5, func1@rel32@lo+4 9420 s_swappc_b64 s[30:31], s[4:5] 9421 // ... 9422 s_endpgm 9423 .Lkern1_end: 9424 .size kern1, .Lkern1_end-kern1 9425 9426 .rodata 9427 .p2align 6 9428 .amdhsa_kernel kern1 9429 // ... 9430 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr 9431 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr 9432 .end_amdhsa_kernel 9433 9434These symbols cannot identify connected components in order to automatically 9435track the usage for each kernel. However, in some cases careful organization of 9436the kernels and functions in the source file means there is minimal additional 9437effort required to accurately calculate GPR usage. 9438 9439Additional Documentation 9440======================== 9441 9442.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__ 9443.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_ 9444.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__ 9445.. 
[AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__ 9446.. [AMD-GCN-GFX10] `AMD "RDNA 1.0" Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__ 9447.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__ 9448.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__ 9449.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__ 9450.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__ 9451.. [AMD-ROCm] `AMD ROCm Platform <https://rocmdocs.amd.com/>`__ 9452.. [AMD-ROCm-github] `AMD ROCm github <http://github.com/RadeonOpenCompute>`__ 9453.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__ 9454.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__ 9455.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__ 9456.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__ 9457.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__ 9458.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__ 9459.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__ 9460.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__ 9461.. [SEMVER] `Semantic Versioning <https://semver.org/>`__ 9462.. 
[YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__ 9463