1============================= 2User Guide for NVPTX Back-end 3============================= 4 5.. contents:: 6 :local: 7 :depth: 3 8 9 10Introduction 11============ 12 13To support GPU programming, the NVPTX back-end supports a subset of LLVM IR 14along with a defined set of conventions used to represent GPU programming 15concepts. This document provides an overview of the general usage of the back- 16end, including a description of the conventions used and the set of accepted 17LLVM IR. 18 19.. note:: 20 21 This document assumes a basic familiarity with CUDA and the PTX 22 assembly language. Information about the CUDA Driver API and the PTX assembly 23 language can be found in the `CUDA documentation 24 <http://docs.nvidia.com/cuda/index.html>`_. 25 26 27 28Conventions 29=========== 30 31Marking Functions as Kernels 32---------------------------- 33 34In PTX, there are two types of functions: *device functions*, which are only 35callable by device code, and *kernel functions*, which are callable by host 36code. By default, the back-end will emit device functions. Metadata is used to 37declare a function as a kernel function. This metadata is attached to the 38``nvvm.annotations`` named metadata object, and has the following format: 39 40.. code-block:: llvm 41 42 !0 = metadata !{<function-ref>, metadata !"kernel", i32 1} 43 44The first parameter is a reference to the kernel function. The following 45example shows a kernel function calling a device function in LLVM IR. The 46function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not. 47 48.. code-block:: llvm 49 50 define float @my_fmad(float %x, float %y, float %z) { 51 %mul = fmul float %x, %y 52 %add = fadd float %mul, %z 53 ret float %add 54 } 55 56 define void @my_kernel(float* %ptr) { 57 %val = load float* %ptr 58 %ret = call float @my_fmad(float %val, float %val, float %val) 59 store float %ret, float* %ptr 60 ret void 61 } 62 63 !nvvm.annotations = !{!1} 64 !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1} 65 66When compiled, the PTX kernel functions are callable by host-side code. 67 68 69.. _address_spaces: 70 71Address Spaces 72-------------- 73 74The NVPTX back-end uses the following address space mapping: 75 76 ============= ====================== 77 Address Space Memory Space 78 ============= ====================== 79 0 Generic 80 1 Global 81 2 Internal Use 82 3 Shared 83 4 Constant 84 5 Local 85 ============= ====================== 86 87Every global variable and pointer type is assigned to one of these address 88spaces, with 0 being the default address space. Intrinsics are provided which 89can be used to convert pointers between the generic and non-generic address 90spaces. 91 92As an example, the following IR will define an array ``@g`` that resides in 93global device memory. 94 95.. code-block:: llvm 96 97 @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ] 98 99LLVM IR functions can read and write to this array, and host-side code can 100copy data to it by name with the CUDA Driver API. 101 102Note that since address space 0 is the generic space, it is illegal to have 103global variables in address space 0. Address space 0 is the default address 104space in LLVM, so the ``addrspace(N)`` annotation is *required* for global 105variables. 106 107 108Triples 109------- 110 111The NVPTX target uses the module triple to select between 32/64-bit code 112generation and the driver-compiler interface to use. The triple architecture 113can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The 114operating system should be one of ``cuda`` or ``nvcl``, which determines the 115interface used by the generated code to communicate with the driver. Most 116users will want to use ``cuda`` as the operating system, which makes the 117generated PTX compatible with the CUDA Driver API. 118 119Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda`` 120 121Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda`` 122 123 124 125.. _nvptx_intrinsics: 126 127NVPTX Intrinsics 128================ 129 130Address Space Conversion 131------------------------ 132 133'``llvm.nvvm.ptr.*.to.gen``' Intrinsics 134^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 135 136Syntax: 137""""""" 138 139These are overloaded intrinsics. You can use these on any pointer types. 140 141.. code-block:: llvm 142 143 declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*) 144 declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*) 145 declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*) 146 declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*) 147 148Overview: 149""""""""" 150 151The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic 152address space to a generic address space pointer. 153 154Semantics: 155"""""""""" 156 157These intrinsics modify the pointer value to be a valid generic address space 158pointer. 159 160 161'``llvm.nvvm.ptr.gen.to.*``' Intrinsics 162^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 163 164Syntax: 165""""""" 166 167These are overloaded intrinsics. You can use these on any pointer types. 168 169.. code-block:: llvm 170 171 declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*) 172 declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*) 173 declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*) 174 declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*) 175 176Overview: 177""""""""" 178 179The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic 180address space to a pointer in the target address space. Note that these 181intrinsics are only useful if the address space of the target address space of 182the pointer is known. It is not legal to use address space conversion 183intrinsics to convert a pointer from one non-generic address space to another 184non-generic address space. 185 186Semantics: 187"""""""""" 188 189These intrinsics modify the pointer value to be a valid pointer in the target 190non-generic address space. 191 192 193Reading PTX Special Registers 194----------------------------- 195 196'``llvm.nvvm.read.ptx.sreg.*``' 197^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 198 199Syntax: 200""""""" 201 202.. code-block:: llvm 203 204 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() 205 declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() 206 declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() 207 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() 208 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() 209 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() 210 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() 211 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() 212 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z() 213 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() 214 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y() 215 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z() 216 declare i32 @llvm.nvvm.read.ptx.sreg.warpsize() 217 218Overview: 219""""""""" 220 221The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX 222special registers, in particular the kernel launch bounds. These registers 223map in the following way to CUDA builtins: 224 225 ============ ===================================== 226 CUDA Builtin PTX Special Register Intrinsic 227 ============ ===================================== 228 ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*`` 229 ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*`` 230 ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*`` 231 ``gridDim`` ``@llvm.nvvm.read.ptx.sreg.nctaid.*`` 232 ============ ===================================== 233 234 235Barriers 236-------- 237 238'``llvm.nvvm.barrier0``' 239^^^^^^^^^^^^^^^^^^^^^^^^^^^ 240 241Syntax: 242""""""" 243 244.. code-block:: llvm 245 246 declare void @llvm.nvvm.barrier0() 247 248Overview: 249""""""""" 250 251The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0`` 252instruction, equivalent to the ``__syncthreads()`` call in CUDA. 253 254 255Other Intrinsics 256---------------- 257 258For the full set of NVPTX intrinsics, please see the 259``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree. 260 261 262.. _libdevice: 263 264Linking with Libdevice 265====================== 266 267The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that 268implements many common mathematical functions. This library can be used as a 269high-performance math library for any compilers using the LLVM NVPTX target. 270The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and 271there is a separate version for each compute architecture. 272 273For a list of all math functions implemented in libdevice, see 274`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_. 275 276To accommodate various math-related compiler flags that can affect code 277generation of libdevice code, the library code depends on a special LLVM IR 278pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This 279pass looks for calls to the ``@__nvvm_reflect`` function and replaces them 280with constants based on the defined reflection parameters. Such conditional 281code often follows a pattern: 282 283.. code-block:: c++ 284 285 float my_function(float a) { 286 if (__nvvm_reflect("FASTMATH")) 287 return my_function_fast(a); 288 else 289 return my_function_precise(a); 290 } 291 292The default value for all unspecified reflection parameters is zero. 293 294The ``NVVMReflect`` pass should be executed early in the optimization 295pipeline, immediately after the link stage. The ``internalize`` pass is also 296recommended to remove unused math functions from the resulting PTX. For an 297input IR module ``module.bc``, the following compilation flow is recommended: 298 2991. Save list of external functions in ``module.bc`` 3002. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc`` 3013. Internalize all functions not in list from (1) 3024. Eliminate all unused internal functions 3035. Run ``NVVMReflect`` pass 3046. Run standard optimization pipeline 305 306.. note:: 307 308 ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the 309 libdevice functions. It is possible to link two IR modules that have been 310 linked against libdevice using different reflection variables. 311 312Since the ``NVVMReflect`` pass replaces conditionals with constants, it will 313often leave behind dead code of the form: 314 315.. code-block:: llvm 316 317 entry: 318 .. 319 br i1 true, label %foo, label %bar 320 foo: 321 .. 322 bar: 323 ; Dead code 324 .. 325 326Therefore, it is recommended that ``NVVMReflect`` is executed early in the 327optimization pipeline before dead-code elimination. 328 329 330Reflection Parameters 331--------------------- 332 333The libdevice library currently uses the following reflection parameters to 334control code generation: 335 336==================== ====================================================== 337Flag Description 338==================== ====================================================== 339``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero 340==================== ====================================================== 341 342 343Invoking NVVMReflect 344-------------------- 345 346To ensure that all dead code caused by the reflection pass is eliminated, it 347is recommended that the reflection pass is executed early in the LLVM IR 348optimization pipeline. The pass takes an optional mapping of reflection 349parameter name to an integer value. This mapping can be specified as either a 350command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when 351programmatically creating a pass pipeline. 352 353With ``opt``: 354 355.. code-block:: text 356 357 # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc 358 359 360With programmatic pass pipeline: 361 362.. code-block:: c++ 363 364 extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping); 365 366 StringMap<int> ReflectParams; 367 ReflectParams["__CUDA_FTZ"] = 1; 368 Passes.add(createNVVMReflectPass(ReflectParams)); 369 370 371 372Executing PTX 373============= 374 375The most common way to execute PTX assembly on a GPU device is to use the CUDA 376Driver API. This API is a low-level interface to the GPU driver and allows for 377JIT compilation of PTX code to native GPU machine code. 378 379Initializing the Driver API: 380 381.. code-block:: c++ 382 383 CUdevice device; 384 CUcontext context; 385 386 // Initialize the driver API 387 cuInit(0); 388 // Get a handle to the first compute device 389 cuDeviceGet(&device, 0); 390 // Create a compute device context 391 cuCtxCreate(&context, 0, device); 392 393JIT compiling a PTX string to a device binary: 394 395.. code-block:: c++ 396 397 CUmodule module; 398 CUfunction funcion; 399 400 // JIT compile a null-terminated PTX string 401 cuModuleLoadData(&module, (void*)PTXString); 402 403 // Get a handle to the "myfunction" kernel function 404 cuModuleGetFunction(&function, module, "myfunction"); 405 406For full examples of executing PTX assembly, please see the `CUDA Samples 407<https://developer.nvidia.com/cuda-downloads>`_ distribution. 408 409 410Common Issues 411============= 412 413ptxas complains of undefined function: __nvvm_reflect 414----------------------------------------------------- 415 416When linking with libdevice, the ``NVVMReflect`` pass must be used. See 417:ref:`libdevice` for more information. 418 419 420Tutorial: A Simple Compute Kernel 421================================= 422 423To start, let us take a look at a simple compute kernel written directly in 424LLVM IR. The kernel implements vector addition, where each thread computes one 425element of the output vector C from the input vectors A and B. To make this 426easier, we also assume that only a single CTA (thread block) will be launched, 427and that it will be one dimensional. 428 429 430The Kernel 431---------- 432 433.. code-block:: llvm 434 435 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 436 target triple = "nvptx64-nvidia-cuda" 437 438 ; Intrinsic to read X component of thread ID 439 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 440 441 define void @kernel(float addrspace(1)* %A, 442 float addrspace(1)* %B, 443 float addrspace(1)* %C) { 444 entry: 445 ; What is my ID? 446 %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 447 448 ; Compute pointers into A, B, and C 449 %ptrA = getelementptr float addrspace(1)* %A, i32 %id 450 %ptrB = getelementptr float addrspace(1)* %B, i32 %id 451 %ptrC = getelementptr float addrspace(1)* %C, i32 %id 452 453 ; Read A, B 454 %valA = load float addrspace(1)* %ptrA, align 4 455 %valB = load float addrspace(1)* %ptrB, align 4 456 457 ; Compute C = A + B 458 %valC = fadd float %valA, %valB 459 460 ; Store back to C 461 store float %valC, float addrspace(1)* %ptrC, align 4 462 463 ret void 464 } 465 466 !nvvm.annotations = !{!0} 467 !0 = metadata !{void (float addrspace(1)*, 468 float addrspace(1)*, 469 float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} 470 471 472We can use the LLVM ``llc`` tool to directly run the NVPTX code generator: 473 474.. code-block:: text 475 476 # llc -mcpu=sm_20 kernel.ll -o kernel.ptx 477 478 479.. note:: 480 481 If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32`` 482 in the module data layout string and use ``nvptx-nvidia-cuda`` as the 483 target triple. 484 485 486The output we get from ``llc`` (as of LLVM 3.4): 487 488.. code-block:: text 489 490 // 491 // Generated by LLVM NVPTX Back-End 492 // 493 494 .version 3.1 495 .target sm_20 496 .address_size 64 497 498 // .globl kernel 499 // @kernel 500 .visible .entry kernel( 501 .param .u64 kernel_param_0, 502 .param .u64 kernel_param_1, 503 .param .u64 kernel_param_2 504 ) 505 { 506 .reg .f32 %f<4>; 507 .reg .s32 %r<2>; 508 .reg .s64 %rl<8>; 509 510 // BB#0: // %entry 511 ld.param.u64 %rl1, [kernel_param_0]; 512 mov.u32 %r1, %tid.x; 513 mul.wide.s32 %rl2, %r1, 4; 514 add.s64 %rl3, %rl1, %rl2; 515 ld.param.u64 %rl4, [kernel_param_1]; 516 add.s64 %rl5, %rl4, %rl2; 517 ld.param.u64 %rl6, [kernel_param_2]; 518 add.s64 %rl7, %rl6, %rl2; 519 ld.global.f32 %f1, [%rl3]; 520 ld.global.f32 %f2, [%rl5]; 521 add.f32 %f3, %f1, %f2; 522 st.global.f32 [%rl7], %f3; 523 ret; 524 } 525 526 527Dissecting the Kernel 528--------------------- 529 530Now let us dissect the LLVM IR that makes up this kernel. 531 532Data Layout 533^^^^^^^^^^^ 534 535The data layout string determines the size in bits of common data types, their 536ABI alignment, and their storage size. For NVPTX, you should use one of the 537following: 538 53932-bit PTX: 540 541.. code-block:: llvm 542 543 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 544 54564-bit PTX: 546 547.. code-block:: llvm 548 549 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 550 551 552Target Intrinsics 553^^^^^^^^^^^^^^^^^ 554 555In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to 556read the X component of the current thread's ID, which corresponds to a read 557of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of 558intrinsics. A short list is shown below; please see 559``include/llvm/IR/IntrinsicsNVVM.td`` for the full list. 560 561 562================================================ ==================== 563Intrinsic CUDA Equivalent 564================================================ ==================== 565``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`` threadIdx.{x,y,z} 566``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`` blockIdx.{x,y,z} 567``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`` blockDim.{x,y,z} 568``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`` gridDim.{x,y,z} 569``void @llvm.cuda.syncthreads()`` __syncthreads() 570================================================ ==================== 571 572 573Address Spaces 574^^^^^^^^^^^^^^ 575 576You may have noticed that all of the pointer types in the LLVM IR example had 577an explicit address space specifier. What is address space 1? NVIDIA GPU 578devices (generally) have four types of memory: 579 580- Global: Large, off-chip memory 581- Shared: Small, on-chip memory shared among all threads in a CTA 582- Local: Per-thread, private memory 583- Constant: Read-only memory shared across all threads 584 585These different types of memory are represented in LLVM IR as address spaces. 586There is also a fifth address space used by the NVPTX code generator that 587corresponds to the "generic" address space. This address space can represent 588addresses in any other address space (with a few exceptions). This allows 589users to write IR functions that can load/store memory using the same 590instructions. Intrinsics are provided to convert pointers between the generic 591and non-generic address spaces. 592 593See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information. 594 595 596Kernel Metadata 597^^^^^^^^^^^^^^^ 598 599In PTX, a function can be either a `kernel` function (callable from the host 600program), or a `device` function (callable only from GPU code). You can think 601of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR 602function as a `kernel` function, we make use of special LLVM metadata. The 603NVPTX back-end will look for a named metadata node called 604``nvvm.annotations``. This named metadata must contain a list of metadata that 605describe the IR. For our purposes, we need to declare a metadata node that 606assigns the "kernel" attribute to the LLVM IR function that should be emitted 607as a PTX `kernel` function. These metadata nodes take the form: 608 609.. code-block:: text 610 611 metadata !{<function ref>, metadata !"kernel", i32 1} 612 613For the previous example, we have: 614 615.. code-block:: llvm 616 617 !nvvm.annotations = !{!0} 618 !0 = metadata !{void (float addrspace(1)*, 619 float addrspace(1)*, 620 float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} 621 622Here, we have a single metadata declaration in ``nvvm.annotations``. This 623metadata annotates our ``@kernel`` function with the ``kernel`` attribute. 624 625 626Running the Kernel 627------------------ 628 629Generating PTX from LLVM IR is all well and good, but how do we execute it on 630a real GPU device? The CUDA Driver API provides a convenient mechanism for 631loading and JIT compiling PTX to a native GPU device, and launching a kernel. 632The API is similar to OpenCL. A simple example showing how to load and 633execute our vector addition code is shown below. Note that for brevity this 634code does not perform much error checking! 635 636.. note:: 637 638 You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline 639 compile PTX to machine code (SASS) for a specific GPU architecture. Such 640 binaries can be loaded by the CUDA Driver API in the same way as PTX. This 641 can be useful for reducing startup time by precompiling the PTX kernels. 642 643 644.. code-block:: c++ 645 646 #include <iostream> 647 #include <fstream> 648 #include <cassert> 649 #include "cuda.h" 650 651 652 void checkCudaErrors(CUresult err) { 653 assert(err == CUDA_SUCCESS); 654 } 655 656 /// main - Program entry point 657 int main(int argc, char **argv) { 658 CUdevice device; 659 CUmodule cudaModule; 660 CUcontext context; 661 CUfunction function; 662 CUlinkState linker; 663 int devCount; 664 665 // CUDA initialization 666 checkCudaErrors(cuInit(0)); 667 checkCudaErrors(cuDeviceGetCount(&devCount)); 668 checkCudaErrors(cuDeviceGet(&device, 0)); 669 670 char name[128]; 671 checkCudaErrors(cuDeviceGetName(name, 128, device)); 672 std::cout << "Using CUDA Device [0]: " << name << "\n"; 673 674 int devMajor, devMinor; 675 checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device)); 676 std::cout << "Device Compute Capability: " 677 << devMajor << "." << devMinor << "\n"; 678 if (devMajor < 2) { 679 std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n"; 680 return 1; 681 } 682 683 std::ifstream t("kernel.ptx"); 684 if (!t.is_open()) { 685 std::cerr << "kernel.ptx not found\n"; 686 return 1; 687 } 688 std::string str((std::istreambuf_iterator<char>(t)), 689 std::istreambuf_iterator<char>()); 690 691 // Create driver context 692 checkCudaErrors(cuCtxCreate(&context, 0, device)); 693 694 // Create module for object 695 checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0)); 696 697 // Get kernel function 698 checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel")); 699 700 // Device data 701 CUdeviceptr devBufferA; 702 CUdeviceptr devBufferB; 703 CUdeviceptr devBufferC; 704 705 checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16)); 706 checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16)); 707 checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16)); 708 709 float* hostA = new float[16]; 710 float* hostB = new float[16]; 711 float* hostC = new float[16]; 712 713 // Populate input 714 for (unsigned i = 0; i != 16; ++i) { 715 hostA[i] = (float)i; 716 hostB[i] = (float)(2*i); 717 hostC[i] = 0.0f; 718 } 719 720 checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16)); 721 checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16)); 722 723 724 unsigned blockSizeX = 16; 725 unsigned blockSizeY = 1; 726 unsigned blockSizeZ = 1; 727 unsigned gridSizeX = 1; 728 unsigned gridSizeY = 1; 729 unsigned gridSizeZ = 1; 730 731 // Kernel parameters 732 void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC }; 733 734 std::cout << "Launching kernel\n"; 735 736 // Kernel launch 737 checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ, 738 blockSizeX, blockSizeY, blockSizeZ, 739 0, NULL, KernelParams, NULL)); 740 741 // Retrieve device data 742 checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16)); 743 744 745 std::cout << "Results:\n"; 746 for (unsigned i = 0; i != 16; ++i) { 747 std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n"; 748 } 749 750 751 // Clean up after ourselves 752 delete [] hostA; 753 delete [] hostB; 754 delete [] hostC; 755 756 // Clean-up 757 checkCudaErrors(cuMemFree(devBufferA)); 758 checkCudaErrors(cuMemFree(devBufferB)); 759 checkCudaErrors(cuMemFree(devBufferC)); 760 checkCudaErrors(cuModuleUnload(cudaModule)); 761 checkCudaErrors(cuCtxDestroy(context)); 762 763 return 0; 764 } 765 766 767You will need to link with the CUDA driver and specify the path to cuda.h. 768 769.. code-block:: text 770 771 # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda 772 773We don't need to specify a path to ``libcuda.so`` since this is installed in a 774system location by the driver, not the CUDA toolkit. 775 776If everything goes as planned, you should see the following output when 777running the compiled program: 778 779.. code-block:: text 780 781 Using CUDA Device [0]: GeForce GTX 680 782 Device Compute Capability: 3.0 783 Launching kernel 784 Results: 785 0 + 0 = 0 786 1 + 2 = 3 787 2 + 4 = 6 788 3 + 6 = 9 789 4 + 8 = 12 790 5 + 10 = 15 791 6 + 12 = 18 792 7 + 14 = 21 793 8 + 16 = 24 794 9 + 18 = 27 795 10 + 20 = 30 796 11 + 22 = 33 797 12 + 24 = 36 798 13 + 26 = 39 799 14 + 28 = 42 800 15 + 30 = 45 801 802.. note:: 803 804 You will likely see a different device identifier based on your hardware 805 806 807Tutorial: Linking with Libdevice 808================================ 809 810In this tutorial, we show a simple example of linking LLVM IR with the 811libdevice library. We will use the same kernel as the previous tutorial, 812except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``. 813Libdevice provides an ``__nv_powf`` function that we will use. 814 815.. code-block:: llvm 816 817 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 818 target triple = "nvptx64-nvidia-cuda" 819 820 ; Intrinsic to read X component of thread ID 821 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 822 ; libdevice function 823 declare float @__nv_powf(float, float) 824 825 define void @kernel(float addrspace(1)* %A, 826 float addrspace(1)* %B, 827 float addrspace(1)* %C) { 828 entry: 829 ; What is my ID? 830 %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 831 832 ; Compute pointers into A, B, and C 833 %ptrA = getelementptr float addrspace(1)* %A, i32 %id 834 %ptrB = getelementptr float addrspace(1)* %B, i32 %id 835 %ptrC = getelementptr float addrspace(1)* %C, i32 %id 836 837 ; Read A, B 838 %valA = load float addrspace(1)* %ptrA, align 4 839 %valB = load float addrspace(1)* %ptrB, align 4 840 841 ; Compute C = pow(A, B) 842 %valC = call float @__nv_powf(float %valA, float %valB) 843 844 ; Store back to C 845 store float %valC, float addrspace(1)* %ptrC, align 4 846 847 ret void 848 } 849 850 !nvvm.annotations = !{!0} 851 !0 = metadata !{void (float addrspace(1)*, 852 float addrspace(1)*, 853 float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} 854 855 856To compile this kernel, we perform the following steps: 857 8581. Link with libdevice 8592. Internalize all but the public kernel function 8603. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0 8614. Optimize the linked module 8625. Codegen the module 863 864 865These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc`` 866tools. In a complete compiler, these steps can also be performed entirely 867programmatically by setting up an appropriate pass configuration (see 868:ref:`libdevice`). 869 870.. code-block:: text 871 872 # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc 873 # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc 874 # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx 875 876.. note:: 877 878 The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any 879 undefined variables will default to zero. It is shown here for evaluation 880 purposes. 881 882 883This gives us the following PTX (excerpt): 884 885.. code-block:: text 886 887 // 888 // Generated by LLVM NVPTX Back-End 889 // 890 891 .version 3.1 892 .target sm_20 893 .address_size 64 894 895 // .globl kernel 896 // @kernel 897 .visible .entry kernel( 898 .param .u64 kernel_param_0, 899 .param .u64 kernel_param_1, 900 .param .u64 kernel_param_2 901 ) 902 { 903 .reg .pred %p<30>; 904 .reg .f32 %f<111>; 905 .reg .s32 %r<21>; 906 .reg .s64 %rl<8>; 907 908 // BB#0: // %entry 909 ld.param.u64 %rl2, [kernel_param_0]; 910 mov.u32 %r3, %tid.x; 911 ld.param.u64 %rl3, [kernel_param_1]; 912 mul.wide.s32 %rl4, %r3, 4; 913 add.s64 %rl5, %rl2, %rl4; 914 ld.param.u64 %rl6, [kernel_param_2]; 915 add.s64 %rl7, %rl3, %rl4; 916 add.s64 %rl1, %rl6, %rl4; 917 ld.global.f32 %f1, [%rl5]; 918 ld.global.f32 %f2, [%rl7]; 919 setp.eq.f32 %p1, %f1, 0f3F800000; 920 setp.eq.f32 %p2, %f2, 0f00000000; 921 or.pred %p3, %p1, %p2; 922 @%p3 bra BB0_1; 923 bra.uni BB0_2; 924 BB0_1: 925 mov.f32 %f110, 0f3F800000; 926 st.global.f32 [%rl1], %f110; 927 ret; 928 BB0_2: // %__nv_isnanf.exit.i 929 abs.f32 %f4, %f1; 930 setp.gtu.f32 %p4, %f4, 0f7F800000; 931 @%p4 bra BB0_4; 932 // BB#3: // %__nv_isnanf.exit5.i 933 abs.f32 %f5, %f2; 934 setp.le.f32 %p5, %f5, 0f7F800000; 935 @%p5 bra BB0_5; 936 BB0_4: // %.critedge1.i 937 add.f32 %f110, %f1, %f2; 938 st.global.f32 [%rl1], %f110; 939 ret; 940 BB0_5: // %__nv_isinff.exit.i 941 942 ... 943 944 BB0_26: // %__nv_truncf.exit.i.i.i.i.i 945 mul.f32 %f90, %f107, 0f3FB8AA3B; 946 cvt.rzi.f32.f32 %f91, %f90; 947 mov.f32 %f92, 0fBF317200; 948 fma.rn.f32 %f93, %f91, %f92, %f107; 949 mov.f32 %f94, 0fB5BFBE8E; 950 fma.rn.f32 %f95, %f91, %f94, %f93; 951 mul.f32 %f89, %f95, 0f3FB8AA3B; 952 // inline asm 953 ex2.approx.ftz.f32 %f88,%f89; 954 // inline asm 955 add.f32 %f96, %f91, 0f00000000; 956 ex2.approx.f32 %f97, %f96; 957 mul.f32 %f98, %f88, %f97; 958 setp.lt.f32 %p15, %f107, 0fC2D20000; 959 selp.f32 %f99, 0f00000000, %f98, %p15; 960 setp.gt.f32 %p16, %f107, 0f42D20000; 961 selp.f32 %f110, 0f7F800000, %f99, %p16; 962 setp.eq.f32 %p17, %f110, 0f7F800000; 963 @%p17 bra BB0_28; 964 // BB#27: 965 fma.rn.f32 %f110, %f110, %f108, %f110; 966 BB0_28: // %__internal_accurate_powf.exit.i 967 setp.lt.f32 %p18, %f1, 0f00000000; 968 setp.eq.f32 %p19, %f3, 0f3F800000; 969 and.pred %p20, %p18, %p19; 970 @!%p20 bra BB0_30; 971 bra.uni BB0_29; 972 BB0_29: 973 mov.b32 %r9, %f110; 974 xor.b32 %r10, %r9, -2147483648; 975 mov.b32 %f110, %r10; 976 BB0_30: // %__nv_powf.exit 977 st.global.f32 [%rl1], %f110; 978 ret; 979 } 980 981