=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.



Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

   !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


.. _address_spaces:

Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.
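
For instance, a device-side read of one element of ``@g`` is an ordinary load
through an ``addrspace(1)`` pointer; a minimal sketch (the function name
``@read_g`` is illustrative):

.. code-block:: llvm

    define i32 @read_g(i32 %i) {
      ; Index into the global-space array and load one element
      %p = getelementptr [4 x i32] addrspace(1)* @g, i32 0, i32 %i
      %v = load i32 addrspace(1)* %p, align 4
      ret i32 %v
    }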

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is also the default
address space in LLVM, so an explicit ``addrspace(N)`` annotation is
*required* for global variables.


Triples
-------

The NVPTX target uses the module triple to select between 32- and 64-bit code
generation and to choose the driver interface used by the generated code. The
triple architecture can be one of ``nvptx`` (32-bit PTX) or ``nvptx64``
(64-bit PTX). The operating system should be one of ``cuda`` or ``nvcl``,
which determines the interface that the generated code uses to communicate
with the driver. Most users will want to use ``cuda`` as the operating
system, which makes the generated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``

Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``



.. _nvptx_intrinsics:

NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
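
As an illustration, here is a sketch of a device function that takes a
shared-space pointer, converts it to a generic pointer, and stores through it
(the function name is illustrative):

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)

    define void @store_generic(i8 addrspace(3)* %p, i8 %v) {
      ; Convert the shared-space pointer to a generic pointer ...
      %gen = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %p)
      ; ... and store through the generic address space
      store i8 %v, i8* %gen
      ret void
    }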


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the non-generic address space that the pointer
actually refers to is known.  It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
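
For the reverse direction, here is a sketch in which a generic pointer that is
known to address shared memory is converted back to a shared-space pointer
before the load (the function name is illustrative):

.. code-block:: llvm

    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)

    define i8 @load_shared(i8* %gen) {
      ; Only valid if %gen actually points into shared memory
      %p = call i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8* %gen)
      %v = load i8 addrspace(3)* %p, align 1
      ret i8 %v
    }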


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds.  These registers
map in the following way to CUDA builtins:

   ============= =====================================
   CUDA Builtin  PTX Special Register Intrinsic
   ============= =====================================
   ``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============= =====================================
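
As an example, these intrinsics can be combined to compute the canonical
one-dimensional global thread index (``blockIdx.x * blockDim.x + threadIdx.x``
in CUDA terms); a minimal sketch:

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_thread_id() {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
      ; id = ctaid.x * ntid.x + tid.x
      %base  = mul i32 %ctaid, %ntid
      %id    = add i32 %base, %tid
      ret i32 %id
    }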


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
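
A typical use is to separate writes and reads of shared memory, so that every
thread in the CTA observes the written values; a minimal sketch (the shared
array ``@smem`` and the function name are illustrative):

.. code-block:: llvm

  @smem = internal addrspace(3) global [64 x float] undef

  declare void @llvm.nvvm.barrier0()

  define float @exchange(i32 %i, i32 %j, float %v) {
    ; Each thread writes one slot of the shared array
    %wp = getelementptr [64 x float] addrspace(3)* @smem, i32 0, i32 %i
    store float %v, float addrspace(3)* %wp, align 4
    ; Wait until all threads in the CTA have written their slot
    call void @llvm.nvvm.barrier0()
    ; Now it is safe to read a slot written by another thread
    %rp = getelementptr [64 x float] addrspace(3)* @smem, i32 0, i32 %j
    %r = load float addrspace(3)* %rp, align 4
    ret float %r
  }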


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


.. _libdevice:

Linking with Libdevice
======================

The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
implements many common mathematical functions. This library can be used as a
high-performance math library by any compiler using the LLVM NVPTX target.
The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit, and
there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see
`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows this pattern:

.. code-block:: c++

  float my_function(float a) {
    if (__nvvm_reflect("FASTMATH"))
      return my_function_fast(a);
    else
      return my_function_precise(a);
  }

The default value for all unspecified reflection parameters is zero.
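
At the LLVM IR level, such a check is simply a call to the external
``@__nvvm_reflect`` function with a pointer to the parameter-name string,
which ``NVVMReflect`` folds to a constant. A sketch of what this looks like,
assuming the commonly used ``i32 (i8*)`` signature:

.. code-block:: llvm

  @.str = private unnamed_addr constant [9 x i8] c"FASTMATH\00"

  declare i32 @__nvvm_reflect(i8*)

  define i32 @fastmath_enabled() {
    %ptr = getelementptr [9 x i8]* @.str, i32 0, i32 0
    ; NVVMReflect replaces this call with the value bound to FASTMATH
    ; (0 if the parameter is unspecified)
    %v = call i32 @__nvvm_reflect(i8* %ptr)
    ret i32 %v
  }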

The ``NVVMReflect`` pass should be executed early in the optimization
pipeline, immediately after the link stage. The ``internalize`` pass is also
recommended to remove unused math functions from the resulting PTX. For an
input IR module ``module.bc``, the following compilation flow is recommended:

1. Save list of external functions in ``module.bc``
2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
3. Internalize all functions not in list from (1)
4. Eliminate all unused internal functions
5. Run ``NVVMReflect`` pass
6. Run standard optimization pipeline

.. note::

  ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
  libdevice functions. It is possible to link two IR modules that have been
  linked against libdevice using different reflection variables.

Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
often leave behind dead code of the form:

.. code-block:: llvm

  entry:
    ..
    br i1 true, label %foo, label %bar
  foo:
    ..
  bar:
    ; Dead code
    ..

Therefore, it is recommended that ``NVVMReflect`` is executed early in the
optimization pipeline, before dead-code elimination.


Reflection Parameters
---------------------

The libdevice library currently uses the following reflection parameters to
control code generation:

==================== ======================================================
Flag                 Description
==================== ======================================================
``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
==================== ======================================================


Invoking NVVMReflect
--------------------

To ensure that all dead code caused by the reflection pass is eliminated, it
is recommended that the reflection pass is executed early in the LLVM IR
optimization pipeline. The pass takes an optional mapping of reflection
parameter name to an integer value. This mapping can be specified as either a
command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when
programmatically creating a pass pipeline.

With ``opt``:

.. code-block:: text

  # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc


With programmatic pass pipeline:

.. code-block:: c++

  extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping);

  StringMap<int> ReflectParams;
  ReflectParams["__CUDA_FTZ"] = 1;
  Passes.add(createNVVMReflectPass(ReflectParams));



Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.


Common Issues
=============

ptxas complains of undefined function: __nvvm_reflect
-----------------------------------------------------

When linking with libdevice, the ``NVVMReflect`` pass must be used. See
:ref:`libdevice` for more information.


Tutorial: A Simple Compute Kernel
=================================

To start, let us take a look at a simple compute kernel written directly in
LLVM IR. The kernel implements vector addition, where each thread computes one
element of the output vector C from the input vectors A and B.  To make this
easier, we also assume that only a single CTA (thread block) will be launched,
and that it will be one dimensional.


The Kernel
----------

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float addrspace(1)* %ptrA, align 4
    %valB = load float addrspace(1)* %ptrB, align 4

    ; Compute C = A + B
    %valC = fadd float %valA, %valB

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = metadata !{void (float addrspace(1)*,
                        float addrspace(1)*,
                        float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}


We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:

.. code-block:: text

  # llc -mcpu=sm_20 kernel.ll -o kernel.ptx


.. note::

  If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
  in the module data layout string and use ``nvptx-nvidia-cuda`` as the
  target triple.


The output we get from ``llc`` (as of LLVM 3.4):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                        // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .f32   %f<4>;
    .reg .s32   %r<2>;
    .reg .s64   %rl<8>;

  // BB#0:                                // %entry
    ld.param.u64    %rl1, [kernel_param_0];
    mov.u32         %r1, %tid.x;
    mul.wide.s32    %rl2, %r1, 4;
    add.s64         %rl3, %rl1, %rl2;
    ld.param.u64    %rl4, [kernel_param_1];
    add.s64         %rl5, %rl4, %rl2;
    ld.param.u64    %rl6, [kernel_param_2];
    add.s64         %rl7, %rl6, %rl2;
    ld.global.f32   %f1, [%rl3];
    ld.global.f32   %f2, [%rl5];
    add.f32         %f3, %f1, %f2;
    st.global.f32   [%rl7], %f3;
    ret;
  }


Dissecting the Kernel
---------------------

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout
^^^^^^^^^^^

The data layout string determines the size in bits of common data types, their
ABI alignment, and their storage size.  For NVPTX, you should use one of the
following:

32-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"


Target Intrinsics
^^^^^^^^^^^^^^^^^

In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
read the X component of the current thread's ID, which corresponds to a read
of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
intrinsics.  A short list is shown below; please see
``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.


================================================ ====================
Intrinsic                                        CUDA Equivalent
================================================ ====================
``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
``void @llvm.nvvm.barrier0()``                   __syncthreads()
================================================ ====================


Address Spaces
^^^^^^^^^^^^^^

You may have noticed that all of the pointer types in the LLVM IR example had
an explicit address space specifier. What is address space 1? NVIDIA GPU
devices (generally) have four types of memory:

- Global: Large, off-chip memory
- Shared: Small, on-chip memory shared among all threads in a CTA
- Local: Per-thread, private memory
- Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces.
There is also a fifth address space used by the NVPTX code generator that
corresponds to the "generic" address space.  This address space can represent
addresses in any other address space (with a few exceptions).  This allows
users to write IR functions that can load/store memory using the same
instructions. Intrinsics are provided to convert pointers between the generic
and non-generic address spaces.

See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.


Kernel Metadata
^^^^^^^^^^^^^^^

In PTX, a function can be either a `kernel` function (callable from the host
program), or a `device` function (callable only from GPU code). You can think
of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
function as a `kernel` function, we make use of special LLVM metadata. The
NVPTX back-end will look for a named metadata node called
``nvvm.annotations``. This named metadata must contain a list of metadata that
describe the IR. For our purposes, we need to declare a metadata node that
assigns the "kernel" attribute to the LLVM IR function that should be emitted
as a PTX `kernel` function. These metadata nodes take the form:

.. code-block:: text

  metadata !{<function ref>, metadata !"kernel", i32 1}

For the previous example, we have:

.. code-block:: llvm

  !nvvm.annotations = !{!0}
  !0 = metadata !{void (float addrspace(1)*,
                        float addrspace(1)*,
                        float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}

Here, we have a single metadata declaration in ``nvvm.annotations``. This
metadata annotates our ``@kernel`` function with the ``kernel`` attribute.


Running the Kernel
------------------

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device? The CUDA Driver API provides a convenient mechanism for
loading PTX, JIT compiling it to native GPU machine code, and launching a
kernel. The API is similar to OpenCL.  A simple example showing how to load
and execute our vector addition code is shown below. Note that for brevity
this code does not perform much error checking!

.. note::

  You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
  compile PTX to machine code (SASS) for a specific GPU architecture. Such
  binaries can be loaded by the CUDA Driver API in the same way as PTX. This
  can be useful for reducing startup time by precompiling the PTX kernels.


.. code-block:: c++

  #include <iostream>
  #include <fstream>
  #include <cassert>
  #include "cuda.h"


  void checkCudaErrors(CUresult err) {
    assert(err == CUDA_SUCCESS);
  }

  /// main - Program entry point
  int main(int argc, char **argv) {
    CUdevice    device;
    CUmodule    cudaModule;
    CUcontext   context;
    CUfunction  function;
    CUlinkState linker;
    int         devCount;

    // CUDA initialization
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&devCount));
    checkCudaErrors(cuDeviceGet(&device, 0));

    char name[128];
    checkCudaErrors(cuDeviceGetName(name, 128, device));
    std::cout << "Using CUDA Device [0]: " << name << "\n";

    int devMajor, devMinor;
    checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
    std::cout << "Device Compute Capability: "
              << devMajor << "." << devMinor << "\n";
    if (devMajor < 2) {
      std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
      return 1;
    }

    std::ifstream t("kernel.ptx");
    if (!t.is_open()) {
      std::cerr << "kernel.ptx not found\n";
      return 1;
    }
    std::string str((std::istreambuf_iterator<char>(t)),
                      std::istreambuf_iterator<char>());

    // Create driver context
    checkCudaErrors(cuCtxCreate(&context, 0, device));

    // Create module for object
    checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

    // Get kernel function
    checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

    // Device data
    CUdeviceptr devBufferA;
    CUdeviceptr devBufferB;
    CUdeviceptr devBufferC;

    checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

    float* hostA = new float[16];
    float* hostB = new float[16];
    float* hostC = new float[16];

    // Populate input
    for (unsigned i = 0; i != 16; ++i) {
      hostA[i] = (float)i;
      hostB[i] = (float)(2*i);
      hostC[i] = 0.0f;
    }

    checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
    checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));


    unsigned blockSizeX = 16;
    unsigned blockSizeY = 1;
    unsigned blockSizeZ = 1;
    unsigned gridSizeX  = 1;
    unsigned gridSizeY  = 1;
    unsigned gridSizeZ  = 1;

    // Kernel parameters
    void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

    std::cout << "Launching kernel\n";

    // Kernel launch
    checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                   blockSizeX, blockSizeY, blockSizeZ,
                                   0, NULL, KernelParams, NULL));

    // Retrieve device data
    checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));


    std::cout << "Results:\n";
    for (unsigned i = 0; i != 16; ++i) {
      std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
    }


    // Clean up after ourselves
    delete [] hostA;
    delete [] hostB;
    delete [] hostC;

    // Clean-up
    checkCudaErrors(cuMemFree(devBufferA));
    checkCudaErrors(cuMemFree(devBufferB));
    checkCudaErrors(cuMemFree(devBufferC));
    checkCudaErrors(cuModuleUnload(cudaModule));
    checkCudaErrors(cuCtxDestroy(context));

    return 0;
  }


You will need to link with the CUDA driver and specify the path to cuda.h.

.. code-block:: text

  # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

  Using CUDA Device [0]: GeForce GTX 680
  Device Compute Capability: 3.0
  Launching kernel
  Results:
  0 + 0 = 0
  1 + 2 = 3
  2 + 4 = 6
  3 + 6 = 9
  4 + 8 = 12
  5 + 10 = 15
  6 + 12 = 18
  7 + 14 = 21
  8 + 16 = 24
  9 + 18 = 27
  10 + 20 = 30
  11 + 22 = 33
  12 + 24 = 36
  13 + 26 = 39
  14 + 28 = 42
  15 + 30 = 45

.. note::

  You will likely see a different device identifier based on your hardware.


Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  ; libdevice function
  declare float @__nv_powf(float, float)

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float addrspace(1)* %ptrA, align 4
    %valB = load float addrspace(1)* %ptrB, align 4

    ; Compute C = pow(A, B)
    %valC = call float @__nv_powf(float %valA, float %valB)

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = metadata !{void (float addrspace(1)*,
                        float addrspace(1)*,
                        float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module


These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

  # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
  # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
  # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

  The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
  any undefined variables will default to zero. It is shown here for
  illustrative purposes.


This gives us the following PTX (excerpt):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                        // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .pred  %p<30>;
    .reg .f32   %f<111>;
    .reg .s32   %r<21>;
    .reg .s64   %rl<8>;

  // BB#0:                                // %entry
    ld.param.u64  %rl2, [kernel_param_0];
    mov.u32   %r3, %tid.x;
    ld.param.u64  %rl3, [kernel_param_1];
    mul.wide.s32  %rl4, %r3, 4;
    add.s64   %rl5, %rl2, %rl4;
    ld.param.u64  %rl6, [kernel_param_2];
    add.s64   %rl7, %rl3, %rl4;
    add.s64   %rl1, %rl6, %rl4;
    ld.global.f32   %f1, [%rl5];
    ld.global.f32   %f2, [%rl7];
    setp.eq.f32 %p1, %f1, 0f3F800000;
    setp.eq.f32 %p2, %f2, 0f00000000;
    or.pred   %p3, %p1, %p2;
    @%p3 bra  BB0_1;
    bra.uni   BB0_2;
  BB0_1:
    mov.f32   %f110, 0f3F800000;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_2:                                  // %__nv_isnanf.exit.i
    abs.f32   %f4, %f1;
    setp.gtu.f32  %p4, %f4, 0f7F800000;
    @%p4 bra  BB0_4;
  // BB#3:                                // %__nv_isnanf.exit5.i
    abs.f32   %f5, %f2;
    setp.le.f32 %p5, %f5, 0f7F800000;
    @%p5 bra  BB0_5;
  BB0_4:                                  // %.critedge1.i
    add.f32   %f110, %f1, %f2;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_5:                                  // %__nv_isinff.exit.i

    ...

  BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
    mul.f32   %f90, %f107, 0f3FB8AA3B;
    cvt.rzi.f32.f32 %f91, %f90;
    mov.f32   %f92, 0fBF317200;
    fma.rn.f32  %f93, %f91, %f92, %f107;
    mov.f32   %f94, 0fB5BFBE8E;
    fma.rn.f32  %f95, %f91, %f94, %f93;
    mul.f32   %f89, %f95, 0f3FB8AA3B;
    // inline asm
    ex2.approx.ftz.f32 %f88,%f89;
    // inline asm
    add.f32   %f96, %f91, 0f00000000;
    ex2.approx.f32  %f97, %f96;
    mul.f32   %f98, %f88, %f97;
    setp.lt.f32 %p15, %f107, 0fC2D20000;
    selp.f32  %f99, 0f00000000, %f98, %p15;
    setp.gt.f32 %p16, %f107, 0f42D20000;
    selp.f32  %f110, 0f7F800000, %f99, %p16;
    setp.eq.f32 %p17, %f110, 0f7F800000;
    @%p17 bra   BB0_28;
  // BB#27:
    fma.rn.f32  %f110, %f110, %f108, %f110;
  BB0_28:                                 // %__internal_accurate_powf.exit.i
    setp.lt.f32 %p18, %f1, 0f00000000;
    setp.eq.f32 %p19, %f3, 0f3F800000;
    and.pred    %p20, %p18, %p19;
    @!%p20 bra  BB0_30;
    bra.uni   BB0_29;
  BB0_29:
    mov.b32    %r9, %f110;
    xor.b32   %r10, %r9, -2147483648;
    mov.b32    %f110, %r10;
  BB0_30:                                 // %__nv_powf.exit
    st.global.f32   [%rl1], %f110;
    ret;
  }
