===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document contains the user guides and the internals of compiling CUDA
C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
and developers who want to improve LLVM for GPUs. This document assumes a basic
familiarity with CUDA. Information about CUDA programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Build LLVM with CUDA Support
===================================

Below is a quick summary of downloading and building LLVM. Consult the `Getting
Started <http://llvm.org/docs/GettingStarted.html>`_ page for more details on
setting up LLVM.

#. Checkout LLVM

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Checkout Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Configure and build LLVM and Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ mkdir build
     $ cd build
     $ cmake [options] ..
     $ make

How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA Installation Guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.

Suppose you want to compile and run the following CUDA program (``axpy.cu``)
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <helper_cuda.h> // for checkCudaErrors

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
                               cudaMemcpyHostToDevice));

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    checkCudaErrors(cudaDeviceSynchronize());
    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
                               cudaMemcpyDeviceToHost));

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    checkCudaErrors(cudaDeviceReset());
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
samples installed for this example. ``<CUDA install path>`` is the root
directory where you installed the CUDA SDK, typically ``/usr/local/cuda``.
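
Depending on your clang version, you may also be able to point the compiler at
a non-default CUDA installation and select which GPU architecture to compile
for. The sketch below assumes a clang recent enough to support the
``--cuda-path`` and ``--cuda-gpu-arch`` flags, and a Kepler-class (``sm_35``)
card; adjust both to match your setup.

.. code-block:: console

  $ # Point clang at a non-default CUDA installation and compile for sm_35.
  $ clang++ -o axpy --cuda-path=/usr/local/cuda --cuda-gpu-arch=sm_35 \
      -I<CUDA install path>/samples/common/inc \
      -L<CUDA install path>/<lib64 or lib> \
      axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread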
Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction, out-of-order execution, and is
superscalar, whereas a typical GPU has none of these. Due to such differences,
an optimization pipeline well-tuned for CPUs may not be suitable for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code. Details can be found in the `design document for
  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.

* **Inferring memory spaces**. `This optimization
  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it. Details can be found in the `design
  document for memory space inference <https://goo.gl/5wH2Ct>`_.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on GPUs is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling and
  inlining; a short sketch of their use follows this list.

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
  mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* **Memory-space alias analysis**. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analyses.

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs because the hardware lacks a divide unit. Many of
  the 64-bit divides in our benchmarks have a divisor and dividend which fit in
  32 bits at runtime. This optimization provides a fast path for this common
  case.
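
As a concrete illustration of the unrolling pragmas and inlining attribute
mentioned in the list above, here is a minimal sketch. The ``twice`` helper and
``scale4`` kernel are hypothetical examples, not part of the AXPY program; the
pragma and attribute themselves are the ones linked above.

.. code-block:: c++

  // Force inlining of a small device helper; without the attribute, the
  // compiler's cost model decides whether to inline.
  // (twice and scale4 are hypothetical names for illustration only.)
  __attribute__((always_inline)) __device__ float twice(float x) {
    return 2.0f * x;
  }

  __global__ void scale4(float* v) {
    // Ask the compiler to fully unroll this fixed-trip-count loop.
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
      v[threadIdx.x * 4 + i] = twice(v[threadIdx.x * 4 + i]);
    }
  }

Forcing the unroll turns the loop body into straight-line code, which in turn
exposes the straight-line scalar optimizations described above.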