CUDA Module Introduction {#cuda_intro}
========================

General Information
-------------------

The OpenCV CUDA module is a set of classes and functions that utilize CUDA computational
capabilities. It is implemented using the NVIDIA\* CUDA\* Runtime API and supports only NVIDIA GPUs.
The module includes utility functions, low-level vision primitives, and high-level algorithms. The
utility functions and low-level primitives provide a powerful infrastructure for developing fast
vision algorithms that take advantage of CUDA, whereas the high-level functionality includes some
state-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others)
ready to be used by application developers.

The CUDA module is designed as a host-level API. This means that if you have pre-compiled OpenCV
CUDA binaries, you are not required to have the CUDA Toolkit installed or to write any extra code to
make use of CUDA.

The OpenCV CUDA module is designed for ease of use and does not require any knowledge of CUDA.
Such knowledge is nevertheless useful for handling non-trivial cases or achieving the highest
performance. It is helpful to understand the cost of various operations, what the GPU does, what the
preferred data formats are, and so on. The CUDA module is an effective instrument for quick
implementation of CUDA-accelerated computer vision algorithms. However, if your algorithm involves
many simple operations, then, for the best possible performance, you may still need to write your
own kernels to avoid extra write and read operations on the intermediate results.

To enable CUDA support, configure OpenCV using CMake with WITH\_CUDA=ON. When the flag is set and
CUDA is installed, the full-featured OpenCV CUDA module is built.
Otherwise, the module is still
built, but at runtime all functions from the module throw Exception with the CV\_GpuNotSupported
error code, except for cuda::getCudaEnabledDeviceCount(), which returns a GPU count of zero in
this case. Building OpenCV without CUDA support does not perform device code compilation, so it does
not require the CUDA Toolkit to be installed. Therefore, using the
cuda::getCudaEnabledDeviceCount() function, you can implement a high-level algorithm that detects
GPU presence at runtime and chooses an appropriate implementation (CPU or GPU) accordingly.

Compilation for Different NVIDIA\* Platforms
--------------------------------------------

The NVIDIA\* compiler can generate binary code (cubin and fatbin) and intermediate code (PTX).
Binary code often implies a specific GPU architecture and generation, so compatibility with
other GPUs is not guaranteed. PTX targets a virtual platform that is defined entirely by its
set of capabilities or features. Depending on the selected virtual platform, some
instructions are emulated or disabled, even if the real hardware supports all the features.

At the first call, the PTX code is compiled to binary code for the particular GPU using a JIT
compiler. When the target GPU has a compute capability (CC) lower than that of the PTX code, JIT
compilation fails. By default, the OpenCV CUDA module includes:

-   Binaries for compute capabilities 1.3 and 2.0 (controlled by CUDA\_ARCH\_BIN in CMake)
-   PTX code for compute capabilities 1.1 and 1.3 (controlled by CUDA\_ARCH\_PTX in CMake)

This means that for devices with CC 1.3 and 2.0, binary images are ready to run. For all newer
platforms, the PTX code for 1.3 is JIT-compiled to a binary image. For devices with CC 1.1 and 1.2,
the PTX for 1.1 is JIT-compiled. For devices with CC 1.0, no code is available and the functions
throw Exception. On platforms where JIT compilation is performed, the first run is slow.
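
The selection rules described above can be modelled in a few lines of plain C++. This is only an
illustrative sketch of the rules as stated in this section, not OpenCV's actual implementation:
the names `ComputeCapability`, `CodeSource`, `Selection`, and `selectCode` are hypothetical, and the
model assumes that a binary is used only on an exact CC match.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Compute capability as (major, minor); std::pair compares lexicographically.
using ComputeCapability = std::pair<int, int>;

enum class CodeSource { Binary, JitFromPtx, Unsupported };

struct Selection {
    CodeSource source;
    ComputeCapability version;  // meaningful unless source == Unsupported
};

// Hypothetical helper (not an OpenCV function): decides which embedded code
// a device would run, following the simplified rules from the text above.
Selection selectCode(ComputeCapability device,
                     const std::vector<ComputeCapability>& binaries,
                     const std::vector<ComputeCapability>& ptx)
{
    // 1. A binary built for exactly this CC is ready to run, no JIT needed.
    if (std::find(binaries.begin(), binaries.end(), device) != binaries.end())
        return {CodeSource::Binary, device};

    // 2. Otherwise take the highest PTX version that does not exceed the
    //    device's CC; PTX newer than the device cannot be JIT-compiled.
    bool found = false;
    ComputeCapability best{0, 0};
    for (const auto& p : ptx) {
        if (p <= device && (!found || p > best)) {
            best = p;
            found = true;
        }
    }
    if (found)
        return {CodeSource::JitFromPtx, best};

    // 3. Neither a matching binary nor usable PTX is available.
    return {CodeSource::Unsupported, {0, 0}};
}
```

With the default lists (binaries 1.3 and 2.0, PTX 1.1 and 1.3), calling `selectCode` for a CC 1.2
device selects JIT compilation from the 1.1 PTX, while a CC 1.0 device ends up unsupported.
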

On a GPU with CC 1.0, you can still compile the CUDA module, and most of the functions will run
without problems. To achieve this, add "1.0" to the list of binaries, for example,
CUDA\_ARCH\_BIN="1.0 1.3 2.0". The functions that cannot be run on CC 1.0 GPUs throw an exception.

You can always determine at runtime whether the OpenCV GPU-built binaries (or PTX code) are
compatible with your GPU. The function cuda::DeviceInfo::isCompatible returns the compatibility
status (true/false).

Utilizing Multiple GPUs
-----------------------

In the current version, each of the OpenCV CUDA algorithms can use only a single GPU. So, to utilize
multiple GPUs, you have to distribute the work between GPUs manually. The active device can be
switched using the cuda::setDevice() function. For more details, please read the CUDA C Programming
Guide.

While developing algorithms for multiple GPUs, keep the data-transfer overhead in mind. For
primitive functions and small images, it can be significant and may eliminate all the advantages of
having multiple GPUs. For high-level algorithms, however, consider using multi-GPU acceleration. For
example, the Stereo Block Matching algorithm has been successfully parallelized using the following
algorithm:

1. Split each image of the stereo pair into two horizontal overlapping stripes.
2. Process each pair of stripes (from the left and right images) on a separate Fermi\* GPU.
3. Merge the results into a single disparity map.

With this algorithm, a dual GPU gave a 180% performance increase compared to a single Fermi GPU.
For a source code example, see <https://github.com/Itseez/opencv/tree/master/samples/gpu/>.
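
The stripe-splitting step (step 1 above) can be sketched in plain C++ as follows. This is a minimal
illustration under stated assumptions, not the code from the linked sample: `RowRange` and
`splitIntoStripes` are hypothetical names, and the overlap width is a free parameter because this
section does not specify one.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open range of image rows: [begin, end).
struct RowRange {
    int begin;
    int end;
};

// Hypothetical helper: splits `rows` image rows into `parts` horizontal
// stripes and extends each stripe by `overlap` rows into its neighbours,
// clamped to the image bounds.
std::vector<RowRange> splitIntoStripes(int rows, int parts, int overlap)
{
    assert(rows > 0 && parts > 0 && parts <= rows && overlap >= 0);

    std::vector<RowRange> stripes;
    stripes.reserve(parts);
    for (int i = 0; i < parts; ++i) {
        // Non-overlapping core of stripe i; the cores tile the whole image.
        const int coreBegin = rows * i / parts;
        const int coreEnd = rows * (i + 1) / parts;
        // Extend by the overlap on both sides without leaving the image.
        stripes.push_back({std::max(0, coreBegin - overlap),
                           std::min(rows, coreEnd + overlap)});
    }
    return stripes;
}
```

For a 480-row image, `splitIntoStripes(480, 2, 16)` yields the row ranges [0, 256) and [224, 480).
Each range would then be processed on its own device (selected with cuda::setDevice()), and only
the non-overlapping core rows of each result would be copied into the merged disparity map.
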