CUDA Module Introduction {#cuda_intro}
========================

General Information
-------------------

The OpenCV CUDA module is a set of classes and functions to utilize CUDA computational capabilities.
It is implemented using the NVIDIA\* CUDA\* Runtime API and supports only NVIDIA GPUs. The OpenCV
CUDA module includes utility functions, low-level vision primitives, and high-level algorithms. The
utility functions and low-level primitives provide a powerful infrastructure for developing fast
vision algorithms taking advantage of CUDA, whereas the high-level functionality includes some
state-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others)
ready to be used by application developers.

The CUDA module is designed as a host-level API. This means that if you have pre-compiled OpenCV
CUDA binaries, you are not required to have the CUDA Toolkit installed or write any extra code to
make use of CUDA.

The OpenCV CUDA module is designed for ease of use and does not require any knowledge of CUDA.
However, such knowledge is certainly useful for handling non-trivial cases or achieving the highest
performance. It is helpful to understand the cost of various operations, what the GPU does, what the
preferred data formats are, and so on. The CUDA module is an effective instrument for quick
implementation of CUDA-accelerated computer vision algorithms. However, if your algorithm involves
many simple operations, then, for the best possible performance, you may still need to write your
own kernels to avoid extra write and read operations on the intermediate results.

To enable CUDA support, configure OpenCV using CMake with WITH\_CUDA=ON. When the flag is set and
CUDA is installed, the full-featured OpenCV CUDA module is built. Otherwise, the module is still
built, but at runtime all functions from the module throw Exception with the CV\_GpuNotSupported
error code, except for cuda::getCudaEnabledDeviceCount(), which returns a GPU count of zero in
this case. Building OpenCV without CUDA support does not perform device code compilation, so it does
not require the CUDA Toolkit to be installed. Therefore, using the cuda::getCudaEnabledDeviceCount()
function, you can implement a high-level algorithm that detects GPU presence at runtime and
chooses an appropriate implementation (CPU or GPU) accordingly.
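The runtime check described above can be sketched as follows. This is a minimal illustration, not OpenCV code: `Backend`, `chooseBackend`, and `backendName` are hypothetical names, and the sketch takes the device count as a plain integer so that it compiles without OpenCV. In an application, the argument would be the value returned by cuda::getCudaEnabledDeviceCount() (declared in `<opencv2/core/cuda.hpp>`). Treating negative values as "no usable GPU" is an assumption of this sketch.

```cpp
#include <string>

// Hypothetical tag for the two code paths; this is not an OpenCV type.
enum class Backend { Cpu, Gpu };

// Picks a code path from a CUDA device count, for example the value
// returned by cv::cuda::getCudaEnabledDeviceCount().
// A count of zero means no CUDA support is available. Assumption: any
// negative value is also treated as "no usable GPU".
inline Backend chooseBackend(int cudaDeviceCount)
{
    return cudaDeviceCount > 0 ? Backend::Gpu : Backend::Cpu;
}

// Human-readable name of the selected code path, useful for logging.
inline std::string backendName(Backend backend)
{
    return backend == Backend::Gpu ? "GPU" : "CPU";
}
```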

Compilation for Different NVIDIA\* Platforms
--------------------------------------------

The NVIDIA\* compiler can generate binary code (cubin and fatbin) and intermediate code (PTX).
Binary code often implies a specific GPU architecture and generation, so compatibility with
other GPUs is not guaranteed. PTX is targeted for a virtual platform that is defined entirely by the
set of capabilities or features. Depending on the selected virtual platform, some of the
instructions are emulated or disabled, even if the real hardware supports all the features.

At the first call, the PTX code is compiled to binary code for the particular GPU using a JIT
compiler. When the target GPU has a compute capability (CC) lower than the one the PTX code
targets, JIT compilation fails. By default, the OpenCV CUDA module includes:

-   Binaries for compute capabilities 1.3 and 2.0 (controlled by CUDA\_ARCH\_BIN in CMake)
-   PTX code for compute capabilities 1.1 and 1.3 (controlled by CUDA\_ARCH\_PTX in CMake)

This means that for devices with CC 1.3 and 2.0, binary images are ready to run. For all newer
platforms, the PTX code for 1.3 is JIT'ed to a binary image. For devices with CC 1.1 and 1.2, the
PTX for 1.1 is JIT'ed. For devices with CC 1.0, no code is available and the functions throw
Exception. On platforms where JIT compilation is performed, the first run is slow.
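The selection rules in the paragraph above can be modelled with a small sketch. It is a simplified illustration of the text, not the logic OpenCV or the CUDA runtime actually uses: all names (`CC`, `CodeKind`, `CodeChoice`, `selectCode`, `defaultBins`, `defaultPtx`) are hypothetical, and a binary is assumed to match only a device with exactly the same compute capability, whereas the real runtime is more permissive.

```cpp
#include <utility>
#include <vector>

// Compute capability as (major, minor); std::pair compares lexicographically.
using CC = std::pair<int, int>;

enum class CodeKind { Binary, JitFromPtx, None };

struct CodeChoice {
    CodeKind kind;
    CC target;  // CC the chosen code was built for; unused when kind == None
};

// Defaults quoted in the text (CUDA_ARCH_BIN and CUDA_ARCH_PTX).
inline std::vector<CC> defaultBins() { return {{1, 3}, {2, 0}}; }
inline std::vector<CC> defaultPtx() { return {{1, 1}, {1, 3}}; }

// Simplified model: use a binary built for exactly this CC if one exists;
// otherwise JIT the newest PTX whose CC does not exceed the device's CC;
// otherwise report that no code is available.
inline CodeChoice selectCode(CC device,
                             const std::vector<CC>& bins,
                             const std::vector<CC>& ptx)
{
    for (const CC& b : bins)
        if (b == device)
            return {CodeKind::Binary, b};

    bool found = false;
    CC best{0, 0};
    for (const CC& p : ptx) {
        if (p <= device && (!found || p > best)) {
            best = p;
            found = true;
        }
    }
    return found ? CodeChoice{CodeKind::JitFromPtx, best}
                 : CodeChoice{CodeKind::None, CC{0, 0}};
}
```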

On a GPU with CC 1.0, you can still compile the CUDA module and most of the functions will run
flawlessly. To achieve this, add "1.0" to the list of binaries, for example,
CUDA\_ARCH\_BIN="1.0 1.3 2.0". The functions that cannot be run on CC 1.0 GPUs throw an exception.

You can always determine at runtime whether the OpenCV GPU-built binaries (or PTX code) are
compatible with your GPU. The function cuda::DeviceInfo::isCompatible returns the compatibility
status (true/false).

Utilizing Multiple GPUs
-----------------------

In the current version, each of the OpenCV CUDA algorithms can use only a single GPU. So, to utilize
multiple GPUs, you have to manually distribute the work between GPUs. You can switch the active
device using the cuda::setDevice() function. For more details, see the CUDA C Programming Guide.

While developing algorithms for multiple GPUs, keep the data-transfer overhead in mind. For
primitive functions and small images, it can be significant, which may eliminate all the advantages
of having multiple GPUs. But for high-level algorithms, consider using multi-GPU acceleration. For
example, the Stereo Block Matching algorithm has been successfully parallelized as follows:

1.  Split each image of the stereo pair into two horizontal overlapping stripes.
2.  Process each pair of stripes (from the left and right images) on a separate Fermi\* GPU.
3.  Merge the results into a single disparity map.

With this algorithm, a dual GPU gave a 180% performance increase compared to a single Fermi GPU.
For a source code example, see <https://github.com/Itseez/opencv/tree/master/samples/gpu/>.
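The stripe split in step 1 above could look like the following sketch. It is an illustration only and is not taken from the linked sample: the `Stripe` struct, the `splitIntoStripes` function, and the `overlap` parameter are hypothetical, and the source does not state how many rows the stripes overlap, so that value is left to the caller.

```cpp
#include <algorithm>
#include <vector>

// Half-open row range [begin, end) of one horizontal stripe.
struct Stripe {
    int begin;
    int end;
};

// Splits `rows` image rows into `parts` horizontal stripes of near-equal
// height, then extends each stripe by `overlap` rows on every side that
// borders another stripe. Ranges are clamped to [0, rows].
inline std::vector<Stripe> splitIntoStripes(int rows, int parts, int overlap)
{
    std::vector<Stripe> stripes;
    if (rows <= 0 || parts <= 0 || overlap < 0)
        return stripes;  // nothing sensible to split

    for (int i = 0; i < parts; ++i) {
        // 64-bit intermediate avoids overflow for very tall images.
        const int begin = static_cast<int>(static_cast<long long>(rows) * i / parts);
        const int end = static_cast<int>(static_cast<long long>(rows) * (i + 1) / parts);
        stripes.push_back({std::max(0, begin - overlap), std::min(rows, end + overlap)});
    }
    return stripes;
}
```

For instance, `splitIntoStripes(480, 2, 16)` yields the row ranges [0, 256) and [224, 480). Each range would then be processed on its own device after a call to cuda::setDevice(), and the overlapping rows would be discarded when merging the disparity maps.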