1
2# HotSort
3
4HotSort is a high-performance GPU-accelerated integer sorting library
5for Vulkan, CUDA and OpenCL compute APIs.
6
7HotSort's advantages include:
8
9* Ultra-fast sorting of 32‑bit or 64‑bit keys
10* Reaches peak throughput on small arrays
11* In-place sorting for low-memory environments
12* Strong scaling with number of multiprocessors
13* Low memory transactions relative to array size
14* A concurrency-friendly dense kernel grid
15* Support for GPU post-processing of sorted results
16
17HotSort is typically significantly faster than other GPU-accelerated
18implementations when sorting arrays of smaller than 500K-2M keys.
19
20## Benchmarks
21
22### Throughput
23
24Here is a throughput plot for HotSort on Vulkan and CUDA sorting
2532-bit and 64-bit keys on a 640-core Quadro M1200:
26
27![](images/hs_nvidia_sm35_u32_mkeys.svg)
28![](images/hs_nvidia_sm35_u64_mkeys.svg)
29
30HotSort throughput on Vulkan on an AMD V1807B APU (similar to a Ryzen 2400G) with DDR4-2666 RAM:
31
32![](images/hs_amd_gcn_mkeys.svg)
33
34HotSort throughput on Vulkan and OpenCL on an Intel HD 630:
35
36![](images/hs_intel_gen8_mkeys.svg)
37
38### Execution time
39
40Note that these sorting rates translate to sub-millisecond to
41multi-millisecond execution times on small GPUs:
42
43![](images/hs_nvidia_sm35_u32_msecs.svg)
44![](images/hs_nvidia_sm35_u64_msecs.svg)
45![](images/hs_amd_gcn_msecs.svg)
46![](images/hs_intel_gen8_msecs.svg)
47
48# Usage
49
50There are HotSort implementations for Vulkan, CUDA and OpenCL.
51
52Note that HotSort is a comparison sort and supports in-place sorting.
53
54There are also benchmarking examples for the
55[Vulkan](vk/bench/main.c), [CUDA](cuda/bench/main.c) and
56[OpenCL](cl/bench/main.c) implementations.
57
58*Not all targeted architectures have been tested.*
59
60## Vulkan
61
62The following architectures are supported:
63
64Vendor | Architecture                              | 32‑bit             | 64‑bit             | 32+32‑bit   | Notes
65-------|-------------------------------------------|:------------------:|:------------------:|:-----------:|------
66NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x:         | Not tested on all architectures
67NVIDIA | sm_30,sm_32,sm_53,sm_62                   | :x:                | :x:                | :x:         | Need to generate properly shaped kernels
68AMD    | GCN                                       | :white_check_mark: | :white_check_mark: | :x:         | Tested on Linux MESA 18.2
69Intel  | GEN8+                                     | :white_check_mark: | :white_check_mark: | :x:         | Good but the assumed *best-shaped* kernels aren't being used due to a compiler issue
70Intel  | APL/GLK using a 2x9 or 1x12 thread pool   | :x:                | :x:                | :x:         | Need to generate properly shaped kernels
71
72Add an arch-specific HotSort algorithm (aka "target") to your project
73by including a `.c` source and `.h` header file:
74
75Key Size | Source                                                                 | Header
76---------|------------------------------------------------------------------------|-------------------------------------------------------
7732‑bit   | ```hs/vk/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/vk/<vendor>/<arch>/u32/hs_target.h```
7864‑bit   | ```hs/vk/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/vk/<vendor>/<arch>/u64/hs_target.h```
79
80To sort `count` keys on Vulkan:
81
82```C
83#include "hs/vk/intel/gen8/u32/hs_target.h"
84
85...
86
87// create the Vulkan HotSort target
88struct hs_vk * hs = hs_vk_create(<address of target>,...);
89
90// allocate a descriptor set from a pool
91VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool);
92
93...
94
95// command buffer begin
96
97...
98
99// bind buffer(s) to a command buffer
100hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting
101
102// see how much padding may be required
103hs_vk_pad(hs,count,&count_padded_in,&count_padded_out);
104
105// append compute shaders to command buffer
106hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args
107
108...
109
110// command buffer end and queue submit
111
112...
113
114// release the Vulkan HotSort target
115hs_vk_release(hs,...);
116
117```
118
119The [`hs_vk.h`](vk/hs_vk.h) header file describes these functions in
120greater detail.
121
122## CUDA
123
124The following architectures are supported:
125
126Vendor | Architecture                                          | 32‑bit             | 64‑bit             | 32+32‑bit | Notes
127-------|-------------------------------------------------------|:------------------:|:------------------:|:---------:|------
128NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70             | :white_check_mark: | :white_check_mark: | :x:       |
129NVIDIA | sm_30,sm_32,sm_53,sm_62                               | :x:                | :x:                | :x:       | Need to generate properly shaped kernels
130
131Add an arch-specific HotSort target to your project by including a
132`.cu` CUDA source and `.h` header file:
133
134Key Size | Source                                          | Header
135---------|-------------------------------------------------|------------------------------------------
13632‑bit   | ```hs/cuda/sm_35/u32/hs_cuda_u32.cu``` | ```hs/cuda/sm_35/u32/hs_cuda.h```
13764‑bit   | ```hs/cuda/sm_35/u64/hs_cuda_u64.cu``` | ```hs/cuda/sm_35/u64/hs_cuda.h```
138
139Usage on CUDA is very simple.
140
141For example, to sort `count` keys:
142
143```C
144#include "hs/cuda/sm_35/u32/hs_cuda.h"
145
146...
147
148uint32_t count_padded_in, count_padded_out;
149
150hs_cuda_pad_u32(count,&count_padded_in,&count_padded_in);
151
152...
153
154hs_cuda_sort_u32(vin,vout, // or (vin,NULL,...) for in-place sorting
155                 count,count_padded_in,count_padded_out,
156                 true,
157                 stream0,stream1,stream2);
158```
159
160HotSort on CUDA requires two auxilary streams in order to maximize concurrency.
161
162The algorithm is guaranteed to complete on `stream0`.
163
164
165## OpenCL
166
167The following architectures are supported:
168
169Vendor | Architecture                            | 32‑bit             | 64‑bit             | 32+32‑bit | Notes
170-------|-----------------------------------------|:------------------:|:------------------:|:---------:|------
171Intel  | GEN8+                                   | :white_check_mark: | :white_check_mark: | :x:       | Due to a fragile compiler, the assumed best kernels are not being generated at this time
172Intel  | APL/GLK using a 2x9 or 1x12 thread pool | :x:                | :x:                | :x:       | Need to generate properly shaped kernels
173
174Add an arch-specific HotSort target to your project by including a
175`.c` source and `.h` header file:
176
177Key Size | Source                                                                 | Header
178---------|------------------------------------------------------------------------|-------------------------------------------------------
17932‑bit   | ```hs/cl/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/cl/<vendor>/<arch>/u32/hs_target.h```
18064‑bit   | ```hs/cl/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/cl/<vendor>/<arch>/u64/hs_target.h```
181
182Note that if the symbol `HS_DUMP_SOURCE` is not defined then the
183pre-compiled GEN9 binaries will be included.  These binaries may not
184be compatible with all GEN8+ devices and drivers.
185
186To sort `count` keys on OpenCL:
187
188```C
189// create the OpenCL HotSort target
190struct hs_cl * hs = hs_cl_create(<address of target>,...);
191
192...
193
194// see how much padding may be required
195hs_cl_pad(hs,count,&count_padded_in,&count_padded_out);
196
197// enqueue compute kernels
198hs_cl_sort(hs,cq,...,vin,vout,...); // or (...,vin,NULL,...) for in-place sorting
199
200...
201
202// release the OpenCL HotSort target
203hs_cl_release(hs,...);
204
205```
206
207The [`hs_cl.h`](cl/hs_cl.h) header file describes these functions in
208greater detail.
209
210# Background
211
212The HotSort sorting algorithm was created in 2012 and generalized in
2132015 to support GPUs that benefit from non-power-of-two workgroups.
214
215The objective of HotSort is to achieve high throughput as *early* as
216possible on small GPUs when sorting modestly-sized arrays ― 1,000s to
217100s of thousands of 64‑bit keys.
218
219HotSort uses both well-known and obscure properties of bitonic
220sequences to create a novel mapping of keys onto data-parallel devices
221like GPUs.
222
223## Overview
224
225The overall flow of the HotSort algorithm is below.  Kernel launches
226are in italics.
227
2281. For each workgroup of slabs:
229   1. For each slab in the workgroup:
230      1. *Slab Load*
231      1. *Slab Sort*
232   1. Until all slabs in the workgroup are merged:
233      1. *Multi-Slab Flip Merge*
234   1. *Slab Store*
2351. Until all slabs are merged:
236   1. *Streaming Flip Merge*
237   1. If necessary, *Streaming Half Merge*
238   1. If necessary, *Multi-Slab Half Merge*
239   1. If necessary, *Slab Half Merge*
240   1. If complete:
241      1. Optionally, *Report Key Changes*
242      1. Optionally, *Slab Transpose & Store*
243   1. Otherwise: *Slab Store*
2441. Done
245
246## Sorting
247
248The algorithm begins with a very *dense* per-multiprocessor block
249sorting algorithm that loads a "slab" of keys into a subgroup's
250registers, sorts the slabs, merges all slabs in the workgroup, and
251stores the slabs back to global memory.
252
253In the slab sorting phase, each lane of a subgroup executes a bitonic
254sorting network on its registers and successively merges lanes until
255the slab of registers is sorted in serpentine order.
256
257![](images/hs_sorted_slab.svg)
258
259## Merging
260
261HotSort has several different merging strategies.
262
263The merging kernels leverage the multiprocessor's register file by
264loading, merging and storing a large number of strided slab rows
265without using local memory.
266
267The merging kernels exploit the bitonic sequence property that
268interleaved subsequences of a bitonic sequence are also bitonic
269sequences.  This property also holds for non-power-of-two sequences.
270
271As an example, the *Streaming Flip Merge* kernel is illustrated below:
272
273![](images/hs_flip_merge.svg)
274
275# Future Enhancements
276
277## Hybrid improved merging
278
279HotSort's initial sorting and merging phases are performed on bitonic
280sequences.  Because of this, throughput decreases as the problem size
281increases.
282
283A hybrid algorithm that combined HotSort's block sorter and several
284rounds of merging with a state-of-the-art GPU merging algorithm would
285probably improve the algorithm's performance on larger arrays.
286
287## Reenable support for devices lacking shuffle functions
288
289The original version of HotSort ran on pre-Kepler GPUs without
290intra-warp/inter-lane shuffling ― reenable this capability.
291
292## Metal support
293
294Modify the HotSort generator to support Metal targets.
295