<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>&lt;atomic&gt; design</title>
  <link type="text/css" rel="stylesheet" href="menu.css">
  <link type="text/css" rel="stylesheet" href="content.css">
</head>

<body>
<div id="menu">
  <div>
    <a href="https://llvm.org/">LLVM Home</a>
  </div>

  <div class="submenu">
    <label>libc++ Info</label>
    <a href="/index.html">About</a>
  </div>

  <div class="submenu">
    <label>Quick Links</label>
    <a href="https://lists.llvm.org/mailman/listinfo/cfe-dev">cfe-dev</a>
    <a href="https://lists.llvm.org/mailman/listinfo/cfe-commits">cfe-commits</a>
    <a href="https://bugs.llvm.org/">Bug Reports</a>
    <a href="https://github.com/llvm/llvm-project/tree/master/libcxx/">Browse Sources</a>
  </div>
</div>

<div id="content">
  <!--*********************************************************************-->
  <h1>&lt;atomic&gt; design</h1>
  <!--*********************************************************************-->

<p>
The <tt>&lt;atomic&gt;</tt> header is one of the headers most closely coupled to
the compiler.  Ideally, invoking any function from <tt>&lt;atomic&gt;</tt>
should result in highly optimized assembly being inserted directly into your
application: assembly that is not otherwise representable by higher level C or
C++ expressions.  The design of the libc++ <tt>&lt;atomic&gt;</tt> header
started with this goal in mind.  A secondary, but still very important, goal is
that the compiler should have to do minimal work to facilitate the
implementation of <tt>&lt;atomic&gt;</tt>.  Without this second goal, the
libc++ <tt>&lt;atomic&gt;</tt> header would, practically speaking, be doomed to
remain a barely supported, second-class citizen on almost every platform.
</p>

<p>Goals:</p>

<blockquote><ul>
<li>Optimal code generation for atomic operations</li>
<li>Minimal effort for the compiler to achieve goal 1 on any given platform</li>
<li>Conformance to the C++0X draft standard</li>
</ul></blockquote>

<p>
The purpose of this document is to inform compiler writers what they need to do
to enable a high performance libc++ <tt>&lt;atomic&gt;</tt> with minimal effort.
</p>

<h2>The minimal work that must be done for a conforming <tt>&lt;atomic&gt;</tt></h2>

<p>
The only "atomic" operations that must actually be lock-free in
<tt>&lt;atomic&gt;</tt> are represented by the following compiler intrinsics:
</p>

<blockquote><pre>
__atomic_flag__
__atomic_exchange_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    __atomic_flag__ result = *obj;
    *obj = desr;
    return result;
}

void
__atomic_store_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    *obj = desr;
}
</pre></blockquote>

<p>
Where:
</p>

<blockquote><ul>
<li>
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 1 in the preprocessor,
then the compiler must define <tt>__atomic_flag__</tt> (e.g. as a typedef to
<tt>int</tt>).
</li>
<li>
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 0 in the preprocessor,
then the library defines <tt>__atomic_flag__</tt> as a typedef to <tt>bool</tt>.
</li>
<li>
<p>
To communicate that the above intrinsics are available, the compiler must
arrange for <tt>__has_feature</tt> to return 1 when fed the intrinsic name
appended with an '_' and the mangled type name of <tt>__atomic_flag__</tt>.
</p>
<p>
For example, if <tt>__atomic_flag__</tt> is <tt>unsigned int</tt>:
</p>
<blockquote><pre>
__has_feature(__atomic_flag) == 1
__has_feature(__atomic_exchange_seq_cst_j) == 1
__has_feature(__atomic_store_seq_cst_j) == 1

typedef unsigned int __atomic_flag__;

unsigned int __atomic_exchange_seq_cst(unsigned int volatile*, unsigned int)
{
   // ...
}

void __atomic_store_seq_cst(unsigned int volatile*, unsigned int)
{
   // ...
}
</pre></blockquote>
</li>
</ul></blockquote>

<p>
That's it!  Compiler writers do the above and you've got a fully conforming
(though with sub-par performance) <tt>&lt;atomic&gt;</tt> header!
</p>
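
<p>
For illustration, here is a minimal sketch of how the library can then build
<tt>atomic_flag</tt> on top of just these two operations.  The member names,
the omission of the <tt>memory_order</tt> parameters, and the layout are
assumptions made for this sketch, not the actual libc++ source:
</p>

<blockquote><pre>
struct atomic_flag
{
    __atomic_flag__ __flg_;

    bool test_and_set() volatile
    {
        // returns true if the flag was already set
        return __atomic_exchange_seq_cst(&amp;__flg_, __atomic_flag__(true))
                   != __atomic_flag__();
    }

    void clear() volatile
    {
        __atomic_store_seq_cst(&amp;__flg_, __atomic_flag__(false));
    }
};
</pre></blockquote>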

<h2>Recommended work for a higher performance <tt>&lt;atomic&gt;</tt></h2>

<p>
It would be good if the above intrinsics worked with all integral types plus
<tt>void*</tt>.  Because this may not be possible to do in a lock-free manner for
all integral types on all platforms, a compiler must communicate each type that
an intrinsic works with.  For example, if <tt>__atomic_exchange_seq_cst</tt>
works for all types except <tt>long long</tt> and <tt>unsigned long long</tt>,
then:
</p>

<blockquote><pre>
__has_feature(__atomic_exchange_seq_cst_b) == 1  // bool
__has_feature(__atomic_exchange_seq_cst_c) == 1  // char
__has_feature(__atomic_exchange_seq_cst_a) == 1  // signed char
__has_feature(__atomic_exchange_seq_cst_h) == 1  // unsigned char
__has_feature(__atomic_exchange_seq_cst_Ds) == 1 // char16_t
__has_feature(__atomic_exchange_seq_cst_Di) == 1 // char32_t
__has_feature(__atomic_exchange_seq_cst_w) == 1  // wchar_t
__has_feature(__atomic_exchange_seq_cst_s) == 1  // short
__has_feature(__atomic_exchange_seq_cst_t) == 1  // unsigned short
__has_feature(__atomic_exchange_seq_cst_i) == 1  // int
__has_feature(__atomic_exchange_seq_cst_j) == 1  // unsigned int
__has_feature(__atomic_exchange_seq_cst_l) == 1  // long
__has_feature(__atomic_exchange_seq_cst_m) == 1  // unsigned long
__has_feature(__atomic_exchange_seq_cst_Pv) == 1 // void*
</pre></blockquote>

<p>
Note that only the <tt>__has_feature</tt> flag is decorated with the argument
type.  The name of the compiler intrinsic is not decorated, but instead works
like a C++ overloaded function.
</p>
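
<p>
That is, for the example above the compiler would expose a single undecorated
name with one overload per advertised type, along the lines of this sketch:
</p>

<blockquote><pre>
bool         __atomic_exchange_seq_cst(bool volatile*, bool);
char         __atomic_exchange_seq_cst(char volatile*, char);
int          __atomic_exchange_seq_cst(int volatile*, int);
unsigned int __atomic_exchange_seq_cst(unsigned int volatile*, unsigned int);
void*        __atomic_exchange_seq_cst(void* volatile*, void*);
// ... and so on, but only for the types advertised via __has_feature
</pre></blockquote>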

<p>
Additionally, there are other intrinsics besides
<tt>__atomic_exchange_seq_cst</tt> and <tt>__atomic_store_seq_cst</tt>.  They
are optional, but if the compiler can generate faster code than the library
provides, clients will benefit from the compiler writer's expertise and
knowledge of the targeted platform.
</p>

<p>
Below is the complete list of <i>sequentially consistent</i> intrinsics, and
their library implementations.  Template syntax is used to indicate the desired
overloading for integral and void* types.  The template does not represent a
requirement that the intrinsic operate on <em>any</em> type!
</p>

<blockquote><pre>
T is one of:  bool, char, signed char, unsigned char, short, unsigned short,
              int, unsigned int, long, unsigned long,
              long long, unsigned long long, char16_t, char32_t, wchar_t, void*

template &lt;class T&gt;
T
__atomic_load_seq_cst(T const volatile* obj)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    return *obj;
}

template &lt;class T&gt;
void
__atomic_store_seq_cst(T volatile* obj, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    *obj = desr;
}

template &lt;class T&gt;
T
__atomic_exchange_seq_cst(T volatile* obj, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj = desr;
    return r;
}

template &lt;class T&gt;
bool
__atomic_compare_exchange_strong_seq_cst_seq_cst(T volatile* obj, T* exp, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    if (std::memcmp(const_cast&lt;T*&gt;(obj), exp, sizeof(T)) == 0)
    {
        std::memcpy(const_cast&lt;T*&gt;(obj), &amp;desr, sizeof(T));
        return true;
    }
    std::memcpy(exp, const_cast&lt;T*&gt;(obj), sizeof(T));
    return false;
}

template &lt;class T&gt;
bool
__atomic_compare_exchange_weak_seq_cst_seq_cst(T volatile* obj, T* exp, T desr)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    if (std::memcmp(const_cast&lt;T*&gt;(obj), exp, sizeof(T)) == 0)
    {
        std::memcpy(const_cast&lt;T*&gt;(obj), &amp;desr, sizeof(T));
        return true;
    }
    std::memcpy(exp, const_cast&lt;T*&gt;(obj), sizeof(T));
    return false;
}

T is one of:  char, signed char, unsigned char, short, unsigned short,
              int, unsigned int, long, unsigned long,
              long long, unsigned long long, char16_t, char32_t, wchar_t

template &lt;class T&gt;
T
__atomic_fetch_add_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj += operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_sub_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj -= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_and_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj &amp;= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_or_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj |= operand;
    return r;
}

template &lt;class T&gt;
T
__atomic_fetch_xor_seq_cst(T volatile* obj, T operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    T r = *obj;
    *obj ^= operand;
    return r;
}

void*
__atomic_fetch_add_seq_cst(void* volatile* obj, ptrdiff_t operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    void* r = *obj;
    (char*&amp;)(*obj) += operand;
    return r;
}

void*
__atomic_fetch_sub_seq_cst(void* volatile* obj, ptrdiff_t operand)
{
    unique_lock&lt;mutex&gt; _(some_mutex);
    void* r = *obj;
    (char*&amp;)(*obj) -= operand;
    return r;
}

void __atomic_thread_fence_seq_cst()
{
    unique_lock&lt;mutex&gt; _(some_mutex);
}

void __atomic_signal_fence_seq_cst()
{
    unique_lock&lt;mutex&gt; _(some_mutex);
}
</pre></blockquote>

<p>
One should consult the (currently draft)
<a href="https://wg21.link/n3126">C++ standard</a>
for the details of the definitions for these operations.  For example
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> is allowed to fail
spuriously while <tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> is
not.
</p>
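
<p>
To see why spurious failure is tolerable, note that typical callers of the weak
form retry in a loop.  A sketch (the helper name is made up for this example)
built only on the intrinsics listed above:
</p>

<blockquote><pre>
// Increment an atomic integer using only the weak compare-exchange.
template &lt;class T&gt;
void
__sketch_fetch_increment(T volatile* obj)
{
    T expected = __atomic_load_seq_cst(obj);
    // On a spurious failure (or a lost race), 'expected' is refreshed with the
    // current value and the loop simply tries again.
    while (!__atomic_compare_exchange_weak_seq_cst_seq_cst(obj, &amp;expected,
                                                           T(expected + 1)))
        ;
}
</pre></blockquote>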

<p>
If on your platform the lock-free definition of
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> would be the same as
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt>, you may omit the
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> intrinsic without a
performance cost.  The library will prefer your implementation of
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> over its own
definition for implementing
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt>.  That is, the library
will arrange for <tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> to call
<tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> if you supply an
intrinsic for the strong version but not the weak.
</p>
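
<p>
In preprocessor terms, that arrangement looks roughly like the following
sketch, shown here for <tt>int</tt> only (the feature-test spellings follow the
mangling convention described earlier):
</p>

<blockquote><pre>
#if !__has_feature(__atomic_compare_exchange_weak_seq_cst_seq_cst_i) &amp;&amp; \
     __has_feature(__atomic_compare_exchange_strong_seq_cst_seq_cst_i)
// No weak intrinsic advertised: forward the weak form to the strong one.
inline bool
__atomic_compare_exchange_weak_seq_cst_seq_cst(int volatile* obj, int* exp, int desr)
{
    return __atomic_compare_exchange_strong_seq_cst_seq_cst(obj, exp, desr);
}
#endif
</pre></blockquote>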

<h2>Taking advantage of weaker memory synchronization</h2>

<p>
So far all of the intrinsics presented require a <em>sequentially
consistent</em> memory ordering.  That is, no loads or stores can move across
the operation (just as if the library had locked that internal mutex).  But
<tt>&lt;atomic&gt;</tt> supports weaker memory ordering operations.  In all,
there are six memory orderings (listed here from strongest to weakest):
</p>

<blockquote><pre>
memory_order_seq_cst
memory_order_acq_rel
memory_order_release
memory_order_acquire
memory_order_consume
memory_order_relaxed
</pre></blockquote>

<p>
(See the
<a href="https://wg21.link/n3126">C++ standard</a>
for the detailed definitions of each of these orderings).
</p>

<p>
On some platforms, the compiler vendor can offer some or even all of the above
intrinsics at one or more weaker levels of memory synchronization.  This might
allow the compiler, for example, to avoid issuing an <tt>mfence</tt> instruction
on x86.
</p>
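
<p>
For example (an x86 sketch; actual code generation is entirely up to the
compiler), the sequentially consistent store needs a full fence while its
weaker counterparts, introduced below, do not:
</p>

<blockquote><pre>
__atomic_store_seq_cst(&amp;x, 1);   // typically: mov [x], 1 ; mfence   (or an xchg)
__atomic_store_release(&amp;x, 1);   // typically: mov [x], 1
</pre></blockquote>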

<p>
If the compiler does not offer any given operation, at any given memory ordering
level, the library will automatically attempt to call the next highest memory
ordering operation.  This continues up to <tt>seq_cst</tt>, and if that doesn't
exist, then the library takes over and does the job with a <tt>mutex</tt>.  This
is a compile-time search &amp; selection operation.  At run time, the
application will only see the few inlined assembly instructions for the selected
intrinsic.
</p>

<p>
Each intrinsic is appended with the 7-letter name of the memory ordering it
addresses.  For example a <tt>load</tt> with <tt>relaxed</tt> ordering is
defined by:
</p>

<blockquote><pre>
T __atomic_load_relaxed(const volatile T* obj);
</pre></blockquote>

<p>
And announced with:
</p>

<blockquote><pre>
__has_feature(__atomic_load_relaxed_b) == 1  // bool
__has_feature(__atomic_load_relaxed_c) == 1  // char
__has_feature(__atomic_load_relaxed_a) == 1  // signed char
...
</pre></blockquote>
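
<p>
Putting these pieces together, here is a sketch of the compile-time search
described above, for a relaxed store of <tt>int</tt> (the wrapper name is made
up for this example; the feature-test spellings follow the convention above):
</p>

<blockquote><pre>
void
__sketch_store_relaxed(int volatile* obj, int desr)
{
#if __has_feature(__atomic_store_relaxed_i)
    __atomic_store_relaxed(obj, desr);      // the exact ordering requested
#elif __has_feature(__atomic_store_release_i)
    __atomic_store_release(obj, desr);      // next stronger ordering available
#elif __has_feature(__atomic_store_seq_cst_i)
    __atomic_store_seq_cst(obj, desr);      // strongest ordering
#else
    unique_lock&lt;mutex&gt; _(some_mutex); // the library's mutex-based fallback
    *obj = desr;
#endif
}
</pre></blockquote>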

<p>
The <tt>__atomic_compare_exchange_strong(weak)</tt> intrinsics are parameterized
on two memory orderings.  The first ordering applies when the operation returns
<tt>true</tt> and the second ordering applies when the operation returns
<tt>false</tt>.
</p>
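
<p>
For example, following the naming scheme above, a strong compare-exchange whose
successful ordering is <tt>acq_rel</tt> and whose failure ordering is
<tt>acquire</tt> would be spelled:
</p>

<blockquote><pre>
template &lt;class T&gt;
bool
__atomic_compare_exchange_strong_acq_rel_acquire(T volatile* obj, T* exp, T desr);
</pre></blockquote>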

<p>
Not every memory ordering is appropriate for every operation.  <tt>exchange</tt>
and the <tt>fetch_<i>op</i></tt> operations support all 6.  But <tt>load</tt>
only supports <tt>relaxed</tt>, <tt>consume</tt>, <tt>acquire</tt> and
<tt>seq_cst</tt>.  <tt>store</tt> only supports <tt>relaxed</tt>,
<tt>release</tt>, and <tt>seq_cst</tt>.  The <tt>compare_exchange</tt>
operations support the following 16 combinations out of the possible 36:
</p>

<blockquote><pre>
relaxed_relaxed
consume_relaxed
consume_consume
acquire_relaxed
acquire_consume
acquire_acquire
release_relaxed
release_consume
release_acquire
acq_rel_relaxed
acq_rel_consume
acq_rel_acquire
seq_cst_relaxed
seq_cst_consume
seq_cst_acquire
seq_cst_seq_cst
</pre></blockquote>

<p>
Again, the compiler supplies intrinsics only for the strongest orderings where
it can make a difference.  The library takes care of calling the weakest
supplied intrinsic that is as strong or stronger than the customer asked for.
</p>

</div>
</body>
</html>