1DataFlowSanitizer Design Document
2=================================
3
4This document sets out the design for DataFlowSanitizer, a general
5dynamic data flow analysis.  Unlike other Sanitizer tools, this tool is
6not designed to detect a specific class of bugs on its own. Instead,
7it provides a generic dynamic data flow analysis framework to be used
8by clients to help detect application-specific issues within their
9own code.
10
11DataFlowSanitizer is a program instrumentation which can associate
12a number of taint labels with any data stored in any memory region
13accessible by the program. The analysis is dynamic, which means that
14it operates on a running program, and tracks how the labels propagate
15through that program. The tool shall support a large (>100) number
16of labels, such that programs which operate on large numbers of data
17items may be analysed with each data item being tracked separately.
18
19Use Cases
20---------
21
22This instrumentation can be used as a tool to help monitor how data
23flows from a program's inputs (sources) to its outputs (sinks).
24This has applications from a privacy/security perspective in that
25one can audit how a sensitive data item is used within a program and
26ensure it isn't exiting the program anywhere it shouldn't be.
27
28Interface
29---------
30
31A number of functions are provided which will create taint labels,
32attach labels to memory regions and extract the set of labels
33associated with a specific memory region. These functions are declared
34in the header file ``sanitizer/dfsan_interface.h``.
35
36.. code-block:: c
37
38  /// Creates and returns a base label with the given description and user data.
39  dfsan_label dfsan_create_label(const char *desc, void *userdata);
40
41  /// Sets the label for each address in [addr,addr+size) to \c label.
42  void dfsan_set_label(dfsan_label label, void *addr, size_t size);
43
44  /// Sets the label for each address in [addr,addr+size) to the union of the
45  /// current label for that address and \c label.
46  void dfsan_add_label(dfsan_label label, void *addr, size_t size);
47
48  /// Retrieves the label associated with the given data.
49  ///
50  /// The type of 'data' is arbitrary.  The function accepts a value of any type,
51  /// which can be truncated or extended (implicitly or explicitly) as necessary.
52  /// The truncation/extension operations will preserve the label of the original
53  /// value.
54  dfsan_label dfsan_get_label(long data);
55
56  /// Retrieves a pointer to the dfsan_label_info struct for the given label.
57  const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label);
58
59  /// Returns whether the given label label contains the label elem.
60  int dfsan_has_label(dfsan_label label, dfsan_label elem);
61
62  /// If the given label label contains a label with the description desc, returns
63  /// that label, else returns 0.
64  dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc);
65
66Taint label representation
67--------------------------
68
69As stated above, the tool must track a large number of taint
70labels. This poses an implementation challenge, as most multiple-label
71tainting systems assign one label per bit to shadow storage, and
72union taint labels using a bitwise or operation. This will not scale
73to clients which use hundreds or thousands of taint labels, as the
74label union operation becomes O(n) in the number of supported labels,
75and data associated with it will quickly dominate the live variable
76set, causing register spills and hampering performance.
77
78Instead, a low overhead approach is proposed which is best-case O(log\
79:sub:`2` n) during execution. The underlying assumption is that
80the required space of label unions is sparse, which is a reasonable
81assumption to make given that we are optimizing for the case where
82applications mostly copy data from one place to another, without often
83invoking the need for an actual union operation. The representation
84of a taint label is a 16-bit integer, and new labels are allocated
85sequentially from a pool. The label identifier 0 is special, and means
86that the data item is unlabelled.
87
88When a label union operation is requested at a join point (any
89arithmetic or logical operation with two or more operands, such as
90addition), the code checks whether a union is required, whether the
91same union has been requested before, and whether one union label
92subsumes the other. If so, it returns the previously allocated union
93label. If not, it allocates a new union label from the same pool used
94for new labels.
95
96Specifically, the instrumentation pass will insert code like this
97to decide the union label ``lu`` for a pair of labels ``l1``
98and ``l2``:
99
100.. code-block:: c
101
102  if (l1 == l2)
103    lu = l1;
104  else
105    lu = __dfsan_union(l1, l2);
106
107The equality comparison is outlined, to provide an early exit in
108the common cases where the program is processing unlabelled data, or
109where the two data items have the same label.  ``__dfsan_union`` is
110a runtime library function which performs all other union computation.
111
112Further optimizations are possible, for example if ``l1`` is known
113at compile time to be zero (e.g. it is derived from a constant),
114``l2`` can be used for ``lu``, and vice versa.
115
116Memory layout and label management
117----------------------------------
118
119The following is the current memory layout for Linux/x86\_64:
120
121+---------------+---------------+--------------------+
122|    Start      |    End        |        Use         |
123+===============+===============+====================+
124| 0x700000008000|0x800000000000 | application memory |
125+---------------+---------------+--------------------+
126| 0x200200000000|0x700000008000 |       unused       |
127+---------------+---------------+--------------------+
128| 0x200000000000|0x200200000000 |    union table     |
129+---------------+---------------+--------------------+
130| 0x000000010000|0x200000000000 |   shadow memory    |
131+---------------+---------------+--------------------+
132| 0x000000000000|0x000000010000 | reserved by kernel |
133+---------------+---------------+--------------------+
134
135Each byte of application memory corresponds to two bytes of shadow
136memory, which are used to store its taint label. As for LLVM SSA
137registers, we have not found it necessary to associate a label with
138each byte or bit of data, as some other tools do. Instead, labels are
139associated directly with registers.  Loads will result in a union of
140all shadow labels corresponding to bytes loaded (which most of the
141time will be short circuited by the initial comparison) and stores will
142result in a copy of the label to the shadow of all bytes stored to.
143
144Propagating labels through arguments
145------------------------------------
146
147In order to propagate labels through function arguments and return values,
148DataFlowSanitizer changes the ABI of each function in the translation unit.
149There are currently two supported ABIs:
150
151* Args -- Argument and return value labels are passed through additional
152  arguments and by modifying the return type.
153
154* TLS -- Argument and return value labels are passed through TLS variables
155  ``__dfsan_arg_tls`` and ``__dfsan_retval_tls``.
156
157The main advantage of the TLS ABI is that it is more tolerant of ABI mismatches
158(TLS storage is not shared with any other form of storage, whereas extra
159arguments may be stored in registers which under the native ABI are not used
160for parameter passing and thus could contain arbitrary values).  On the other
161hand the args ABI is more efficient and allows ABI mismatches to be more easily
162identified by checking for nonzero labels in nominally unlabelled programs.
163
164Implementing the ABI list
165-------------------------
166
167The `ABI list <DataFlowSanitizer.html#abi-list>`_ provides a list of functions
168which conform to the native ABI, each of which is callable from an instrumented
169program.  This is implemented by replacing each reference to a native ABI
170function with a reference to a function which uses the instrumented ABI.
171Such functions are automatically-generated wrappers for the native functions.
172For example, given the ABI list example provided in the user manual, the
173following wrappers will be generated under the args ABI:
174
175.. code-block:: llvm
176
177    define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) {
178    entry:
179      %2 = call i8* @malloc(i64 %0)
180      %3 = insertvalue { i8*, i16 } undef, i8* %2, 0
181      %4 = insertvalue { i8*, i16 } %3, i16 0, 1
182      ret { i8*, i16 } %4
183    }
184
185    define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) {
186    entry:
187      %2 = call i32 @tolower(i32 %0)
188      %3 = insertvalue { i32, i16 } undef, i32 %2, 0
189      %4 = insertvalue { i32, i16 } %3, i16 %1, 1
190      ret { i32, i16 } %4
191    }
192
193    define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) {
194    entry:
195      %labelreturn = alloca i16
196      %6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn)
197      %7 = load i16* %labelreturn
198      %8 = insertvalue { i8*, i16 } undef, i8* %6, 0
199      %9 = insertvalue { i8*, i16 } %8, i16 %7, 1
200      ret { i8*, i16 } %9
201    }
202
203As an optimization, direct calls to native ABI functions will call the
204native ABI function directly and the pass will compute the appropriate label
205internally.  This has the advantage of reducing the number of union operations
206required when the return value label is known to be zero (i.e. ``discard``
207functions, or ``functional`` functions with known unlabelled arguments).
208
209Checking ABI Consistency
210------------------------
211
212DFSan changes the ABI of each function in the module.  This makes it possible
213for a function with the native ABI to be called with the instrumented ABI,
214or vice versa, thus possibly invoking undefined behavior.  A simple way
215of statically detecting instances of this problem is to prepend the prefix
216"dfs$" to the name of each instrumented-ABI function.
217
218This will not catch every such problem; in particular function pointers passed
219across the instrumented-native barrier cannot be used on the other side.
220These problems could potentially be caught dynamically.
221