1================ 2Initializer List 3================ 4This discussion took place in https://reviews.llvm.org/D35216 5"Escape symbols when creating std::initializer_list". 6 7It touches problems of modelling C++ standard library constructs in general, 8including modelling implementation-defined fields within C++ standard library 9objects, in particular constructing objects into pointers held by such fields, 10and separation of responsibilities between analyzer's core and checkers. 11 12**Artem:** 13 14I've seen a few false positives that appear because we construct 15C++11 std::initializer_list objects with brace initializers, and such 16construction is not properly modeled. For instance, if a new object is 17constructed on the heap only to be put into a brace-initialized STL container, 18the object is reported to be leaked. 19 20Approach (0): This can be trivially fixed by this patch, which causes pointers 21passed into initializer list expressions to immediately escape. 22 23This fix is overly conservative though. So i did a bit of investigation as to 24how model std::initializer_list better. 25 26According to the standard, ``std::initializer_list<T>`` is an object that has 27methods ``begin(), end(), and size()``, where ``begin()`` returns a pointer to continuous 28array of ``size()`` objects of type T, and end() is equal to begin() plus size(). 29The standard does hint that it should be possible to implement 30``std::initializer_list<T>`` as a pair of pointers, or as a pointer and a size 31integer, however specific fields that the object would contain are an 32implementation detail. 33 34Ideally, we should be able to model the initializer list's methods precisely. 35Or, at least, it should be possible to explain to the analyzer that the list 36somehow "takes hold" of the values put into it. Initializer lists can also be 37copied, which is a separate story that i'm not trying to address here. 38 39The obvious approach to modeling ``std::initializer_list`` in a checker would be to 40construct a SymbolMetadata for the memory region of the initializer list object, 41which would be of type ``T*`` and represent ``begin()``, so we'd trivially model ``begin()`` 42as a function that returns this symbol. The array pointed to by that symbol 43would be ``bindLoc()``ed to contain the list's contents (probably as a ``CompoundVal`` 44to produce less bindings in the store). Extent of this array would represent 45``size()`` and would be equal to the length of the list as written. 46 47So this sounds good, however apparently it does nothing to address our false 48positives: when the list escapes, our ``RegionStoreManager`` is not magically 49guessing that the metadata symbol attached to it, together with its contents, 50should also escape. In fact, it's impossible to trigger a pointer escape from 51within the checker. 52 53Approach (1): If only we enabled ``ProgramState::bindLoc(..., notifyChanges=true)`` 54to cause pointer escapes (not only region changes) (which sounds like the right 55thing to do anyway) such checker would be able to solve the false positives by 56triggering escapes when binding list elements to the list. However, it'd be as 57conservative as the current patch's solution. Ideally, we do not want escapes to 58happen so early. Instead, we'd prefer them to be delayed until the list itself 59escapes. 60 61So i believe that escaping metadata symbols whenever their base regions escape 62would be the right thing to do. Currently we didn't think about that because we 63had neither pointer-type metadatas nor non-pointer escapes. 64 65Approach (2): We could teach the Store to scan itself for bindings to 66metadata-symbolic-based regions during scanReachableSymbols() whenever a region 67turns out to be reachable. This requires no work on checker side, but it sounds 68performance-heavy. 69 70Approach (3): We could let checkers maintain the set of active metadata symbols 71in the program state (ideally somewhere in the Store, which sounds weird but 72causes the smallest amount of layering violations), so that the core knew what 73to escape. This puts a stress on the checkers, but with a smart data map it 74wouldn't be a problem. 75 76Approach (4): We could allow checkers to trigger pointer escapes in arbitrary 77moments. If we allow doing this within ``checkPointerEscape`` callback itself, we 78would be able to express facts like "when this region escapes, that metadata 79symbol attached to it should also escape". This sounds like an ultimate freedom, 80with maximum stress on the checkers - still not too much stress when we have 81smart data maps. 82 83I'm personally liking the approach (2) - it should be possible to avoid 84performance overhead, and clarity seems nice. 85 86**Gabor:** 87 88At this point, I am a bit wondering about two questions. 89 90* When should something belong to a checker and when should something belong to the engine? 91 Sometimes we model library aspects in the engine and model language constructs in checkers. 92 93* What is the checker programming model that we are aiming for? Maximum freedom or more easy checker development? 94 95I think if we aim for maximum freedom, we do not need to worry about the 96potential stress on checkers, and we can introduce abstractions to mitigate that 97later on. 98If we want to simplify the API, then maybe it makes more sense to move language 99construct modeling to the engine when the checker API is not sufficient instead 100of complicating the API. 101 102Right now I have no preference or objections between the alternatives but there 103are some random thoughts: 104 105* Maybe it would be great to have a guideline how to evolve the analyzer and 106 follow it, so it can help us to decide in similar situations 107 108* I do care about performance in this case. The reason is that we have a 109 limited performance budget. And I think we should not expect most of the checker 110 writers to add modeling of language constructs. So, in my opinion, it is ok to 111 have less nice/more verbose API for language modeling if we can have better 112 performance this way, since it only needs to be done once, and is done by the 113 framework developers. 114 115**Artem:** These are some great questions, i guess it'd be better to discuss 116them more openly. As a quick dump of my current mood: 117 118* To me it seems obvious that we need to aim for a checker API that is both 119 simple and powerful. This can probably by keeping the API as powerful as 120 necessary while providing a layer of simple ready-made solutions on top of it. 121 Probably a few reusable components for assembling checkers. And this layer 122 should ideally be pleasant enough to work with, so that people would prefer to 123 extend it when something is lacking, instead of falling back to the complex 124 omnipotent API. I'm thinking of AST matchers vs. AST visitors as a roughly 125 similar situation: matchers are not omnipotent, but they're so nice. 126 127* Separation between core and checkers is usually quite strange. Once we have 128 shared state traits, i generally wouldn't mind having region store or range 129 constraint manager as checkers (though it's probably not worth it to transform 130 them - just a mood). The main thing to avoid here would be the situation when 131 the checker overwrites stuff written by the core because it thinks it has a 132 better idea what's going on, so the core should provide a good default behavior. 133 134* Yeah, i totally care about performance as well, and if i try to implement 135 approach, i'd make sure it's good. 136 137**Artem:** 138 139> Approach (2): We could teach the Store to scan itself for bindings to 140> metadata-symbolic-based regions during scanReachableSymbols() whenever 141> a region turns out to be reachable. This requires no work on checker side, 142> but it sounds performance-heavy. 143 144Nope, this approach is wrong. Metadata symbols may become out-of-date: when the 145object changes, metadata symbols attached to it aren't changing (because symbols 146simply don't change). The same metadata may have different symbols to denote its 147value in different moments of time, but at most one of them represents the 148actual metadata value. So we'd be escaping more stuff than necessary. 149 150If only we had "ghost fields" 151(https://lists.llvm.org/pipermail/cfe-dev/2016-May/049000.html), it would have 152been much easier, because the ghost field would only contain the actual 153metadata, and the Store would always know about it. This example adds to my 154belief that ghost fields are exactly what we need for most C++ checkers. 155 156**Devin:** 157 158In this case, I would be fine with some sort of 159AbstractStorageMemoryRegion that meant "here is a memory region and somewhere 160reachable from here exists another region of type T". Or even multiple regions 161with different identifiers. This wouldn't specify how the memory is reachable, 162but it would allow for transfer functions to get at those regions and it would 163allow for invalidation. 164 165For ``std::initializer_list`` this reachable region would the region for the backing 166array and the transfer functions for begin() and end() yield the beginning and 167end element regions for it. 168 169In my view this differs from ghost variables in that (1) this storage does 170actually exist (it is just a library implementation detail where that storage 171lives) and (2) it is perfectly valid for a pointer into that storage to be 172returned and for another part of the program to read or write from that storage. 173(Well, in this case just read since it is allowed to be read-only memory). 174 175What I'm not OK with is modeling abstract analysis state (for example, the count 176of a NSMutableArray or the typestate of a file handle) as a value stored in some 177ginned up region in the store. This takes an easy problem that the analyzer does 178well at (modeling typestate) and turns it into a hard one that the analyzer is 179bad at (reasoning about the contents of the heap). 180 181I think the key criterion here is: "is the region accessible from outside the 182library". That is, does the library expose the region as a pointer that can be 183read to or written from in the client program? If so, then it makes sense for 184this to be in the store: we are modeling reachable storage as storage. But if 185we're just modeling arbitrary analysis facts that need to be invalidated when a 186pointer escapes then we shouldn't try to gin up storage for them just to get 187invalidation for free. 188 189**Artem:** 190 191> In this case, I would be fine with some sort of ``AbstractStorageMemoryRegion`` 192> that meant "here is a memory region and somewhere reachable from here exists 193> another region of type T". Or even multiple regions with different 194> identifiers. This wouldn't specify how the memory is reachable, but it would 195> allow for transfer functions to get at those regions and it would allow for 196> invalidation. 197 198Yeah, this is what we can easily implement now as a 199symbolic-region-based-on-a-metadata-symbol (though we can make a new region 200class for that if we eg. want it typed). The problem is that the relation 201between such storage region and its parent object region is essentially 202immaterial, similarly to the relation between ``SymbolRegionValue`` and its parent 203region. Region contents are mutable: today the abstract storage is reachable 204from its parent object, tomorrow it's not, and maybe something else becomes 205reachable, something that isn't even abstract. So the parent region for the 206abstract storage is most of the time at best a "nice to know" thing - we cannot 207rely on it to do any actual work. We'd anyway need to rely on the checker to do 208the job. 209 210> For std::initializer_list this reachable region would the region for the 211> backing array and the transfer functions for begin() and end() yield the 212> beginning and end element regions for it. 213 214So maybe in fact for std::initializer_list it may work fine because you cannot 215change the data after the object is constructed - so this region's contents are 216essentially immutable. For the future, i feel as if it is a dead end. 217 218I'd like to consider another funny example. Suppose we're trying to model 219 220.. code-block:: cpp 221 222 std::unique_ptr. Consider:: 223 224 void bar(const std::unique_ptr<int> &x); 225 226 void foo(std::unique_ptr<int> &x) { 227 int *a = x.get(); // (a, 0, direct): &AbstractStorageRegion 228 *a = 1; // (AbstractStorageRegion, 0, direct): 1 S32b 229 int *b = new int; 230 *b = 2; // (SymRegion{conj_$0<int *>}, 0 ,direct): 2 S32b 231 x.reset(b); // Checker map: x -> SymRegion{conj_$0<int *>} 232 bar(x); // 'a' doesn't escape (the pointer was unique), 'b' does. 233 clang_analyzer_eval(*a == 1); // Making this true is up to the checker. 234 clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker. 235 } 236 237The checker doesn't totally need to ensure that ``*a == 1`` passes - even though the 238pointer was unique, it could theoretically have ``.get()``-ed above and the code 239could of course break the uniqueness invariant (though we'd probably want it). 240The checker can say that "even if ``*a`` did escape, it was not because it was 241stuffed directly into bar()". 242 243The checker's direct responsibility, however, is to solve the ``*b == 2`` thing 244(which is in fact the problem we're dealing with in this patch - escaping the 245storage region of the object). 246 247So we're talking about one more operation over the program state (scanning 248reachable symbols and regions) that cannot work without checker support. 249 250We can probably add a new callback "checkReachableSymbols" to solve this. This 251is in fact also related to the dead symbols problem (we're scanning for live 252symbols in the store and in the checkers separately, but we need to do so 253simultaneously with a single worklist). Hmm, in fact this sounds like a good 254idea; we can replace checkLiveSymbols with checkReachableSymbols. 255 256Or we could just have ghost member variables, and no checker support required at 257all. For ghost member variables, the relation with their parent region (which 258would be their superregion) is actually useful, the mutability of their contents 259is expressed naturally, and the store automagically sees reachable symbols, live 260symbols, escapes, invalidations, whatever. 261 262> In my view this differs from ghost variables in that (1) this storage does 263> actually exist (it is just a library implementation detail where that storage 264> lives) and (2) it is perfectly valid for a pointer into that storage to be 265> returned and for another part of the program to read or write from that 266> storage. (Well, in this case just read since it is allowed to be read-only 267> memory). 268 269> What I'm not OK with is modeling abstract analysis state (for example, the 270> count of a NSMutableArray or the typestate of a file handle) as a value stored 271> in some ginned up region in the store.This takes an easy problem that the 272> analyzer does well at (modeling typestate) and turns it into a hard one that 273> the analyzer is bad at (reasoning about the contents of the heap). 274 275Yeah, i tend to agree on that. For simple typestates, this is probably an 276overkill, so let's definitely put aside the idea of "ghost symbolic regions" 277that i had earlier. 278 279But, to summarize a bit, in our current case, however, the typestate we're 280looking for is the contents of the heap. And when we try to model such 281typestates (complex in this specific manner, i.e. heap-like) in any checker, we 282have a choice between re-doing this modeling in every such checker (which is 283something analyzer is indeed good at, but at a price of making checkers heavy) 284or instead relying on the Store to do exactly what it's designed to do. 285 286> I think the key criterion here is: "is the region accessible from outside 287> the library". That is, does the library expose the region as a pointer that 288> can be read to or written from in the client program? If so, then it makes 289> sense for this to be in the store: we are modeling reachable storage as 290> storage. But if we're just modeling arbitrary analysis facts that need to be 291> invalidated when a pointer escapes then we shouldn't try to gin up storage 292> for them just to get invalidation for free. 293 294As a metaphor, i'd probably compare it to body farms - the difference between 295ghost member variables and metadata symbols seems to me like the difference 296between body farms and evalCall. Both are nice to have, and body farms are very 297pleasant to work with, even if not omnipotent. I think it's fine for a 298FunctionDecl's body in a body farm to have a local variable, even if such 299variable doesn't actually exist, even if it cannot be seen from outside the 300function call. I'm not seeing immediate practical difference between "it does 301actually exist" and "it doesn't actually exist, just a handy abstraction". 302Similarly, i think it's fine if we have a ``CXXRecordDecl`` with 303implementation-defined contents, and try to farm up a member variable as a handy 304abstraction (we don't even need to know its name or offset, only that it's there 305somewhere). 306 307**Artem:** 308 309We've discussed it in person with Devin, and he provided more points to think 310about: 311 312* If the initializer list consists of non-POD data, constructors of list's 313 objects need to take the sub-region of the list's region as this-region In the 314 current (v2) version of this patch, these objects are constructed elsewhere and 315 then trivial-copied into the list's metadata pointer region, which may be 316 incorrect. This is our overall problem with C++ constructors, which manifests in 317 this case as well. Additionally, objects would need to be constructed in the 318 analyzer's core, which would not be able to predict that it needs to take a 319 checker-specific region as this-region, which makes it harder, though it might 320 be mitigated by sharing the checker state traits. 321 322* Because "ghost variables" are not material to the user, we need to somehow 323 make super sure that they don't make it into the diagnostic messages. 324 325So, because this needs further digging into overall C++ support and rises too 326many questions, i'm delaying a better approach to this problem and will fall 327back to the original trivial patch. 328