1# heapprofd: Android Heap Profiler 2 3_**Status:** COMPLETED **·** fmayer, primiano **·** 2018-06-15_ 4_**Updated:** 2020-04-20_ 5 6## Objective 7Provide a low-overhead native heap profiling mechanism, with C++ and Java callstack attribution, usable by all processes on an Android system. This includes Java and native services. The mechanism is capable of exporting heap dumps into traces in order to be able to correlate heap information with other activity on the system. This feature was added in the Android 10 release. 8 9## Overview 10![](/docs/images/heapprofd-design/Architecture.png) 11 12Implement an out-of-process heap profiler. Do the minimal amount of processing in-line of malloc, and then delegate to a central component for further processing. This introduces a new daemon _heapprofd_. 13 14When tracing is enabled, either via a system property or a signal delivered to an existing process, a given percentage of malloc calls copies the current call stack into a shared memory buffer that is received by heapprofd. heapprofd uses libunwindstack asynchronously for stack unwinding and symbolization. This information is used to build bookkeeping tables to track live allocations and is ultimately dumped into the Perfetto trace. 15 16All data referenced in this design document was collected on a Pixel 2 on Android P. 17 18 19### Requirements 20These are the properties that must be fulfilled by a heap profiler for Android: 21 22**No setup:** A heap profile can be taken with a single command. 23 24**Profile running apps:** The system can be used to enable profiling of already running apps to get information of memory usage _since the profiling was enabled_ without requiring a restart. This can be useful to track down memory leaks. 25 26**Attribute to Java methods:** The system can be used to track the native memory usage of Java applications. Allocations on the Java heap are outside of the scope of this work. 27 28**Zero overhead when disabled:** the system must not incur a performance overhead when it is not enabled. 29 30**Profile whole system:** the system must be capable of handling the load of profiling all processes running. The sample rate is adjusted to limit the amount of data. 31 32**Negligible in-process memory overhead:** the system must not hold bookkeeping data in the process in order not to inflate higher-level metrics like PSS. 33 34**Bounded performance impact:** the device must still be usable for all use-cases. 35 36 37## Detailed Design 38 39### Enabling profiling 40 41#### Use case 1: profiling future allocations from a running process 42One of the real-time signals ([`BIONIC_SIGNAL_PROFILER`](https://cs.android.com/android/platform/superproject/+/master:bionic/libc/platform/bionic/reserved_signals.h?q=symbol:BIONIC_SIGNAL_PROFILER)) is reserved in libc as a triggering mechanism. In this scenario: 43 44* heapprofd sends a RT signal to the target process 45* Upon receipt of the signal, bionic reacts by installing a temporary malloc hook, which in turn spawns a thread to dynamically load libheapprofd.so in the process context. This means heapprofd will not work for statically linked binaries, as they lack the ability to `dlopen`. We can not spawn the thread directly from the signal handler, as `pthread_create` is not async-safe. 46* The initializer in libheapprofd.so is called to take care of the rest (see [client operation](#client-operation-and-in-process-hooks) below) 47 48 49#### Use case 2: profiling a single process from startup 50* heapprofd sets a property of the form libc.debug.heapprofd.argv0 (argv0 being the first argument in `/proc/self/cmdline`, up to the first ":") 51* Native processes: when bionic is initialized checks for the presence of the property and, if found and matches the process name, loads the libheapprofd.so. 52* Managed java processes: zygote calls `mallopt(M_SET_ZYGOTE_CHILD, ...)` in `PreApplicationInit`. In this, Bionic checks for the presence of the property and, if found and matches the process name, loads the libheapprofd.so and continues as above. 53 54#### Use case 3: profiling the whole system 55A system property `libc.heapprofd.enable` can be set to enable heap profiling on startup. When this property is set every process on startup will load the libheapprofd.so library. The rest is identical to the case above. 56 57 58### Disabling profiling 59Disabling profiling happens simply by virtue of shutting down the sockets from the heapprofd end. Upon send() failure the client will uninstall the hooks (see appendix: thread-safe hooks setup / teardown) 60 61 62### Client operation and in process hooks 63Upon initialization of libheapprofd.so: 64 65* The client establishes a connection to the heapprofd daemon through a connected UNIX socket. 66* Upon connection, the daemon will send a packet to the client specifying the profiling configuration (sampling rate; sample all threads / only specific threads; tuning of sampling heuristics). It will also send an FD for the SharedMemoryBuffer used to send samples (see [wire protocol](heapprofd-wire-protocol.md)). 67* The malloc hooks are installed. 68 69Upon each `*alloc()/posix_memalign()` call, the client library will perform some minimal bookkeeping. If the sampling rate is hit, it will copy the raw stack, together with a header specifying register state, tid, a global sequence number of the operation, and size of the allocation to the shared memory buffer. It will then send on a control socket to wake up the service. 70 71Upon each `free()` call, the client will append the freed address into a global (process-wide) append-only buffer (the buffer is to avoid the overhead of a send() for each free). This buffer of free()s is sent to the heapprofd daemon when the fixed-size buffer is full or after a preset number of operations. This also includes a global sequence number for the operation. 72 73If the send() fails because the heapprofd has shut down the socket, voluntarily (graceful disabling) or involuntarily (has crashed) the client will teardown the hooks and disabling any profiling operation. 74 75 76### Service operation 77![](/docs/images/heapprofd-design/shmem-detail.png) 78 79The unwinder thread read the client's shared memory buffers and handle the samples received. The result of the unwinding is then enqueued using a PostTask for the main thread to do the accounting. A queue-based model between the threads is chosen because it makes synchronization easier. No synchronization is needed at all in the main thread, as the bookkeeping data will only be accessed by it. 80 81If the sample is a malloc, the stack is unwound and the resulting data is handled in the main thread. The main thread ignores mallocs with sequence numbers lower than the one already processed for this address. If the sample is a free, it is added to a buffer. As soon as all mallocs with a sequence number lower than the free have been handled, it is processed. 82 83 84#### Unwinding 85libunwindstack is used for unwinding. A new Memory class is implemented that overlays the copied stack over the process memory (which is accessed using FDMemory). FDMemory uses read on `/proc/self/mem` file descriptors sent by the target application. 86 87``` 88class StackMemory : public unwindstack::MemoryRemote { 89 public: 90 ... 91 size_t Read(uint64_t addr, void* dst, size_t size) override { 92 if (addr >= sp_ && addr + size <= stack_end_ && addr + size > sp_) { 93 size_t offset = static_cast<size_t>(addr - sp_); 94 memcpy(dst, stack_ + offset, size); 95 return size; 96 } 97 98 return mem_->Read(addr, dst, size); 99 } 100 101 private: 102 uint64_t sp_; 103 uint8_t* stack_; 104 size_t size_; 105}; 106``` 107 108This allows unwinding to work both for native code and all three execution modes of ART. Native libraries are mapped into the process memory, and ephemeral debug information written by ART is also accessible through the process memory. There is a chance that ART will garbage collect the information before the unwinding is done, in which case we will miss stack frames. As this is a sampling approach anyway, that loss of accuracy is acceptable. 109 110Remote unwinding also enables us to use _global caching_ (`Elf::SetCachingEnabled(true)`) in libunwindstack. This prevents debug information being used by different processes to be loaded and decompressed multiple times. 111 112We add an `FDMaps` objects to parse maps from `/proc/self/maps` sent by the target process. We keep `FDMaps` object cached per process that is being profiled. This both saves the overhead of text-parsing `/proc/[pid]/maps` as well as keeps various objects needed for unwinding (e.g. decompressed minidebuginfo). In case an unwind fails with `ERROR_INVALID_MAP` we reparse the maps object. We will do changes to libunwindstack to create a more general version of [`LocalUpdatableMaps`](https://cs.android.com/android/platform/superproject/+/master:system/unwinding/libunwindstack/Maps.cpp?q=symbol:LocalUpdatableMaps) that is also applicable for remote processes. 113 114 115#### Advantages of remote unwinding 116 117**Crash-proofness:** Crashing bugs in the bookkeeping logic or libunwindstack do not result in user-visible crashes but only lack of profiling data. It will result in the connections to heapprofd being broken and profiling gracefully stopping on the client-side. 118 119**Performance:** copying the stack has much more consistent and higher performance than unwinding, which can take multiple milliseconds. See graph above. 120 121**Does not inflate higher-level metrics:** higher-level metrics such as PSS are not inflated by the book-keeping cost. 122 123**Compression:** bookkeeping of unwound frames can be more efficient if it is shared between multiple processes. E.g. common sequences of frames (in libc, ART, etc) can be deduped. 124 125 126#### Disadvantages of remote unwinding 127**Complexity:** the system has higher complexity than unwinding and symbolizing synchronously. 128 129#### Bookkeeping 130The data is stored as a tree where each element has a back-pointer to its parent. This deduplicates repeated stack frames. String interning is applied for method names and library names. 131 132Details will change to adapt to data collected during the implementation. 133 134 135### Wire protocol 136In early versions of heapprofd, we used a `SOCK_STREAM` socket to send callstacks to the service. We now use a shared memory based [wire protocol](heapprofd-wire-protocol.md) described in detail separately. 137 138### Failure modes 139**heapprofd unwinding cannot keep up:** The shared memory buffer will reject new samples. If `block_client` is set, the client will retry until there is space in the shared memory buffer. 140 141**heapprofd crashes:** Writing on the control socket will fail, and the client will be torn down. 142 143**Writing in client fails:** If the write fails with any error code except `EINTR`, the connection is closed, and profiling is torn down. 144 145 146### Fork handling 147After a process forks, we need to clean up the state that was initialized by the parent process and uninstall the malloc hooks. We do not intend to currently support following forks, see [Alternatives Considered](#alternatives-considered) for possible implementation thereof. 148 149## Performance considerations 150 151### Remote unwinding 152_**Note:** This data was collected when heapprofd used a socket to communicate from client to service. We now use a shared-memory buffer, so we should be even lower overhead._ 153 154Remote unwinding is used to reduce the performance impact on the applications that are being profiled. After the stack has been sent, the application can resume its operation while the remote daemon unwinds the stack and does unwinding. As sending the stack is, on average, a faster operation than unwinding the stack, this results in a performance gain. 155 156 157<table> 158 <tr> 159 <td> 160 161![](/docs/images/heapprofd-design/Android-Heap0.png) 162 163 </td> 164 <td> 165 166![](/docs/images/heapprofd-design/Android-Heap1.png) 167 168 </td> 169 </tr> 170</table> 171 172**Mean unwind:** 413us 173**Mean send:** 50us 174**Median unwind:** 193us 175**Median send:** 14us 176**90 percentile unwind:** 715us 177**90 percentile send:** 40us 178 179 180### Sampling 181Unwinding the stack on every `malloc` call has a high cost that is not always worth paying. Thus malloc calls are sampled client-side using Poisson sampling with a probability proportional to their allocation size (i.e. larger allocations are more likely to be considered than small ones). All memory allocated since the last malloc considered is attributed to this allocation. 182 183The sampling rate is configurable as part of the initial handshake. A sampling rate == 1 will degenerate into the fully-accurate high-overhead mode. 184 185Prior art: [crbug.com/812262](http://crbug.com/812262), [crbug.com/803276](http://crbug.com/803276). 186 187## Implementation Plan 188### Implement prototype [done] 189Implement a prototype of the system described above that works with SELinux `setenforce 0` and running as root on walleye. 190 191### Implement Benchmark [done] 192Implement a program that executes malloc / free calls from ground truth data. Profile this program using heapprofd, then compare results to ground truth data. Use this to iterate on sampling heuristics. 193 194### Productionize [done] 195Do security changes required to run heapprofd with `setenforce 1` and as non-root. 196 197 198## Testing Plan 199 200* Employ fuzzing on the shared memory buffer. [done] 201* Unit-tests for components. [done] 202* CTS. [done] 203 204 205## Background 206 207### ART modes of execution 208ART (Android RunTime, the Android Java Runtime) has three different modes of execution. 209 210**Interpreted:** Java byte-code is interpreted during execution. Instrumentation in ART allows to get dexpc (~offset from dex file) for the code being executed. 211 212**JIT-compiled:** Java byte-code is compiled to native code during run-time. Both the code and the ELF information only live in process memory. The debug information is stored in a global variable, currently only if the app is debuggable or a global system property (`dalvik.vm.minidebuginfo`) is set. This is because the current implementation incurs a memory overhead that is too high to default enable. 213 214**AOT (ahead of time) compiled:** Java code is compiled into native code before run-time. This produces an .oat file, which is essentially an .so. Both code and ELF information is stored on disk. During execution, like shared native libraries, it is memory mapped into process memory. 215 216### Stack Unwinding 217Stack unwinding is the process of determining the chain of return addresses from the raw bytes of the stack. These are the addresses we want to attribute the allocated memory to. 218 219The most efficient way of stack unwinding is using frame pointers. This is unreliable on Android as we do not control build parameters for vendor libraries or OEM builds and due to issues on ARM32. Thus, our stack unwinding relies on libunwindstack which uses DWARF information from the library ELF files to determine return addresses. This can significantly slower, with unwinding of a stack taking between 100μs and ~100 ms ([data from simpleperf](https://gist.github.com/fmayer/a3a5a352196f9037f34241f8fb09004d)). 220 221[libunwindstack](https://cs.android.com/android/platform/superproject/+/master:system/unwinding/libunwindstack/) is Android's replacement for [libunwind](https://www.nongnu.org/libunwind/). It has a modern C++ object-oriented API surface and support for Android specific features allowing it to unwind mixed native and Java applications using information emitted by ART depending on execution mode. It also supports symbolization for native code and all three execution modes or ART. 222 223### Symbolization 224Symbolization is the process of determining function name and line number from a code address. For builds by Google, we can get symbolized binaries (i.e. binaries with an ELF section that can be used for symbolization) from go/ab or https://ci.android.com (e.g. https://ci.android.com/builds/submitted/6410994/aosp_cf_x86_phone-userdebug/latest/aosp_cf_x86_phone-symbols-6410994.zip). 225 226For other builds, symbolization requires debug info contained within the binary. This information is often compressed. Symbolization of JIT-ed code requires information contained in process memory. 227 228### Perfetto 229[Perfetto](https://perfetto.dev) is an open-source, highly efficient and expandable platform-wide tracing system that allows collection of performance data from kernel, apps and services. It aims to become the next-gen performance tracing mechanism for both Android and Chrome. 230 231 232## Related Work 233 234### simpleperf 235Even though [simpleperf](https://cs.android.com/android/platform/superproject/+/master:system/extras/simpleperf/doc/README.md) is a CPU rather than memory profiler, it is similar in nature to the work proposed here in that it supports offline unwinding. The kernel is asked to provide copies of stack traces at regular intervals, which are dumped onto disk. The dumped information is then used to unwind the stacks after the profiling is complete. 236 237 238### malloc-debug 239[malloc-debug](https://cs.android.com/android/platform/superproject/+/master:bionic/libc/malloc_debug/) instruments bionic's allocation functions to detect common memory problems like buffer overflows, double frees, etc. This is similar to the project described in this document as it uses the same mechanism to instrument the libc allocation functions. Unlike heapprofd, it does not provide the user with heap dumps. 240 241 242### Feature Matrix 243| | use after free detection | Java object graph attribution | native memory attribution | Android | out-of-process | 244|--------------|--------------------------|-------------------------------|---------------------------|---------|----------------| 245| heapprofd | no | no | yes | yes | yes | 246| malloc-debug | yes | no | yes | yes | no | 247 248## Alternatives Considered 249 250### Copy-on-write stack 251The lower frames of the stack are unlikely to change between the client sending and the server unwinding the stack information. We wanted to exploit this fact by marking the stack pages as copy-on-write by [`vmsplice(2)`](http://man7.org/linux/man-pages/man2/vmsplice.2.html)-ing them into a pipe. Unfortunately, the vmsplice system call does not mark pages as copy-on-write, but is conceptually a mmap into the pipe buffer, which causes the daemon to see changes to the stack that happen after the vmsplice and hence corrupt the unwinder. 252 253 254### Profiling across fork(2) 255If we want to enable profiling for the newly forked process, we need to establish new connections to heapprofd and create a new connection pool. This is to prevent messages from the parent and child process from being interleaved. 256 257For non-zygote processes, we could use [`pthread_atfork(3)`](http://man7.org/linux/man-pages/man3/pthread_atfork.3.html) to establish new connections. 258 259For zygote processes, [`FileDescriptorInfo::ReopenOrDetach`](https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/jni/fd_utils.cpp?q=%22void%20FileDescriptorInfo::ReopenOrDetach%22), which is called after `fork(2)`– and thus after the `pthread_atfork` handlers – detaches all sockets, i.e. replaces them with file descriptors to `/dev/null`. If the socket is not contained within [`kPathWhiteList`](https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/jni/fd_utils.cpp?q=symbol:kPathWhitelist), zygote crashes instead. Thus using only a `pthread_atfork` handler is not feasible, as the connections established within will immediately get disconnected in zygote children. 260 261After forking, zygote calls [`PreApplicationInit`](https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/jni/com_android_internal_os_Zygote.cpp?q=symbol:PreApplicationInit), which is currently used by malloc\_debug to detect whether it is in the root zygote or in a child process by setting `gMallocLeakZygoteChild`. It also calls [Java callbacks](https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/jni/com_android_internal_os_Zygote.cpp?q=CallStaticVoidMethod.*gCallPostForkChildHooks), but there does not seem to currently exist a way to dynamically register native callbacks. 262 263Naive lazy initialization (i.e. closing the socket in the atfork handler, and then reconnecting on the first call to malloc) is problematic, as the code in zygote between fork and `ReopenOrDetach` might call `malloc`, thus leading to the connection to be established, which then gets closed by `ReopenOrDetach` again. 264 265To solve this, we can take an approach similar to `gMallocLeakZygoteChild`. Before forking, zygote will be modified to set `gheapprofdInZygoteFork` to true, and after the fork handling is finished it will be set to false. This way we can make sure we delay the lazy initialization until the fork is fully complete. `pthread_atfork` is used to close the file descriptors after fork in the child. 266 267 268### Profiling app from startup by externally detecting startup 269This option relies on the ability of the tracing system to detect app startup (which we'll need regardless for perf profiling). 270 271**Advantages** 272* one case fewer to handle from the libc viewpoint 273 274**Disadvantages** 275* less accurate, will miss the first X ms of startup 276* a mechanism that watches ftrace events to detect startup is non-trivial. 277 278### Delayed Unwinding 279Anticipating that many allocations are short-lived, we can delay the unwinding of stacks by a fixed time. This is a memory vs CPU usage trade-off, as these stacks have to be held in memory until they are either unwound or freed. 280 281This graph shows that 20 % of allocations are freed within 900 sampled allocations (at 1 %, so 500000 total) from the same process. 282 283 284<table> 285 <tr> 286 <td> 287 288![](/docs/images/heapprofd-design/Android-Heap2.png) 289 290<p> 291<strong>Mean:</strong> 7000 allocations 292 </td> 293 <td> 294 295![](/docs/images/heapprofd-design/Android-Heap3.png) 296 297<p> 298<strong>Mean:</strong> 8950 bytes 299 </td> 300 </tr> 301</table> 302 303 304So, at 1 % sampling rate, for at a cost of ~8 megabytes (900 \* 8950) per process, we can reduce the number of unwinds by around 20 %. This will not allow us to get an accurate number of "allocated space", so this idea was rejected. 305