1# Buffers and dataflow 2 3This page describes the dataflow in Perfetto when recording traces. It describes 4all the buffering stages, explains how to size the buffers and how to debug 5data losses. 6 7## Concepts 8 9Tracing in Perfetto is an asynchronous multiple-writer single-reader pipeline. 10In many senses, its architecture is very similar to modern GPUs' command 11buffers. 12 13The design principles of the tracing dataflow are: 14 15* The tracing fastpath is based on direct writes into a shared memory buffer. 16* Highly optimized for low-overhead writing. NOT optimized for low-latency 17 reading. 18* Trace data is eventually committed in the central trace buffer by the end 19 of the trace or when explicit flush requests are issued via the IPC channel. 20* Producers are untrusted and should not be able to see each-other's trace data, 21 as that would leak sensitive information. 22 23In the general case, there are two types buffers involved in a trace. When 24pulling data from the Linux kernel's ftrace infrastructure, there is a third 25stage of buffering (one per-CPU) involved: 26 27![Buffers](/docs/images/buffers.png) 28 29#### Tracing service's central buffers 30 31These buffers (yellow, in the picture above) are defined by the user in the 32`buffers` section of the [trace config](config.md). In the most simple cases, 33one tracing session = one buffer, regardless of the number of data sources and 34producers. 35 36This is the place where the tracing data is ultimately kept, while in memory, 37whether it comes from the kernel ftrace infrastructure, from some other data 38source in `traced_probes` or from another userspace process using the 39[Perfetto SDK](/docs/instrumentation/tracing-sdk.md). 40At the end of the trace (or during, if in [streaming mode]) these buffers are 41written into the output trace file. 42 43These buffers can contain a mixture of trace packets coming from different data 44sources and even different producer processes. What-goes-where is defined in the 45[buffers mapping section](config.md#dynamic-buffer-mapping) of the trace config. 46Because of this, the tracing buffers are not shared across processes, to avoid 47cross-talking and information leaking across producer processes. 48 49#### Shared memory buffers 50 51Each producer process has one memory buffer shared 1:1 with the tracing service 52(blue, in the picture above), regardless of the number of data sources it hosts. 53This buffer is a temporary staging buffer and has two purposes: 54 551. Zero-copy on the writer path. This buffer allows direct serialization of the 56 tracing data from the writer fastpath in a memory region directly readable by 57 the tracing service. 58 592. Decoupling writes from reads of the tracing service. The tracing service has 60 the job of moving trace packets from the shared memory buffer (blue) into the 61 central buffer (yellow) as fast as it can. 62 The shared memory buffer hides the scheduling and response latencies of the 63 tracing service, allowing the producer to keep writing without losing data 64 when the tracing service is temporarily blocked. 65 66#### Ftrace buffer 67 68When the `linux.ftrace` data source is enabled, the kernel will have its own 69per-CPU buffers. These are unavoidable because the kernel cannot write directly 70into user-space buffers. The `traced_probes` process will periodically read 71those buffers, convert the data into binary protos and follow the same dataflow 72of userspace tracing. These buffers need to be just large enough to hold data 73between two frace read cycles (`TraceConfig.FtraceConfig.drain_period_ms`). 74 75## Life of a trace packet 76 77Here is a summary to understand the dataflow of trace packets across buffers. 78Consider the case of a producer process hosting two data sources writing packets 79at a different rates, both targeting the same central buffer. 80 811. When each data source starts writing, it will grab a free page of the shared 82 memory buffer and directly serialize proto-encoded tracing data onto it. 83 842. When a page of the shared memory buffer is filled, the producer will send an 85 async IPC to the service, asking it to copy the shared memory page just 86 written. Then, the producer will grab the next free page in the shared memory 87 buffer and keep writing. 88 893. When the service receives the IPC, it copies the shared memory page into 90 the central buffer and marks the shared memory buffer page as free again. Data 91 sources within the producer are able to reuse that page at this point. 92 934. When the tracing session ends, the service sends a `Flush` request to all 94 data sources. In reaction to this, data sources will commit all outstanding 95 shared memory pages, even if not completely full. The services copies these 96 pages into the service's central buffer. 97 98![Dataflow animation](/docs/images/dataflow.svg) 99 100## Buffer sizing 101 102#### Central buffer sizing 103 104The math for sizing the central buffer is quite straightforward: in the default 105case of tracing without `write_into_file` (when the trace file is written only 106at the end of the trace), the buffer will hold as much data as it has been 107written by the various data sources. 108 109The total length of the trace will be `(buffer size) / (aggregated write rate)`. 110If all producers write at a combined rate of 2 MB/s, a 16 MB buffer will hold 111~ 8 seconds of tracing data. 112 113The write rate is highly dependent on the data sources configured and by the 114activity of the system. 1-2 MB/s is a typical figure on Android traces with 115scheduler tracing, but can go up easily by 1+ orders of magnitude if chattier 116data sources are enabled (e.g., syscall or pagefault tracing). 117 118When using [streaming mode] the buffer needs to be able to hold enough data 119between two `file_write_period_ms` periods (default: 5s). 120For instance, if `file_write_period_ms = 5000` and the write data rate is 2 MB/s 121the central buffer needs to be at least 5 * 2 = 10 MB to avoid data losses. 122 123#### Shared memory buffer sizing 124 125The sizing of the shared memory buffer depends on: 126 127* The scheduling characteristics of the underlying system, i.e. for how long the 128 tracing service can be blocked on the scheduler queues. This is a function of 129 the kernel configuration and nice-ness level of the `traced` process. 130* The max write rate of all data sources within a producer process. 131 132Suppose that a producer produce at a max rate of 8 MB/s. If `traced` gets 133blocked for 10 ms, the shared memory buffer need to be at least 8 * 0.01 = 80 KB 134to avoid losses. 135 136Empirical measurements suggest that on most Android systems a shared memory 137buffer size of 128-512 KB is good enough. 138 139The default shared memory buffer size is 256 KB. When using the Perfetto Client 140Library, this value can be tweaked setting `TracingInitArgs.shmem_size_hint_kb`. 141 142WARNING: if a data source writes very large trace packets in a single batch, 143either the shared memory buffer needs to be big enough to handle that or 144`BufferExhaustedPolicy.kStall` must be employed. 145 146For instance, consider a data source that emits a 2MB screenshot every 10s. 147Its (simplified) code, would look like: 148```c++ 149for (;;) { 150 ScreenshotDataSource::Trace([](ScreenshotDataSource::TraceContext ctx) { 151 auto packet = ctx.NewTracePacket(); 152 packet.set_bitmap(Grab2MBScreenshot()); 153 }); 154 std::this_thread::sleep_for(std::chrono::seconds(10)); 155} 156``` 157 158Its average write rate is 2MB / 10s = 200 KB/s. However, the data source will 159create bursts of 2MB back-to-back without yielding; it is limited only by the 160tracing serialization overhead. In practice, it will write the 2MB buffer at 161O(GB/s). If the shared memory buffer is < 2 MB, the tracing service will be 162unlikely to catch up at that rate and data losses will be experienced. 163 164In a case like this these options are: 165 166* Increase the size of the shared memory buffer in the producer that hosts the 167 data source. 168* Split the write into chunks spaced by some delay. 169* Adopt the `BufferExhaustedPolicy::kStall` when defining the data source: 170 171```c++ 172class ScreenshotDataSource : public perfetto::DataSource<ScreenshotDataSource> { 173 public: 174 constexpr static BufferExhaustedPolicy kBufferExhaustedPolicy = 175 BufferExhaustedPolicy::kStall; 176 ... 177}; 178``` 179 180## Debugging data losses 181 182#### Ftrace kernel buffer losses 183 184When using the Linux kernel ftrace data source, losses can occur in the 185kernel -> userspace path if the `traced_probes` process gets blocked for too 186long. 187 188At the trace proto level, losses in this path are recorded: 189* In the [`FtraceCpuStats`][FtraceCpuStats] messages, emitted both at the 190 beginning and end of the trace. If the `overrun` field is non-zero, data has 191 been lost. 192* In the [`FtraceEventBundle.lost_events`][FtraceEventBundle] field. This allows 193 to locate precisely the point where data loss happened. 194 195At the TraceProcessor SQL level, this data is available in the `stats` table: 196 197```sql 198> select * from stats where name like 'ftrace_cpu_overrun_end' 199name idx severity source value 200-------------------- -------------------- -------------------- ------ ------ 201ftrace_cpu_overrun_e 0 data_loss trace 0 202ftrace_cpu_overrun_e 1 data_loss trace 0 203ftrace_cpu_overrun_e 2 data_loss trace 0 204ftrace_cpu_overrun_e 3 data_loss trace 0 205ftrace_cpu_overrun_e 4 data_loss trace 0 206ftrace_cpu_overrun_e 5 data_loss trace 0 207ftrace_cpu_overrun_e 6 data_loss trace 0 208ftrace_cpu_overrun_e 7 data_loss trace 0 209``` 210 211These losses can be mitigated either increasing 212[`TraceConfig.FtraceConfig.buffer_size_kb`][FtraceConfig] 213 or decreasing 214[`TraceConfig.FtraceConfig.drain_period_ms`][FtraceConfig] 215 216#### Shared memory losses 217 218Tracing data can be lost in the shared memory due to bursts while traced is 219blocked. 220 221At the trace proto level, losses in this path are recorded: 222 223* In [`TraceStats.BufferStats.trace_writer_packet_loss`][BufferStats]. 224* In [`TracePacket.previous_packet_dropped`][TracePacket]. 225 Caveat: the very first packet emitted by every data source is also marked as 226 `previous_packet_dropped=true`. This is because the service has no way to 227 tell if that was the truly first packet or everything else before that was 228 lost. 229 230At the TraceProcessor SQL level, this data is available in the `stats` table: 231```sql 232> select * from stats where name = 'traced_buf_trace_writer_packet_loss' 233name idx severity source value 234-------------------- -------------------- -------------------- --------- ----- 235traced_buf_trace_wri 0 data_loss trace 0 236``` 237 238#### Central buffer losses 239 240Data losses in the central buffer can happen for two different reasons: 241 2421. When using `fill_policy: RING_BUFFER`, older tracing data is overwritten by 243 virtue of wrapping in the ring buffer. 244 These losses are recorded, at the trace proto level, in 245 [`TraceStats.BufferStats.chunks_overwritten`][BufferStats]. 246 2472. When using `fill_policy: DISCARD`, newer tracing data committed after the 248 buffer is full is dropped. 249 These losses are recorded, at the trace proto level, in 250 [`TraceStats.BufferStats.chunks_discarded`][BufferStats]. 251 252At the TraceProcessor SQL level, this data is available in the `stats` table, 253one entry per central buffer: 254 255```sql 256> select * from stats where name = 'traced_buf_chunks_overwritten' or name = 'traced_buf_chunks_discarded' 257name idx severity source value 258-------------------- -------------------- -------------------- ------- ----- 259traced_buf_chunks_di 0 info trace 0 260traced_buf_chunks_ov 0 data_loss trace 0 261``` 262 263Summary: the best way to detect and debug data losses is to use Trace Processor 264and issue the query: 265`select * from stats where severity = 'data_loss' and value != 0` 266 267## Atomicity and ordering guarantees 268 269A "writer sequence" is the sequence of trace packets emitted by a given 270TraceWriter from a data source. In almost all cases 1 data source == 2711+ TraceWriter(s). Some data sources that support writing from multiple threads 272typically create one TraceWriter per thread. 273 274* Trace packets written from a sequence are emitted in the trace file in the 275 same order they have been written. 276 277* There is no ordering guarantee between packets written by different sequences. 278 Sequences are, by design, concurrent and more than one linearization is 279 possible. The service does NOT respect global timestamp ordering across 280 different sequences. If two packets from two sequences were emitted in 281 global timestamp order, the service can still emit them in the trace file in 282 the opposite order. 283 284* Trace packets are atomic. If a trace packet is emitted in the trace file, it 285 is guaranteed to be contain all the fields that the data source wrote. If a 286 trace packet is large and spans across several shared memory buffer pages, the 287 service will save it in the trace file only if it can observe that all 288 fragments have been committed without gaps. 289 290* If a trace packet is lost (e.g. because of wrapping in the ring buffer 291 or losses in the shared memory buffer), no further trace packet will be 292 emitted for that sequence, until all packets before are dropped as well. 293 In other words, if the tracing service ends up in a situation where it sees 294 packets 1,2,5,6 for a sequence, it will only emit 1, 2. If, however, new 295 packets (e.g., 7, 8, 9) are written and they overwrite 1, 2, clearing the gap, 296 the full sequence 5, 6, 7, 8, 9 will be emitted. 297 This behavior, however, doesn't hold when using [streaming mode] because, 298 in that case, the periodic read will consume the packets in the buffer and 299 clear the gaps, allowing the sequence to restart. 300 301## Incremental state in trace packets 302 303In many cases trace packets are fully independent of each other and can be 304processed and interpreted without further context. 305In some cases, however, they can have _incremental state_ and behave similarly 306to inter-frame video encoding techniques, where some frames require the keyframe 307to be present to be meaningfully decoded. 308 309Here are are two concrete examples: 310 3111. Ftrace scheduling slices and /proc/pid scans. ftrace scheduling events are 312 keyed by thread id. In most cases users want to map those events back to the 313 parent process (the thread-group). To solve this, when both the 314 `linux.ftrace` and the `linux.process_stats` data sources are enabled in a 315 Perfetto trace, the latter does capture process<>thread associations from 316 the /proc pseudo-filesystem, whenever a new thread-id is seen by ftrace. 317 A typical trace in this case looks as follows: 318 ``` 319 # From process_stats's /proc scanner. 320 pid: 610; ppid: 1; cmdline: "/system/bin/surfaceflinger" 321 322 # From ftrace 323 timestamp: 95054961131912; sched_wakeup: pid: 610; target_cpu: 2; 324 timestamp: 95054977528943; sched_switch: prev_pid: 610 prev_prio: 98 325 ``` 326 The /proc entry is emitted only once per process to avoid bloating the size of 327 the trace. In lack of data losses this is fine to be able to reconstruct all 328 scheduling events for that pid. If, however, the process_stats packet gets 329 dropped in the ring buffer, there will be no way left to work out the process 330 details for all the other ftrace events that refer to that PID. 331 3322. The [Track Event library](/docs/instrumentation/track-events) in the Perfetto 333 SDK makes extensive use of string interning. Mos strings and descriptors 334 (e.g. details about processes / threads) are emitted only once and later 335 referred to using a monotonic ID. In case a loss of the descriptor packet, 336 it is not possible to make fully sense of those events. 337 338Trace Processor has built-in mechanism that detect loss of interning data and 339skips ingesting packets that refer to missing interned strings or descriptors. 340 341When using tracing in ring-buffer mode, these types of losses are very likely to 342happen. 343 344There are two mitigations for this: 345 3461. Issuing periodic invalidations of the incremental state via 347 [`TraceConfig.IncrementalStateConfig.clear_period_ms`][IncrStateConfig]. 348 This will cause the data sources that make use of incremental state to 349 periodically drop the interning / process mapping tables and re-emit the 350 descriptors / strings on the next occurrence. This mitigates quite well the 351 problem in the context of ring-buffer traces, as long as the 352 `clear_period_ms` is one order of magnitude lower than the estimated length 353 of trace data in the central trace buffer. 354 3552. Recording the incremental state into a dedicated buffer (via 356 `DataSourceConfig.target_buffer`). This technique is quite commonly used with 357 in the ftrace + process_stats example mentioned before, recording the 358 process_stats packet in a dedicated buffer less likely to wrap (ftrace events 359 are much more frequent than descriptors for new processes). 360 361## Flushes and windowed trace importing 362 363Another common problem experienced in traces that involve multiple data sources 364is the non-synchronous nature of trace commits. As explained in the 365[Life of a trace packet](#life-of-a-trace-packet) section above, trace data is 366committed only when a full memory page of the shared memory buffer is filled (or 367at when the tracing session ends). In most cases, if data sources produce events 368at a regular cadence, pages are filled quite quickly and events are committed 369in the central buffers within seconds. 370 371In some other cases, however, a data source can emit events only sporadically. 372Imagine the case of a data source that emits events when the display is turned 373on/off. Such an infrequent event might end up being staged in the shared memory 374buffer for very long times and can end up being committed in the trace buffer 375hours after it happened. 376 377Another scenario where this can happen is when using ftrace and when a 378particular CPU is idle most of the time or gets hot-unplugged (ftrace uses 379per-cpu buffers). In this case a CPU might record little-or-no data for several 380minutes while the other CPUs pump thousands of new trace events per second. 381 382This causes two side effects that end up breaking user expectations or causing 383bugs: 384 385* The UI can show an abnormally long timeline with a huge gap in the middle. 386 The packet ordering of events doesn't matter for the UI because events are 387 sorted by timestamp at import time. The trace in this case will contain very 388 recent events plus a handful of stale events that happened hours before. The 389 UI, for correctness, will try to display all events, showing a handful of 390 early events, followed by a huge temporal gap when nothing happened, 391 followed by the stream of recent events. 392 393* When recording long traces, Trace Processor can show import errors of the form 394 "XXX event out-of-order". This is because. in order to limit the memory usage 395 at import time, Trace Processor sorts events using a sliding window. If trace 396 packets are too out-of-order (trace file order vs timestamp order), the 397 sorting will fail and some packets will be dropped. 398 399#### Mitigations 400 401The best mitigation for these sort of problems is to specify a 402[`flush_period_ms`][TraceConfig] in the trace config (10-30 seconds is usually 403good enough for most cases), especially when recording long traces. 404 405This will cause the tracing service to issue periodic flush requests to data 406sources. A flush requests causes the data source to commit the shared memory 407buffer pages into the central buffer, even if they are not completely full. 408By default, a flush issued only at the end of the trace. 409 410In case of long traces recorded without `flush_period_ms`, another option is to 411pass the `--full-sort` option to `trace_processor_shell` when importing the 412trace. Doing so will disable the windowed sorting at the cost of a higher 413memory usage (the trace file will be fully buffered in memory before parsing). 414 415[streaming mode]: /docs/concepts/config#long-traces 416[TraceConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig 417[FtraceConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig 418[IncrStateConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig.IncrementalStateConfig 419[FtraceCpuStats]: /docs/reference/trace-packet-proto.autogen#FtraceCpuStats 420[FtraceEventBundle]: /docs/reference/trace-packet-proto.autogen#FtraceEventBundle 421[TracePacket]: /docs/reference/trace-packet-proto.autogen#TracePacket 422[BufferStats]: /docs/reference/trace-packet-proto.autogen#TraceStats.BufferStats 423