# ProtoZero design document

ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization library
purposefully built for Perfetto's tracing use cases.

## Motivations

ProtoZero has been designed and optimized for proto serialization, which is used
by all Perfetto tracing paths.
Deserialization was introduced only at a later stage of the project and is
mainly used by offline tools
(e.g., [TraceProcessor](/docs/analysis/trace-processor.md)).
The _zero-copy zero-alloc zero-syscall_ statement applies only to the
serialization code.

Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace
event in Perfetto is a proto
(see [TracePacket](/docs/reference/trace-packet-proto.autogen) reference). This
allows events to be strongly typed and makes it easier for the team to maintain
backwards compatibility using a language that is understood across the board.

Tracing fast-paths need to have very little overhead, because instrumentation
points are sprinkled all over the codebase of projects like Android
and Chrome and are performance-critical.

Overhead here is not just defined as the CPU time (or instructions retired) it
takes to execute the instrumentation point. A big source of overhead in a
tracing system is the working set of the instrumentation points,
specifically the extra I-cache and D-cache misses which slow down the
non-tracing code _after_ the tracing instrumentation point.

The major design departures of ProtoZero from canonical C++ protobuf libraries
like [libprotobuf](https://github.com/google/protobuf) are:

* Treating serialization and deserialization as different use-cases served by
  different code.

* Optimizing for binary size and working-set-size on the serialization paths.

* Ignoring most of the error checking and long-tail features of protobuf
  (repeated vs optional, full type checks).

* ProtoZero is not designed as a general-purpose protobuf de/serialization
  library and is heavily customized to keep the trace-writing code minimal and
  to allow the compiler to see through the architectural layers.

* Code generated by ProtoZero needs to be hermetic. When building the
  amalgamated [Tracing SDK](/docs/instrumentation/tracing-sdk.md), all the
  Perfetto tracing sources must not depend on any library other than the C++
  standard library and the C library.

## Usage

At the build-system level, ProtoZero is extremely similar to the conventional
libprotobuf library.
The ProtoZero `.proto -> .pbzero.{cc,h}` compiler is based on top of the
libprotobuf parser and compiler infrastructure. ProtoZero is implemented as a
`protoc` compiler plugin.

ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends
on libprotobuf's parser and compiler). The `.pbzero.{cc,h}` code generated by
it, however, has no runtime dependency (not even header-only dependencies) on
libprotobuf.

In order to generate ProtoZero stubs from a proto you need to:

1. Build the ProtoZero compiler plugin, which lives in
   [src/protozero/protoc_plugin/](/src/protozero/protoc_plugin/).
   ```bash
   tools/ninja -C out/default protozero_plugin protoc
   ```

2. Invoke the libprotobuf `protoc` compiler passing the `protozero_plugin`:
   ```bash
   out/default/protoc \
       --plugin=protoc-gen-plugin=out/default/protozero_plugin \
       --plugin_out=wrapper_namespace=pbzero:/tmp/ \
       test_msg.proto
   ```
   This generates `/tmp/test_msg.pbzero.{cc,h}`.

   NOTE: The .cc file is always empty. ProtoZero-generated code is header only.
   The .cc file is emitted only because some build systems' rules assume that
   protobuf codegens generate both a .cc and a .h file.

## Proto serialization

The quickest way to understand ProtoZero design principles is to start from a
small example and compare the generated code between libprotobuf and ProtoZero.

```protobuf
syntax = "proto2";

message TestMsg {
  optional string str_val = 1;
  optional int32 int_val = 2;
  repeated TestMsg nested = 3;
}
```

#### libprotobuf approach

The libprotobuf approach is to generate a C++ class that has one member for each
proto field, with dedicated serialization and de-serialization methods.

```bash
out/default/protoc --cpp_out=. test_msg.proto
```

generates test_msg.pb.{cc,h}. With many degrees of simplification, it looks
as follows:

```c++
// This class is generated by the standard protoc compiler in the .pb.h source.
class TestMsg : public protobuf::MessageLite {
 private:
  int32 int_val_;
  ArenaStringPtr str_val_;
  RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg>

 public:
  const std::string& str_val() const;
  void set_str_val(const std::string& value);

  bool has_int_val() const;
  int32_t int_val() const;
  void set_int_val(int32_t value);

  ::TestMsg* add_nested();
  ::TestMsg* mutable_nested(int index);
  const TestMsg& nested(int index);

  std::string SerializeAsString();
  bool ParseFromString(const std::string&);
};
```

The main characteristics of these stubs are:

* Code generated from .proto messages can be used in the codebase as general
  purpose objects, without ever using the `SerializeAs*()` or `ParseFrom*()`
  methods (although anecdotal evidence suggests that most projects use these
  proto-generated classes only at the de/serialization endpoints).

* The end-to-end journey of serializing a proto involves two steps:
  1. Setting the individual int / string / vector fields of the generated class.
  2. Doing a serialization pass over these fields.

  In turn this has side-effects on the code generated. STL copy/assignment
  operators for strings and vectors are non-trivial because, for instance, they
  need to deal with dynamic memory resizing.

#### ProtoZero approach

```c++
// This class is generated by the ProtoZero plugin in the .pbzero.h source.
class TestMsg : public protozero::Message {
 public:
  void set_str_val(const std::string& value) {
    AppendBytes(/*field_id=*/1, value.data(), value.size());
  }
  void set_str_val(const char* data, size_t size) {
    AppendBytes(/*field_id=*/1, data, size);
  }
  void set_int_val(int32_t value) {
    AppendVarInt(/*field_id=*/2, value);
  }
  TestMsg* add_nested() {
    return BeginNestedMessage<TestMsg>(/*field_id=*/3);
  }
};
```

The ProtoZero-generated stubs are append-only. As the `set_*` and `add_*`
methods are invoked, the passed arguments are directly serialized into the
target buffer. This introduces some limitations:

* Readback is not possible: these classes cannot be used as C++ struct
  replacements.

* No error-checking is performed: nothing prevents a non-repeated field from
  being emitted twice in the serialized proto if the caller accidentally calls
  a `set_*()` method twice. Basic type checks are still performed at
  compile-time though.

* Nested fields must be filled in a stack fashion and cannot be written
  interleaved. Once a nested message is started, its fields must be set before
  going back to setting the fields of the parent message. This turns out not to
  be a problem for most tracing use-cases.

This has a number of advantages:

* The classes generated by ProtoZero don't add any extra state on top of the
  base class they derive from (`protozero::Message`). They define only inline
  setter methods that call base-class serialization methods. Compilers can
  see through all the inline expansions of these methods.

* As a consequence of that, the binary cost of ProtoZero is independent of the
  number of protobuf messages defined and their fields, and depends only on the
  number of `set_*`/`add_*` calls. The binary cost of unused proto messages and
  fields has anecdotally been a big issue with libprotobuf.

* The serialization methods don't involve any copy or dynamic allocation. The
  inline expansion calls directly into the corresponding `AppendVarInt()` /
  `AppendString()` methods of `protozero::Message`.

* This allows serializing trace events directly into the
  [tracing shared memory buffers](/docs/concepts/buffers.md), even if they are
  not contiguous.

### Scattered buffer writing

A key part of the ProtoZero design is supporting direct serialization on
non-globally-contiguous sequences of contiguous memory regions.

This happens by decoupling `protozero::Message`, the base class for all the
generated classes, from the `protozero::ScatteredStreamWriter`.
The problem it solves is the following: ProtoZero is based on direct
serialization into shared memory buffer chunks. These chunks are 4KB - 32KB in
most cases. At the same time, the caller can write far more data than that into
an individual message: a trace event can be up to 256 MiB big.

![ProtoZero scattered buffers diagram](/docs/images/protozero-ssw.png)

#### Fast-path

At all times the underlying `ScatteredStreamWriter` knows the bounds of the
current buffer. All write operations are bounds-checked and hit a slow-path
when crossing the buffer boundary.

Most write operations can be completed within the current buffer boundaries.
In that case, the cost of a `set_*` operation is in essence a `memcpy()` with
the extra overhead of var-int encoding for protobuf preambles and
length-delimited fields.
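
As a rough illustration of what such a fast-path write amounts to, the sketch
below encodes a varint field (preamble plus value) and copies it into the
current buffer only if it fits. `WriteVarInt()` and `AppendVarIntField()` are
hypothetical helpers written for this example, not the actual ProtoZero
implementation; returning `nullptr` stands in for entering the slow-path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Encodes `value` as a protobuf varint at `dst` and returns the new end.
inline uint8_t* WriteVarInt(uint64_t value, uint8_t* dst) {
  while (value >= 0x80) {
    *dst++ = static_cast<uint8_t>((value & 0x7F) | 0x80);
    value >>= 7;
  }
  *dst++ = static_cast<uint8_t>(value);
  return dst;
}

// Hypothetical helper: appends a varint field (wire type 0) to [cur, end).
// Returns the advanced cursor, or nullptr if the field does not fit and the
// caller would have to take the slow-path.
inline uint8_t* AppendVarIntField(uint32_t field_id,
                                  uint64_t value,
                                  uint8_t* cur,
                                  uint8_t* end) {
  assert(cur <= end);
  uint8_t scratch[16];  // At most 5 preamble bytes + 10 value bytes.
  uint8_t* p = WriteVarInt(static_cast<uint64_t>(field_id) << 3, scratch);
  p = WriteVarInt(value, p);
  const size_t size = static_cast<size_t>(p - scratch);
  if (size > static_cast<size_t>(end - cur))
    return nullptr;  // Bounds check failed: slow-path needed.
  std::memcpy(cur, scratch, size);  // Fast-path: a small memcpy.
  return cur + size;
}
```

In this sketch the only work on the fast-path is the varint encoding, one
comparison against the buffer end and a short copy.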

#### Slow-path

When crossing the boundary, the slow-path asks the
`ScatteredStreamWriter::Delegate` for a new buffer. The implementation of
`GetNewBuffer()` is up to the client. In tracing use-cases, that call will
acquire a new thread-local chunk from the tracing shared memory buffer.

Other heap-based implementations are possible. For instance, the ProtoZero
sources provide a helper class `HeapBuffered<TestMsg>`, mainly used in tests (see
[scattered_heap_buffer.h](/include/perfetto/protozero/scattered_heap_buffer.h)),
which allocates a new heap buffer when crossing the boundaries of the current
one.

Consider the following example:

```c++
TestMsg outer_msg;
for (int i = 0; i < 1000; i++) {
  TestMsg* nested = outer_msg.add_nested();
  nested->set_int_val(42);
}
```

At some point one of the `set_int_val()` calls will hit the slow-path and
acquire a new buffer. The overall idea is to have a serialization mechanism
that is extremely lightweight most of the time and that requires some extra
function calls only when crossing a buffer boundary, so that their cost gets
amortized across all trace events.

In the context of the overall Perfetto tracing use case, the slow-path involves
grabbing a process-local mutex and finding the next free chunk in the shared
memory buffer. Hence writes are lock-free as long as they happen within the
thread-local chunk and require a critical section to acquire a new chunk once
every 4KB-32KB (depending on the trace configuration).

The assumption is that the likelihood that two threads will cross the chunk
boundary and call `GetNewBuffer()` at the same time is extremely low and hence
the critical section is uncontended most of the time.
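
The split between writer and delegate can be sketched as follows. This is a
simplified, heap-backed illustration and not the actual
`ScatteredStreamWriter` code: the `Range`, `Delegate`, `HeapDelegate` and
`ScatteredWriter` types are hypothetical names that only mirror the roles
described above, and the chunk size is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

struct Range {
  uint8_t* begin;
  uint8_t* end;
};

// Hypothetical delegate interface: hands out the next contiguous region.
class Delegate {
 public:
  virtual ~Delegate() = default;
  virtual Range GetNewBuffer() = 0;
};

// Heap-backed delegate that allocates fixed-size chunks on demand.
class HeapDelegate : public Delegate {
 public:
  explicit HeapDelegate(size_t chunk_size) : chunk_size_(chunk_size) {
    assert(chunk_size_ > 0);
  }
  Range GetNewBuffer() override {
    chunks_.push_back(std::make_unique<uint8_t[]>(chunk_size_));
    uint8_t* begin = chunks_.back().get();
    return {begin, begin + chunk_size_};
  }
  size_t num_chunks() const { return chunks_.size(); }
  const uint8_t* chunk(size_t i) const { return chunks_[i].get(); }

 private:
  size_t chunk_size_;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};

class ScatteredWriter {
 public:
  explicit ScatteredWriter(Delegate* delegate) : delegate_(delegate) {}

  void WriteBytes(const uint8_t* src, size_t size) {
    while (size > 0) {
      if (write_ptr_ == end_) {
        // Slow-path: the current chunk is exhausted, ask for a new one.
        const Range range = delegate_->GetNewBuffer();
        write_ptr_ = range.begin;
        end_ = range.end;
      }
      // Fast-path: copy as much as fits in the current chunk.
      const size_t n = std::min(size, static_cast<size_t>(end_ - write_ptr_));
      std::memcpy(write_ptr_, src, n);
      write_ptr_ += n;
      src += n;
      size -= n;
    }
  }

 private:
  Delegate* delegate_;
  uint8_t* write_ptr_ = nullptr;
  uint8_t* end_ = nullptr;
};
```

In a tracing setup the delegate would return shared memory chunks instead of
heap allocations; the writer logic would stay the same.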

```mermaid
sequenceDiagram
    participant C as Call site
    participant M as Message
    participant SSR as ScatteredStreamWriter
    participant DEL as Buffer Delegate
    C->>M: set_int_val(...)
    activate C
    M->>SSR: AppendVarInt(...)
    deactivate C
    Note over C,SSR: A typical write on the fast-path

    C->>M: set_str_val(...)
    activate C
    M->>SSR: AppendString(...)
    SSR->>DEL: GetNewBuffer(...)
    deactivate C
    Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks.
```

### Deferred patching

Nested messages in the protobuf binary encoding are prefixed with their
varint-encoded size.

Consider the following:

```c++
TestMsg* nested = outer_msg.add_nested();
nested->set_int_val(42);
nested->set_str_val("foo");
```

The canonical encoding of this protobuf message, using libprotobuf, would be:

```bash
1a 07 0a 03 66 6f 6f 10 2a
^-+-^ ^-----+------^ ^-+-^
  |         |          |
  |         |          +--> Field ID: 2 [int_val], value = 42.
  |         |
  |         +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f).
  |
  +------> Field ID: 3 [nested], length: 7  # !!!
```

The second byte in this sequence (07) is problematic for direct encoding. At the
point where `outer_msg.add_nested()` is called, we can't possibly know upfront
what the overall size of the nested message will be (in this case, 5 + 2 = 7).

The way we get around this in ProtoZero is by reserving four bytes for the
_size_ of each nested message and back-filling them once the message is
finalized (or when we try to set a field in one of the parent messages).
We do this by encoding the size of the message using redundant varint encoding,
in this case: `87 80 80 00` instead of `07`.
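
The redundant encoding can be sketched as below. `WritePaddedVarInt()` and
`ParseVarInt()` are hypothetical helpers written for this example, and
`kSizeFieldBytes` is an assumed name for the four reserved bytes. Every byte
except the last keeps its continuation bit set, even when the remaining payload
bits are all zero, so the field always occupies four bytes and a standard
varint parser still decodes the same value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed name for the number of bytes reserved per nested-message size.
constexpr size_t kSizeFieldBytes = 4;

// Writes `value` as a fixed-width redundant varint of kSizeFieldBytes bytes.
inline void WritePaddedVarInt(uint32_t value, uint8_t* dst) {
  assert(value < (1u << (7 * kSizeFieldBytes)));  // Must fit in 28 bits.
  for (size_t i = 0; i < kSizeFieldBytes; ++i) {
    const uint8_t continuation = (i + 1 < kSizeFieldBytes) ? 0x80 : 0x00;
    dst[i] = static_cast<uint8_t>((value & 0x7F) | continuation);
    value >>= 7;
  }
}

// Plain varint decoder (up to 5 bytes), used to show that the padded form
// reads back as the original value.
inline uint32_t ParseVarInt(const uint8_t* src) {
  uint32_t value = 0;
  for (unsigned shift = 0; shift < 32; shift += 7) {
    const uint8_t byte = *src++;
    value |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0)
      break;
  }
  return value;
}
```

Four bytes carry 4 x 7 = 28 bits of payload, which matches the 256 MiB upper
bound on message size mentioned earlier.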

At the C++ level, the `protozero::Message` class holds a pointer to its `size`
field, which typically points to the beginning of the message, where the four
bytes are reserved, and back-fills it in the `Message::Finalize()` pass.

This works fine for cases where the entire message lies in one contiguous buffer
but opens a further challenge: a message can be several MBs big. Looking at this
from the overall tracing perspective, the shared memory buffer chunk that holds
the beginning of a message can be long gone (i.e. committed in the central
service buffer) by the time we get to the end.

In order to support this use case, at the tracing code level (outside of
ProtoZero), when a message crosses the buffer boundary, its `size` field gets
redirected to a temporary patch buffer
(see [patch_list.h](/src/tracing/core/patch_list.h)). This patch buffer is then
sent out-of-band, piggybacking over the next commit IPC (see
[Tracing Protocol ABI](/docs/design-docs/api-and-abi.md#tracing-protocol-abi)).

### Performance characteristics

NOTE: For the full code of the benchmark see
      `/src/protozero/test/protozero_benchmark.cc`

We consider two scenarios: writing a simple event and writing a nested event.

#### Simple event

Consists of filling a flat proto message with 4 integers (2 x 32-bit,
2 x 64-bit) and a 32-byte string, as follows:

```c++
template <typename T>
void FillMessage_Simple(T* msg) {
  msg->set_field_int32(...);
  msg->set_field_uint32(...);
  msg->set_field_int64(...);
  msg->set_field_uint64(...);
  msg->set_field_string(...);
}
```

#### Nested event

Consists of filling a similar message which is recursively nested 3 levels deep:

```c++
template <typename T>
void FillMessage_Nested(T* msg, int depth = 0) {
  FillMessage_Simple(msg);
  if (depth < 3) {
    auto* child = msg->add_field_nested();
    FillMessage_Nested(child, depth + 1);
  }
}
```

#### Comparison terms

We compare, for the same message type, the performance of ProtoZero,
libprotobuf and a speed-of-light serializer.

The speed-of-light serializer is a very simple C++ class that just appends
data into a linear buffer making all sorts of favourable assumptions. It does
not use any binary-stable encoding, it does not perform bounds checking,
all its writes are 64-bit aligned and it doesn't deal with thread-safety.

```c++
struct SOLMsg {
  template <typename T>
  void Append(T x) {
    // The memcpy will be elided by the compiler, which will emit just a
    // 64-bit aligned mov instruction.
    memcpy(reinterpret_cast<void*>(ptr_), &x, sizeof(x));
    ptr_ += sizeof(x);
  }

  void set_field_int32(int32_t x) { Append(x); }
  void set_field_uint32(uint32_t x) { Append(x); }
  void set_field_int64(int64_t x) { Append(x); }
  void set_field_uint64(uint64_t x) { Append(x); }
  void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); }

  alignas(uint64_t) char storage_[sizeof(g_fake_input_simple) + 8];
  char* ptr_ = &storage_[0];
};
```

The speed-of-light serializer serves as a reference for _how fast a serializer
could be if argument marshalling and bounds checking were zero cost._

#### Benchmark results

##### Google Pixel 3 - aarch64

```bash
$ cat out/droid_arm64/args.gn
target_os = "android"
is_clang = true
is_debug = false
target_cpu = "arm64"

$ ninja -C out/droid_arm64/ perfetto_benchmarks && \
  adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \
  adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*'

------------------------------------------------------------------------
Benchmark                                 Time           CPU Iterations
------------------------------------------------------------------------
BM_Protozero_Simple_Libprotobuf         402 ns        398 ns    1732807
BM_Protozero_Simple_Protozero           242 ns        239 ns    2929528
BM_Protozero_Simple_SpeedOfLight        118 ns        117 ns    6101381
BM_Protozero_Nested_Libprotobuf        1810 ns       1800 ns     390468
BM_Protozero_Nested_Protozero           780 ns        773 ns     901369
BM_Protozero_Nested_SpeedOfLight        138 ns        136 ns    5147958
```

##### HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux

```bash
$ cat out/linux_clang_release/args.gn
is_clang = true
is_debug = false

$ ninja -C out/linux_clang_release/ perfetto_benchmarks && \
  out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto*

------------------------------------------------------------------------
Benchmark                                 Time           CPU Iterations
------------------------------------------------------------------------
BM_Protozero_Simple_Libprotobuf         428 ns        428 ns    1624801
BM_Protozero_Simple_Protozero           261 ns        261 ns    2715544
BM_Protozero_Simple_SpeedOfLight        111 ns        111 ns    6297387
BM_Protozero_Nested_Libprotobuf        1625 ns       1625 ns     436411
BM_Protozero_Nested_Protozero           843 ns        843 ns     849302
BM_Protozero_Nested_SpeedOfLight        140 ns        140 ns    5012910
```