1# ProtoZero design document
2
ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization library
purpose-built for Perfetto's tracing use cases.
5
6## Motivations
7
8ProtoZero has been designed and optimized for proto serialization, which is used
9by all Perfetto tracing paths.
10Deserialization was introduced only at a later stage of the project and is
11mainly used by offline tools
(e.g., [TraceProcessor](/docs/analysis/trace-processor.md)).
13The _zero-copy zero-alloc zero-syscall_ statement applies only to the
14serialization code.
15
16Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace
17event in Perfetto is a proto
18(see [TracePacket](/docs/reference/trace-packet-proto.autogen) reference). This
19allows events to be strongly typed and makes it easier for the team to maintain
20backwards compatibility using a language that is understood across the board.
21
22Tracing fast-paths need to have very little overhead, because instrumentation
23points are sprinkled all over the codebase of projects like Android
24and Chrome and are performance-critical.
25
Overhead here is not just the CPU time (or instructions retired) it takes to
execute the instrumentation point. A big source of overhead in a tracing
system is the working set of the instrumentation points, specifically the
extra I-cache and D-cache misses that slow down the non-tracing code _after_
the tracing instrumentation point.
31
32The major design departures of ProtoZero from canonical C++ protobuf libraries
33like [libprotobuf](https://github.com/google/protobuf) are:
34
35* Treating serialization and deserialization as different use-cases served by
36  different code.
37
38* Optimizing for binary size and working-set-size on the serialization paths.
39
40* Ignoring most of the error checking and long-tail features of protobuf
41  (repeated vs optional, full type checks).
42
* ProtoZero is not designed as a general-purpose protobuf de/serialization
  library. It is heavily customized to keep the trace-writing code minimal and
  to allow the compiler to see through the architectural layers.
46
* Code generated by ProtoZero needs to be hermetic. When building the
  amalgamated [Tracing SDK](/docs/instrumentation/tracing-sdk.md), the
  Perfetto tracing sources must not depend on any library other than the C++
  standard library and the C library.
51
52## Usage
53
54At the build-system level, ProtoZero is extremely similar to the conventional
55libprotobuf library.
The ProtoZero `.proto -> .pbzero.{cc,h}` compiler is built on top of the
libprotobuf parser and compiler infrastructure. ProtoZero is implemented as a
`protoc` compiler plugin.
59
60ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends
61on libprotobuf's parser and compiler). The `.pbzero.{cc,h}` code generated by
62it, however, has no runtime dependency (not even header-only dependencies) on
63libprotobuf.
64
65In order to generate ProtoZero stubs from proto you need to:
66
671. Build the ProtoZero compiler plugin, which lives in
68   [src/protozero/protoc_plugin/](/src/protozero/protoc_plugin/).
69   ```bash
70   tools/ninja -C out/default protozero_plugin protoc
71   ```
72
732. Invoke the libprotobuf `protoc` compiler passing the `protozero_plugin`:
74   ```bash
   out/default/protoc \
       --plugin=protoc-gen-plugin=out/default/protozero_plugin \
       --plugin_out=wrapper_namespace=pbzero:/tmp/ \
       test_msg.proto
79   ```
80   This generates `/tmp/test_msg.pbzero.{cc,h}`.
81
82   NOTE: The .cc file is always empty. ProtoZero-generated code is header only.
83   The .cc file is emitted only because some build systems' rules assume that
84   protobuf codegens generate both a .cc and a .h file.
85
86## Proto serialization
87
The quickest way to understand ProtoZero's design principles is to start from a
small example and compare the generated code between libprotobuf and ProtoZero.
90
91```protobuf
92syntax = "proto2";
93
94message TestMsg {
95  optional string str_val = 1;
96  optional int32 int_val = 2;
97  repeated TestMsg nested = 3;
98}
99```
100
101#### libprotobuf approach
102
103The libprotobuf approach is to generate a C++ class that has one member for each
104proto field, with dedicated serialization and de-serialization methods.
105
106```bash
107out/default/protoc  --cpp_out=. test_msg.proto
108```
109
This generates `test_msg.pb.{cc,h}`. Heavily simplified, the generated class
looks as follows:
112
113```c++
114// This class is generated by the standard protoc compiler in the .pb.h source.
115class TestMsg : public protobuf::MessageLite {
 private:
  int32_t int_val_;
  ArenaStringPtr str_val_;
  RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg>
120
121 public:
122  const std::string& str_val() const;
123  void set_str_val(const std::string& value);
124
125  bool has_int_val() const;
126  int32_t int_val() const;
127  void set_int_val(int32_t value);
128
129  ::TestMsg* add_nested();
130  ::TestMsg* mutable_nested(int index);
131  const TestMsg& nested(int index);
132
133  std::string SerializeAsString();
134  bool ParseFromString(const std::string&);
};
136```
137
The main characteristics of these stubs are:
139
140* Code generated from .proto messages can be used in the codebase as general
141  purpose objects, without ever using the `SerializeAs*()` or `ParseFrom*()`
  methods (although anecdotal evidence suggests that most projects use these
  proto-generated classes only at the de/serialization endpoints).
144
145* The end-to-end journey of serializing a proto involves two steps:
146  1. Setting the individual int / string / vector fields of the generated class.
147  2. Doing a serialization pass over these fields.
148
149  In turn this has side-effects on the code generated. STL copy/assignment
150  operators for strings and vectors are non-trivial because, for instance, they
151  need to deal with dynamic memory resizing.
152
153#### ProtoZero approach
154
155```c++
156// This class is generated by the ProtoZero plugin in the .pbzero.h source.
157class TestMsg : public protozero::Message {
158 public:
159  void set_str_val(const std::string& value) {
160    AppendBytes(/*field_id=*/1, value.data(), value.size());
161  }
162  void set_str_val(const char* data, size_t size) {
163    AppendBytes(/*field_id=*/1, data, size);
164  }
165  void set_int_val(int32_t value) {
166    AppendVarInt(/*field_id=*/2, value);
167  }
168  TestMsg* add_nested() {
169    return BeginNestedMessage<TestMsg>(/*field_id=*/3);
170  }
};
172```
173
174The ProtoZero-generated stubs are append-only. As the `set_*`, `add_*` methods
175are invoked, the passed arguments are directly serialized into the target
176buffer. This introduces some limitations:
177
178* Readback is not possible: these classes cannot be used as C++ struct
179  replacements.
180
* No error-checking is performed: nothing prevents a non-repeated field from
  being emitted twice in the serialized proto if the caller accidentally calls
  a `set_*()` method twice. Basic type checks are still performed at
  compile-time though.
185
* Nested fields must be filled in a stack fashion and cannot be written
  interleaved. Once a nested message is started, its fields must be set before
  going back to setting the fields of the parent message. This turns out not
  to be a problem for most tracing use-cases.
190
In exchange, the append-only design has a number of advantages:
192
193* The classes generated by ProtoZero don't add any extra state on top of the
194  base class they derive (`protozero::Message`). They define only inline
195  setter methods that call base-class serialization methods. Compilers can
196  see through all the inline expansions of these methods.
197
* As a consequence, the binary cost of ProtoZero is independent of the number
  of protobuf messages defined and their fields, and depends only on the
  number of `set_*`/`add_*` calls. The binary cost of unused proto messages
  and fields has anecdotally been a big issue with libprotobuf.
202
203* The serialization methods don't involve any copy or dynamic allocation. The
204  inline expansion calls directly into the corresponding `AppendVarInt()` /
205  `AppendString()` methods of `protozero::Message`.
206
* This allows trace events to be serialized directly into the
  [tracing shared memory buffers](/docs/concepts/buffers.md), even if they are
  not contiguous.
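
The append-only idea can be illustrated with a minimal sketch. This is not the
actual ProtoZero code: `MiniMsg`, its fixed-size output buffer and the private
helpers are hypothetical names that only mirror the shape of the generated
setters shown above. Each setter encodes straight into the output buffer and
keeps no per-field state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for protozero::Message. It keeps no per-field state,
// only a write cursor into a caller-provided buffer.
class MiniMsg {
 public:
  MiniMsg(uint8_t* buf, size_t capacity)
      : begin_(buf), cur_(buf), end_(buf + capacity) {}

  // Field 1 (str_val), wire type 2 (length-delimited).
  void set_str_val(const char* data, size_t size) {
    AppendBytes(/*field_id=*/1, data, size);
  }
  // Field 2 (int_val), wire type 0 (varint). Negative int32 values are
  // sign-extended to 64 bits, as the protobuf encoding requires.
  void set_int_val(int32_t value) {
    AppendVarInt(/*field_id=*/2,
                 static_cast<uint64_t>(static_cast<int64_t>(value)));
  }
  // Number of bytes serialized so far.
  size_t size() const { return static_cast<size_t>(cur_ - begin_); }

 private:
  void WriteByte(uint8_t b) {
    assert(cur_ < end_);  // A real writer would switch buffers instead.
    *cur_++ = b;
  }
  void WriteVarInt(uint64_t v) {
    while (v >= 0x80) {
      WriteByte(static_cast<uint8_t>((v & 0x7f) | 0x80));
      v >>= 7;
    }
    WriteByte(static_cast<uint8_t>(v));
  }
  void AppendVarInt(uint32_t field_id, uint64_t value) {
    WriteVarInt((field_id << 3) | 0);  // Preamble: field id + wire type 0.
    WriteVarInt(value);
  }
  void AppendBytes(uint32_t field_id, const void* data, size_t size) {
    WriteVarInt((field_id << 3) | 2);  // Preamble: field id + wire type 2.
    WriteVarInt(size);
    assert(size <= static_cast<size_t>(end_ - cur_));
    std::memcpy(cur_, data, size);
    cur_ += size;
  }

  uint8_t* begin_;
  uint8_t* cur_;
  uint8_t* end_;
};
```

With this sketch, calling `set_int_val(42)` followed by `set_str_val("foo", 3)`
leaves `10 2a 0a 03 66 6f 6f` in the buffer: fields appear in call order, and
calling a setter twice simply emits the field twice.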
210
211### Scattered buffer writing
212
213A key part of the ProtoZero design is supporting direct serialization on
214non-globally-contiguous sequences of contiguous memory regions.
215
216This happens by decoupling `protozero::Message`, the base class for all the
217generated classes, from the `protozero::ScatteredStreamWriter`.
The problem it solves is the following: ProtoZero is based on direct
serialization into shared memory buffer chunks. These chunks are 4KB - 32KB in
most cases. At the same time, the amount of data a caller can write into an
individual message is not bounded by the chunk size: a trace event can be up
to 256 MiB in size.
222
223![ProtoZero scattered buffers diagram](/docs/images/protozero-ssw.png)
224
225#### Fast-path
226
At all times the underlying `ScatteredStreamWriter` knows the bounds of the
current buffer. All write operations are bounds-checked and hit a slow-path
when crossing the buffer boundary.
230
231Most write operations can be completed within the current buffer boundaries.
232In that case, the cost of a `set_*` operation is in essence a `memcpy()` with
233the extra overhead of var-int encoding for protobuf preambles and
234length-delimited fields.
235
236#### Slow-path
237
238When crossing the boundary, the slow-path asks the
239`ScatteredStreamWriter::Delegate` for a new buffer. The implementation of
240`GetNewBuffer()` is up to the client. In tracing use-cases, that call will
241acquire a new thread-local chunk from the tracing shared memory buffer.
242
243Other heap-based implementations are possible. For instance, the ProtoZero
244sources provide a helper class `HeapBuffered<TestMsg>`, mainly used in tests (see
245[scattered_heap_buffer.h](/include/perfetto/protozero/scattered_heap_buffer.h)),
246which allocates a new heap buffer when crossing the boundaries of the current
247one.
248
249Consider the following example:
250
251```c++
252TestMsg outer_msg;
253for (int i = 0; i < 1000; i++) {
254  TestMsg* nested = outer_msg.add_nested();
255  nested->set_int_val(42);
256}
257```
258
At some point one of the `set_int_val()` calls will hit the slow-path and
acquire a new buffer. The overall idea is to have a serialization mechanism
that is extremely lightweight most of the time and that requires some extra
function calls only when crossing a buffer boundary, so that their cost gets
amortized across all trace events.
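
The fast-path / slow-path split can be sketched as follows. This is a
simplified illustration under stated assumptions, not the real
`ScatteredStreamWriter`: the `Range`, `Delegate`, `HeapDelegate` and
`ScatteredWriter` types below are hypothetical and only reuse the
`GetNewBuffer()` name from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// A contiguous memory range [begin, end).
struct Range {
  uint8_t* begin;
  uint8_t* end;
};

// Hypothetical delegate interface, asked for a new range on the slow path.
class Delegate {
 public:
  virtual ~Delegate() = default;
  virtual Range GetNewBuffer() = 0;
};

// Heap-backed delegate: every call allocates a fresh fixed-size slice.
class HeapDelegate : public Delegate {
 public:
  explicit HeapDelegate(size_t slice_size) : slice_size_(slice_size) {}
  Range GetNewBuffer() override {
    slices_.push_back(std::make_unique<uint8_t[]>(slice_size_));
    uint8_t* p = slices_.back().get();
    return {p, p + slice_size_};
  }
  size_t num_slices() const { return slices_.size(); }

 private:
  size_t slice_size_;
  std::vector<std::unique_ptr<uint8_t[]>> slices_;
};

class ScatteredWriter {
 public:
  explicit ScatteredWriter(Delegate* delegate)
      : delegate_(delegate), cur_(delegate->GetNewBuffer()) {}

  void WriteBytes(const void* src, size_t size) {
    const uint8_t* p = static_cast<const uint8_t*>(src);
    // Fast path: one bounds check and one memcpy.
    if (size <= static_cast<size_t>(cur_.end - cur_.begin)) {
      std::memcpy(cur_.begin, p, size);
      cur_.begin += size;
      return;
    }
    WriteBytesSlowPath(p, size);
  }

 private:
  // Slow path: fill what is left, then keep asking the delegate for more.
  void WriteBytesSlowPath(const uint8_t* p, size_t size) {
    while (size > 0) {
      if (cur_.begin == cur_.end)
        cur_ = delegate_->GetNewBuffer();
      const size_t avail = static_cast<size_t>(cur_.end - cur_.begin);
      const size_t n = size < avail ? size : avail;
      std::memcpy(cur_.begin, p, n);
      cur_.begin += n;
      p += n;
      size -= n;
    }
  }

  Delegate* delegate_;
  Range cur_;
};
```

In this sketch the delegate is only involved when a write does not fit in the
current range, so its cost is paid once per slice rather than once per write.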
264
265In the context of the overall Perfetto tracing use case, the slow-path involves
266grabbing a process-local mutex and finding the next free chunk in the shared
267memory buffer. Hence writes are lock-free as long as they happen within the
268thread-local chunk and require a critical section to acquire a new chunk once
269every 4KB-32KB (depending on the trace configuration).
270
The assumption is that the likelihood that two threads will cross the chunk
boundary and call `GetNewBuffer()` at the same time is extremely low and hence
the critical section is uncontended most of the time.
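
A rough sketch of this locking pattern, assuming a hypothetical `ChunkPool`
shared by all threads and a hypothetical per-thread `ThreadWriter` (neither is
a Perfetto class): the mutex is taken only when a writer runs out of space in
its current chunk, and every other write touches thread-local state only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical shared pool of fixed-size chunks carved out of one buffer.
class ChunkPool {
 public:
  ChunkPool(size_t num_chunks, size_t chunk_size)
      : chunk_size_(chunk_size), storage_(num_chunks * chunk_size) {}

  // Slow path: the only place where the lock is taken.
  uint8_t* AcquireChunk() {
    std::lock_guard<std::mutex> lock(mutex_);
    assert((next_chunk_ + 1) * chunk_size_ <= storage_.size());
    ++acquisitions_;
    return storage_.data() + (next_chunk_++) * chunk_size_;
  }

  size_t chunk_size() const { return chunk_size_; }
  // Read without the lock: only meaningful once all writers are done.
  size_t acquisitions() const { return acquisitions_; }

 private:
  const size_t chunk_size_;
  std::vector<uint8_t> storage_;
  std::mutex mutex_;
  size_t next_chunk_ = 0;
  size_t acquisitions_ = 0;
};

// Hypothetical per-thread writer: owns one chunk at a time and writes into it
// without any synchronization.
class ThreadWriter {
 public:
  explicit ThreadWriter(ChunkPool* pool) : pool_(pool) {}

  void WriteByte(uint8_t b) {
    if (cur_ == end_) {  // Also true before the first write.
      cur_ = pool_->AcquireChunk();
      end_ = cur_ + pool_->chunk_size();
    }
    *cur_++ = b;  // Fast path: no lock involved.
  }

 private:
  ChunkPool* pool_;
  uint8_t* cur_ = nullptr;
  uint8_t* end_ = nullptr;
};
```

Writing 10000 bytes through one `ThreadWriter` backed by 4096-byte chunks takes
the lock three times in this sketch, once per chunk acquired.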
274
275```mermaid
276sequenceDiagram
277  participant C as Call site
278  participant M as Message
279  participant SSR as ScatteredStreamWriter
280  participant DEL as Buffer Delegate
281  C->>M: set_int_val(...)
282  activate C
283  M->>SSR: AppendVarInt(...)
284  deactivate C
285  Note over C,SSR: A typical write on the fast-path
286
287  C->>M: set_str_val(...)
288  activate C
289  M->>SSR: AppendString(...)
290  SSR->>DEL: GetNewBuffer(...)
291  deactivate C
292  Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks.
293```
294
295### Deferred patching
296
297Nested messages in the protobuf binary encoding are prefixed with their
298varint-encoded size.
299
300Consider the following:
301
302```c++
303TestMsg* nested = outer_msg.add_nested();
304nested->set_int_val(42);
305nested->set_str_val("foo");
306```
307
308The canonical encoding of this protobuf message, using libprotobuf, would be:
309
310```bash
3111a 07 0a 03 66 6f 6f 10 2a
312^-+-^ ^-----+------^ ^-+-^
313  |         |          |
314  |         |          +--> Field ID: 2 [int_val], value = 42.
315  |         |
316  |         +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f).
317  |
318  +------> Field ID: 3 [nested], length: 7  # !!!
319```
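
To make the ordering problem concrete, here is a minimal sketch of one way to
produce the bytes above: encode the nested payload first, then emit the length
prefix. The helper names (`PutVarInt`, `PutTag`, `EncodeOuterWithNested`) are
hypothetical and this is not how libprotobuf is implemented internally; it only
shows that the length has to be known before the payload can be written in
place.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Standard (minimal) varint encoding.
void PutVarInt(Bytes* out, uint64_t v) {
  while (v >= 0x80) {
    out->push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<uint8_t>(v));
}

// Field preamble: (field_id << 3) | wire_type.
void PutTag(Bytes* out, uint32_t field_id, uint32_t wire_type) {
  PutVarInt(out, (static_cast<uint64_t>(field_id) << 3) | wire_type);
}

// Encodes TestMsg{ nested: { str_val, int_val } } in field-number order.
Bytes EncodeOuterWithNested(const std::string& str_val, int32_t int_val) {
  // Pass 1: encode the nested message into a temporary buffer.
  Bytes payload;
  PutTag(&payload, /*field_id=*/1, /*wire_type=*/2);
  PutVarInt(&payload, str_val.size());
  payload.insert(payload.end(), str_val.begin(), str_val.end());
  PutTag(&payload, /*field_id=*/2, /*wire_type=*/0);
  PutVarInt(&payload, static_cast<uint64_t>(static_cast<int64_t>(int_val)));

  // Pass 2: only now is the size known, so the prefix can be written.
  Bytes out;
  PutTag(&out, /*field_id=*/3, /*wire_type=*/2);
  PutVarInt(&out, payload.size());
  out.insert(out.end(), payload.begin(), payload.end());
  return out;
}
```

`EncodeOuterWithNested("foo", 42)` returns `1a 07 0a 03 66 6f 6f 10 2a`, at the
cost of an intermediate buffer and an extra copy.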
320
321The second byte in this sequence (07) is problematic for direct encoding. At the
322point where `outer_msg.add_nested()` is called, we can't possibly know upfront
323what the overall size of the nested message will be (in this case, 5 + 2 = 7).
324
325The way we get around this in ProtoZero is by reserving four bytes for the
326_size_ of each nested message and back-filling them once the message is
327finalized (or when we try to set a field in one of the parent messages).
328We do this by encoding the size of the message using redundant varint encoding,
329in this case: `87 80 80 00` instead of `07`.
330
331At the C++ level, the `protozero::Message` class holds a pointer to its `size`
332field, which typically points to the beginning of the message, where the four
333bytes are reserved, and back-fills it in the `Message::Finalize()` pass.
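
The redundant encoding can be sketched as below. The function names and the
`kSizeFieldBytes` constant are assumptions made for this example, not
necessarily the names used in the ProtoZero sources. The point is that a
fixed-width varint with continuation bits set on all but the last byte decodes
to the same value as the minimal form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kSizeFieldBytes = 4;

// Writes `value` as a fixed-width 4-byte varint: the first three bytes always
// carry the continuation bit, even when their payload bits are zero.
void WriteRedundantVarInt(uint32_t value, uint8_t* dst) {
  assert(value < (1u << 28));  // 4 bytes x 7 payload bits each.
  for (size_t i = 0; i < kSizeFieldBytes; ++i) {
    const uint8_t msb = (i < kSizeFieldBytes - 1) ? 0x80 : 0x00;
    dst[i] = static_cast<uint8_t>((value & 0x7f) | msb);
    value >>= 7;
  }
}

// Standard varint decoder. Returns the number of bytes consumed.
size_t ReadVarInt(const uint8_t* src, uint64_t* value) {
  *value = 0;
  size_t i = 0;
  for (unsigned shift = 0; shift < 64; shift += 7) {
    const uint8_t b = src[i++];
    *value |= static_cast<uint64_t>(b & 0x7f) << shift;
    if (!(b & 0x80))
      break;
  }
  return i;
}
```

`WriteRedundantVarInt(7, buf)` produces `87 80 80 00`, which a standard decoder
reads back as 7. Four bytes of 7 payload bits each give 28 bits, which is
consistent with the 256 MiB upper bound on message size mentioned earlier.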
334
335This works fine for cases where the entire message lies in one contiguous buffer
336but opens a further challenge: a message can be several MBs big. Looking at this
337from the overall tracing perspective, the shared memory buffer chunk that holds
338the beginning of a message can be long gone (i.e. committed in the central
339service buffer) by the time we get to the end.
340
341In order to support this use case, at the tracing code level (outside of
342ProtoZero), when a message crosses the buffer boundary, its `size` field gets
343redirected to a temporary patch buffer
344(see [patch_list.h](/src/tracing/core/patch_list.h)). This patch buffer is then
345sent out-of-band, piggybacking over the next commit IPC (see
[Tracing Protocol ABI](/docs/design-docs/api-and-abi.md#tracing-protocol-abi)).
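
The following is a rough sketch of the patching idea only. The `Patch` layout,
the `CommittedChunks` map and `ApplyPatches()` are hypothetical and do not
reproduce the contents of `patch_list.h` or the commit IPC. The sketch assumes
that a patch identifies a chunk, an offset inside it and the four back-filled
size bytes, and that the receiving side copies those bytes into its own copy of
the chunk.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical out-of-band patch: where the reserved size bytes live and what
// they should finally contain.
struct Patch {
  uint32_t chunk_id;
  uint16_t offset;
  std::array<uint8_t, 4> size_field;
};

// Hypothetical stand-in for the chunks already committed to the service.
using CommittedChunks = std::map<uint32_t, std::vector<uint8_t>>;

// Applies each patch to the committed copy of its chunk. Patches that refer to
// an unknown chunk or to an out-of-range offset are skipped.
void ApplyPatches(const std::vector<Patch>& patches, CommittedChunks* chunks) {
  for (const Patch& p : patches) {
    auto it = chunks->find(p.chunk_id);
    if (it == chunks->end())
      continue;
    std::vector<uint8_t>& data = it->second;
    if (p.offset + p.size_field.size() > data.size())
      continue;
    std::copy(p.size_field.begin(), p.size_field.end(),
              data.begin() + p.offset);
  }
}
```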
347
348### Performance characteristics
349
350NOTE: For the full code of the benchmark see
351      `/src/protozero/test/protozero_benchmark.cc`
352
We consider two scenarios: writing a simple event and writing a nested event.
354
355#### Simple event
356
Consists of filling a flat proto message with 4 integers (2 x 32-bit,
2 x 64-bit) and a 32-byte string, as follows:
359
360```c++
template <typename T>
void FillMessage_Simple(T* msg) {
362  msg->set_field_int32(...);
363  msg->set_field_uint32(...);
364  msg->set_field_int64(...);
365  msg->set_field_uint64(...);
366  msg->set_field_string(...);
367}
368```
369
370#### Nested event
371
372Consists of filling a similar message which is recursively nested 3 levels deep:
373
374```c++
template <typename T>
void FillMessage_Nested(T* msg, int depth = 0) {
376  FillMessage_Simple(msg);
377  if (depth < 3) {
378    auto* child = msg->add_field_nested();
379    FillMessage_Nested(child, depth + 1);
380  }
381}
382```
383
384#### Comparison terms
385
386We compare, for the same message type, the performance of ProtoZero,
387libprotobuf and a speed-of-light serializer.
388
The speed-of-light serializer is a very simple C++ class that just appends
data into a linear buffer, making all sorts of favourable assumptions: it does
not use any binary-stable encoding, it does not perform bounds checking, all
writes are 64-bit aligned, and it does not deal with thread-safety.
393
394```c++
395struct SOLMsg {
396  template <typename T>
397  void Append(T x) {
398    // The memcpy will be elided by the compiler, which will emit just a
399    // 64-bit aligned mov instruction.
400    memcpy(reinterpret_cast<void*>(ptr_), &x, sizeof(x));
401    ptr_ += sizeof(x);
402  }
403
404  void set_field_int32(int32_t x) { Append(x); }
405  void set_field_uint32(uint32_t x) { Append(x); }
406  void set_field_int64(int64_t x) { Append(x); }
407  void set_field_uint64(uint64_t x) { Append(x); }
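  // Note: strcpy() returns its destination argument, so ptr_ is not advanced
  // past the copied string here.
  void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); }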
409
410  alignas(uint64_t) char storage_[sizeof(g_fake_input_simple) + 8];
411  char* ptr_ = &storage_[0];
412};
413```
414
415The speed-of-light serializer serves as a reference for _how fast a serializer
416could be if argument marshalling and bound checking were zero cost._
417
418#### Benchmark results
419
420##### Google Pixel 3 - aarch64
421
422```bash
423$ cat out/droid_arm64/args.gn
424target_os = "android"
425is_clang = true
426is_debug = false
427target_cpu = "arm64"
428
429$ ninja -C out/droid_arm64/ perfetto_benchmarks && \
430  adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \
431  adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*'
432
433------------------------------------------------------------------------
434Benchmark                                 Time           CPU Iterations
435------------------------------------------------------------------------
436BM_Protozero_Simple_Libprotobuf         402 ns        398 ns    1732807
437BM_Protozero_Simple_Protozero           242 ns        239 ns    2929528
438BM_Protozero_Simple_SpeedOfLight        118 ns        117 ns    6101381
439BM_Protozero_Nested_Libprotobuf        1810 ns       1800 ns     390468
440BM_Protozero_Nested_Protozero           780 ns        773 ns     901369
441BM_Protozero_Nested_SpeedOfLight        138 ns        136 ns    5147958
442```
443
444##### HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux
445
446```bash
447
448$ cat out/linux_clang_release/args.gn
449is_clang = true
450is_debug = false
451
452$ ninja -C out/linux_clang_release/ perfetto_benchmarks && \
453  out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto*
454
455------------------------------------------------------------------------
456Benchmark                                 Time           CPU Iterations
457------------------------------------------------------------------------
458BM_Protozero_Simple_Libprotobuf         428 ns        428 ns    1624801
459BM_Protozero_Simple_Protozero           261 ns        261 ns    2715544
460BM_Protozero_Simple_SpeedOfLight        111 ns        111 ns    6297387
461BM_Protozero_Nested_Libprotobuf        1625 ns       1625 ns     436411
462BM_Protozero_Nested_Protozero           843 ns        843 ns     849302
463BM_Protozero_Nested_SpeedOfLight        140 ns        140 ns    5012910
464```
465