1.. _module-pw_tokenizer:
2
3------------
4pw_tokenizer
5------------
6Logging is critical, but developers are often forced to choose between
7additional logging or saving crucial flash space. The ``pw_tokenizer`` module
8helps address this by replacing printf-style strings with binary tokens during
9compilation. This enables extensive logging with substantially less memory
10usage.
11
12.. note::
13  This usage of the term "tokenizer" is not related to parsing! The
14  module is called tokenizer because it replaces a whole string literal with an
15  integer token. It does not parse strings into separate tokens.
16
17The most common application of ``pw_tokenizer`` is binary logging, and it is
18designed to integrate easily into existing logging systems. However, the
19tokenizer is general purpose and can be used to tokenize any strings, with or
20without printf-style arguments.
21
22**Why tokenize strings?**
23
24  * Dramatically reduce binary size by removing string literals from binaries.
25  * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
26    tokens instead of strings. We've seen over 50% reduction in encoded log
27    contents.
28  * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
29  * Remove potentially sensitive log, assert, and other strings from binaries.
30
31Basic overview
32==============
33There are two sides to ``pw_tokenizer``, which we call tokenization and
34detokenization.
35
36  * **Tokenization** converts string literals in the source code to
37    binary tokens at compile time. If the string has printf-style arguments,
38    these are encoded to compact binary form at runtime.
39  * **Detokenization** converts tokenized strings back to the original
40    human-readable strings.
41
42Here's an overview of what happens when ``pw_tokenizer`` is used:
43
44  1. During compilation, the ``pw_tokenizer`` module hashes string literals to
45     generate stable 32-bit tokens.
46  2. The tokenization macro removes these strings by declaring them in an ELF
47     section that is excluded from the final binary.
48  3. After compilation, strings are extracted from the ELF to build a database
49     of tokenized strings for use by the detokenizer. The ELF file may also be
50     used directly.
51  4. During operation, the device encodes the string token and its arguments, if
52     any.
53  5. The encoded tokenized strings are sent off-device or stored.
54  6. Off-device, the detokenizer tools use the token database to decode the
55     strings to human-readable form.
56
57Example: tokenized logging
58--------------------------
59This example demonstrates using ``pw_tokenizer`` for logging. In this example,
60tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
61size (49 → 15 bytes).
62
63**Before**: plain text logging
64
65+------------------+-------------------------------------------+---------------+
66| Location         | Logging Content                           | Size in bytes |
67+==================+===========================================+===============+
68| Source contains  | ``LOG("Battery state: %s; battery         |               |
69|                  | voltage: %d mV", state, voltage);``       |               |
70+------------------+-------------------------------------------+---------------+
71| Binary contains  | ``"Battery state: %s; battery             | 41            |
72|                  | voltage: %d mV"``                         |               |
73+------------------+-------------------------------------------+---------------+
74|                  | (log statement is called with             |               |
75|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
76+------------------+-------------------------------------------+---------------+
77| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
78|                  | voltage: 3989 mV"``                       |               |
79+------------------+-------------------------------------------+---------------+
80| When viewed      | ``"Battery state: CHARGING; battery       |               |
81|                  | voltage: 3989 mV"``                       |               |
82+------------------+-------------------------------------------+---------------+
83
84**After**: tokenized logging
85
86+------------------+-----------------------------------------------------------+---------+
87| Location         | Logging Content                                           | Size in |
88|                  |                                                           | bytes   |
89+==================+===========================================================+=========+
90| Source contains  | ``LOG("Battery state: %s; battery                         |         |
91|                  | voltage: %d mV", state, voltage);``                       |         |
92+------------------+-----------------------------------------------------------+---------+
93| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
94+------------------+-----------------------------------------------------------+---------+
95|                  | (log statement is called with                             |         |
96|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
97+------------------+-----------------------------------------------------------+---------+
98| Device transmits | =============== ============================== ========== | 15      |
99|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
100|                  | --------------- ------------------------------ ---------- |         |
101|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
102|                  |                                                as         |         |
103|                  |                                                varint     |         |
104|                  | =============== ============================== ========== |         |
105+------------------+-----------------------------------------------------------+---------+
106| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
107+------------------+-----------------------------------------------------------+---------+
108
109Getting started
110===============
111Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
112section describes one way ``pw_tokenizer`` might be integrated with a project.
113These steps can be adapted as needed.
114
115  1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
116     are provided. For Make or other build systems, add the files specified in
117     the BUILD.gn's ``pw_tokenizer`` target to the build.
118  2. Use the tokenization macros in your code. See `Tokenization`_.
119  3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
120     linker script. In GN and CMake, this step is done automatically.
121  4. Compile your code to produce an ELF file.
122  5. Run ``database.py create`` on the ELF file to generate a CSV token
123     database. See `Managing token databases`_.
124  6. Commit the token database to your repository. See notes in `Database
125     management`_.
126  7. Integrate a ``database.py add`` command to your build to automatically
127     update the committed token database. In GN, use the
128     ``pw_tokenizer_database`` template to do this. See `Update a database`_.
129  8. Integrate ``detokenize.py`` or the C++ detokenization library with your
130     tools to decode tokenized logs. See `Detokenization`_.
131
132Tokenization
133============
134Tokenization converts a string literal to a token. If it's a printf-style
135string, its arguments are encoded along with it. The results of tokenization can
136be sent off device or stored in place of a full string.
137
138Tokenization macros
139-------------------
140Adding tokenization to a project is simple. To tokenize a string, include
141``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.
142
143Tokenize a string literal
144^^^^^^^^^^^^^^^^^^^^^^^^^
145The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
146token.
147
148.. code-block:: cpp
149
150  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");
151
152.. admonition:: When to use this macro
153
154  Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
155  %-style arguments.
156
157Tokenize to a handler function
158^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization macro,
160since it takes the fewest arguments. It encodes a tokenized string to a
161buffer on the stack. The size of the buffer is set with
162``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.
163
164This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
165backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
166C-linkage function.
167
168.. code-block:: cpp
169
170  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);
171
172  void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
173                                         size_t size_bytes);
174
175``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
176``uintptr_t`` argument to the global handler function. Values like a log level
177can be packed into the ``uintptr_t``.
178
179This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
180facade. The backend for this facade must define the
181``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.
182
183.. code-block:: cpp
184
185  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
186                                             format_string_literal,
187                                             arguments...);
188
189  void pw_tokenizer_HandleEncodedMessageWithPayload(
190      uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);
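
As a rough illustration of packing values into the payload, the sketch below
stores a log level and a module ID in a ``uintptr_t``. The field layout and the
``PackPayload``, ``PayloadLevel``, and ``PayloadModule`` helpers are
hypothetical; they are not part of ``pw_tokenizer``.

.. code-block:: cpp

  #include <cstdint>

  // Hypothetical layout: bits 0-3 hold the log level, bits 4-15 a module ID.
  constexpr unsigned kLevelBits = 4;
  constexpr uintptr_t kLevelMask = 0xF;
  constexpr uintptr_t kModuleMask = 0xFFF;

  // Packs both fields into a single payload value.
  constexpr uintptr_t PackPayload(unsigned level, unsigned module_id) {
    return ((module_id & kModuleMask) << kLevelBits) | (level & kLevelMask);
  }

  // Recovers the log level from a packed payload.
  constexpr unsigned PayloadLevel(uintptr_t payload) {
    return static_cast<unsigned>(payload & kLevelMask);
  }

  // Recovers the module ID from a packed payload.
  constexpr unsigned PayloadModule(uintptr_t payload) {
    return static_cast<unsigned>((payload >> kLevelBits) & kModuleMask);
  }

A handler implementation could call ``PayloadLevel`` and ``PayloadModule`` on
the value it receives to recover the fields. Sixteen bits are used here so that
the layout fits in a ``uintptr_t`` on any target.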
191
192.. admonition:: When to use these macros
193
194  Use anytime a global handler is sufficient, particularly for widely expanded
  macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` and
196  ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient macros
197  for tokenizing printf-style strings.
198
199Tokenize to a callback
200^^^^^^^^^^^^^^^^^^^^^^
201``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
202``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
203the call site. The size of the buffer is set with
204``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.
205
206.. code-block:: cpp
207
208  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);
209
210.. admonition:: When to use this macro
211
212  Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
213  use for another purpose or more flexibility is needed.
214
215Tokenize to a buffer
216^^^^^^^^^^^^^^^^^^^^
217The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
218to a caller-provided buffer.
219
220.. code-block:: cpp
221
222  uint8_t buffer[BUFFER_SIZE];
223  size_t size_bytes = sizeof(buffer);
224  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);
225
226While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
227than the other macros, so its per-use code size overhead is larger.
228
229.. admonition:: When to use this macro
230
231  Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
232  other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
233  widely expanded macros, such as a logging macro, because it will result in
234  larger code size than its alternatives.
235
236.. _module-pw_tokenizer-custom-macro:
237
238Tokenize with a custom macro
239^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Projects may need more flexibility than the standard ``pw_tokenizer`` macros
241provide. To support this, projects may define custom tokenization macros. This
242requires the use of two low-level ``pw_tokenizer`` macros:
243
244.. c:macro:: PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)
245
246  Tokenizes a format string and sets the ``_pw_tokenizer_token`` variable to the
247  token. Must be used in its own scope, since the same variable is used in every
248  invocation.
249
250  The tokenized string uses the specified :ref:`tokenization domain
251  <module-pw_tokenizer-domains>`.  Use ``PW_TOKENIZER_DEFAULT_DOMAIN`` for the
252  default. The token also may be masked; use ``UINT32_MAX`` to keep all bits.
253
254.. c:macro:: PW_TOKENIZER_ARG_TYPES(...)
255
256  Converts a series of arguments to a compact format that replaces the format
257  string literal.
258
259Use these two macros within the custom tokenization macro to call a function
260that does the encoding. The following example implements a custom tokenization
261macro for use with :ref:`module-pw_log_tokenized`.
262
263.. code-block:: cpp
264
265  #include "pw_tokenizer/tokenize.h"
266
  #ifdef __cplusplus
  extern "C" {
  #endif

  void EncodeTokenizedMessage(pw_tokenizer_Payload metadata,
                              pw_tokenizer_Token token,
                              pw_tokenizer_ArgTypes types,
                              ...);

  #ifdef __cplusplus
  }  // extern "C"
  #endif
279
280  #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)         \
281    do {                                                                 \
282      PW_TOKENIZE_FORMAT_STRING(                                         \
283          PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__); \
      EncodeTokenizedMessage(metadata,                                   \
285                             _pw_tokenizer_token,                        \
286                             PW_TOKENIZER_ARG_TYPES(__VA_ARGS__)         \
287                                 PW_COMMA_ARGS(__VA_ARGS__));            \
288    } while (0)
289
290In this example, the ``EncodeTokenizedMessage`` function would handle encoding
291and processing the message. Encoding is done by the
292``pw::tokenizer::EncodedMessage`` class or ``pw::tokenizer::EncodeArgs``
293function from ``pw_tokenizer/encode_args.h``. The encoded message can then be
294transmitted or stored as needed.
295
296.. code-block:: cpp
297
  #include <cstdarg>

  #include "pw_log_tokenized/log_tokenized.h"
  #include "pw_tokenizer/encode_args.h"
300
301  void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
302                              std::span<std::byte> message);
303
304  extern "C" void EncodeTokenizedMessage(const pw_tokenizer_Payload metadata,
305                                         const pw_tokenizer_Token token,
306                                         const pw_tokenizer_ArgTypes types,
307                                         ...) {
308    va_list args;
309    va_start(args, types);
310    pw::tokenizer::EncodedMessage encoded_message(token, types, args);
311    va_end(args);
312
313    HandleTokenizedMessage(metadata, encoded_message);
314  }
315
316.. admonition:: When to use a custom macro
317
318  Use existing tokenization macros whenever possible. A custom macro may be
319  needed to support use cases like the following:
320
321    * Variations of ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` that take
322      different arguments.
323    * Supporting global handler macros that use different handler functions.
324
325Binary logging with pw_tokenizer
326--------------------------------
327String tokenization is perfect for logging. Consider the following log macro,
328which gathers the file, line number, and log message. It calls the ``RecordLog``
329function, which formats the log string, collects a timestamp, and transmits the
330result.
331
332.. code-block:: cpp
333
334  #define LOG_INFO(format, ...) \
335      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)
336
337  void RecordLog(LogLevel level, const char* file, int line, const char* format,
338                 ...) {
339    if (level < current_log_level) {
340      return;
341    }
342
343    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);
344
345    va_list args;
346    va_start(args, format);
347    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
348    va_end(args);
349
    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
351  }
352
353It is trivial to convert this to a binary log using the tokenizer. The
354``RecordLog`` call is replaced with a
355``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
356``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
357timestamp and transmits the message with ``TransmitLog``.
358
359.. code-block:: cpp
360
361  #define LOG_INFO(format, ...)                   \
362      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
363          (pw_tokenizer_Payload)LogLevel_INFO,    \
364          __FILE_NAME__ ":%d " format,            \
365          __LINE__,                               \
          __VA_ARGS__);
367
368  extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
369      uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
370    if (static_cast<LogLevel>(level) >= current_log_level) {
371      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
372    }
373  }
374
375Note that the ``__FILE_NAME__`` string is directly included in the log format
376string. Since the string is tokenized, this has no effect on binary size. A
377``%d`` for the line number is added to the format string, so that changing the
378line of the log message does not generate a new token. There is no overhead for
379additional tokens, but it may not be desirable to fill a token database with
380duplicate log lines.
381
382Tokenizing function names
383-------------------------
384The string literal tokenization functions support tokenizing string literals or
385constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
386special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
387as ``static constexpr char[]`` in C++ instead of the standard ``static const
388char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
389tokenized while compiling C++ with GCC or Clang.
390
391.. code-block:: cpp
392
393  // Tokenize the special function name variables.
394  constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
395  constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);
396
397  // Tokenize the function name variables to a handler function.
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__);
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__);
400
401Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
402They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
404123);`` will not compile.
405
406Tokenization in Python
407----------------------
408The Python ``pw_tokenizer.encode`` module has limited support for encoding
409tokenized messages with the ``encode_token_and_args`` function.
410
411.. autofunction:: pw_tokenizer.encode.encode_token_and_args
412
413Encoding
414--------
The token is a 32-bit hash calculated during compilation. The tokenized string
is encoded as the little-endian token followed by arguments, if any. For example, the
41731-byte string ``You can go about your business.`` hashes to 0xdac9a244.
418This is encoded as 4 bytes: ``44 a2 c9 da``.
419
420Arguments are encoded as follows:
421
422  * **Integers**  (1--10 bytes) --
    `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
424    similarly to Protocol Buffers. Smaller values take fewer bytes.
425  * **Floating point numbers** (4 bytes) -- Single precision floating point.
426  * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
    The top bit of the length indicates whether the string was truncated or
428    not. The remaining 7 bits encode the string length, with a maximum of 127
429    bytes.
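
The following sketch applies these rules to the battery log from the earlier
example and reproduces its 15 encoded bytes. It is an illustration written for
this document, not the module's implementation: ``SketchBuffer`` and the
``Append*`` helpers are hypothetical names, and floating point arguments are
omitted.

.. code-block:: cpp

  #include <cstddef>
  #include <cstdint>
  #include <string_view>

  // Fixed-capacity output buffer so the sketch can be evaluated at compile time.
  struct SketchBuffer {
    uint8_t bytes[160] = {};
    size_t size = 0;

    constexpr void Push(uint8_t byte) {
      if (size < sizeof(bytes)) {
        bytes[size++] = byte;
      }
    }
  };

  // The token is written in little-endian byte order.
  constexpr void AppendToken(uint32_t token, SketchBuffer& out) {
    for (int shift = 0; shift < 32; shift += 8) {
      out.Push(static_cast<uint8_t>(token >> shift));
    }
  }

  // ZigZag maps signed values to unsigned ones so small magnitudes stay small.
  constexpr uint64_t ZigZagEncode(int64_t value) {
    return (static_cast<uint64_t>(value) << 1) ^
           static_cast<uint64_t>(value >> 63);
  }

  // Varint: 7 bits per byte, least significant group first. The top bit of a
  // byte is set when more bytes follow.
  constexpr void AppendVarint(uint64_t value, SketchBuffer& out) {
    while (value >= 0x80) {
      out.Push(static_cast<uint8_t>((value & 0x7F) | 0x80));
      value >>= 7;
    }
    out.Push(static_cast<uint8_t>(value));
  }

  // String: a length byte (top bit set if truncated), then up to 127 bytes.
  constexpr void AppendString(std::string_view text, SketchBuffer& out) {
    constexpr size_t kMaxLength = 127;
    const bool truncated = text.size() > kMaxLength;
    const size_t length = truncated ? kMaxLength : text.size();
    out.Push(static_cast<uint8_t>(length | (truncated ? 0x80u : 0u)));
    for (size_t i = 0; i < length; ++i) {
      out.Push(static_cast<uint8_t>(text[i]));
    }
  }

  // Encodes the battery log with "CHARGING" and 3989 as its arguments.
  constexpr SketchBuffer EncodeBatteryLog() {
    SketchBuffer out;
    AppendToken(0x8e4728d9, out);
    AppendString("CHARGING", out);
    AppendVarint(ZigZagEncode(3989), out);
    return out;
  }

``EncodeBatteryLog()`` yields ``d9 28 47 8e``, then ``08`` and the eight bytes
of ``CHARGING``, then ``aa 3e``, matching the table in the tokenized logging
example.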
430
431.. TODO: insert diagram here!
432
433.. tip::
434  ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
435  short or avoid encoding them as strings (e.g. encode an enum as an integer
436  instead of a string). See also `Tokenized strings as %s arguments`_.
437
438Token generation: fixed length hashing at compile time
439------------------------------------------------------
440String tokens are generated using a modified version of the x65599 hash used by
441the SDBM project. All hashing is done at compile time.
442
443In C code, strings are hashed with a preprocessor macro. For compatibility with
444macros, the hash must be limited to a fixed maximum number of characters. This
445value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
446``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
447the complexity of the hashing macros.
448
449C++ macros use a constexpr function instead of a macro. This function works with
450any length of string and has lower compilation time impact than the C macros.
451For consistency, C++ tokenization uses the same hash algorithm, but the
452calculated values will differ between C and C++ for strings longer than
453``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.
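
To see why C and C++ tokens diverge for long strings, consider the sketch
below. It implements an x65599-style hash with a character limit. Seeding the
hash with the string length is an assumption about how the module modifies
x65599, so treat the values as illustrative rather than as real tokens;
``Hash65599`` is a hypothetical name.

.. code-block:: cpp

  #include <cstddef>
  #include <cstdint>
  #include <string_view>

  constexpr uint32_t kHashConstant = 65599u;

  // Hashes at most hash_length characters of text. Unsigned overflow is
  // intentional: all arithmetic is modulo 2^32.
  constexpr uint32_t Hash65599(std::string_view text, size_t hash_length) {
    // Assumption: the string length acts as the initial hash value.
    uint32_t hash = static_cast<uint32_t>(text.size());
    uint32_t coefficient = kHashConstant;
    const size_t count = text.size() < hash_length ? text.size() : hash_length;
    for (size_t i = 0; i < count; ++i) {
      hash += coefficient * static_cast<uint8_t>(text[i]);
      coefficient *= kHashConstant;
    }
    return hash;
  }

  // With a 3-character limit, strings that differ only in their 4th character
  // hash identically. With no effective limit, they do not.
  static_assert(Hash65599("abcX", 3) == Hash65599("abcY", 3));
  static_assert(Hash65599("abcX", 128) != Hash65599("abcY", 128));

Two strings of equal length that differ only after the limit hash to the same
value. This is why a C hash limited to ``PW_TOKENIZER_CFG_C_HASH_LENGTH``
characters can disagree with the unlimited C++ hash for long strings.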
454
455.. _module-pw_tokenizer-domains:
456
457Tokenization domains
458--------------------
459``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
460string label associated with each tokenized string. This allows projects to keep
461tokens from different sources separate. Potential use cases include the
462following:
463
464* Keep large sets of tokenized strings separate to avoid collisions.
465* Create a separate database for a small number of strings that use truncated
466  tokens, for example only 10 or 16 bits instead of the full 32 bits.
467
468If no domain is specified, the domain is empty (``""``). For many projects, this
469default domain is sufficient, so no additional configuration is required.
470
471.. code-block:: cpp
472
473  // Tokenizes this string to the default ("") domain.
474  PW_TOKENIZE_STRING("Hello, world!");
475
476  // Tokenizes this string to the "my_custom_domain" domain.
477  PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");
478
479The database and detokenization command line tools default to reading from the
480default domain. The domain may be specified for ELF files by appending
481``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
482example, the following reads strings in ``some_domain`` from ``my_image.elf``.
483
484.. code-block:: sh
485
486  ./database.py create --database my_db.csv path/to/my_image.elf#some_domain
487
488See `Managing token databases`_ for information about the ``database.py``
489command line tool.
490
491Smaller tokens with masking
492---------------------------
493``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
494fewer than 32 bits does not improve runtime or code size efficiency. However,
495when tokens are packed into data structures or stored in arrays, the size of the
496token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.
498
499``pw_tokenizer`` allows users to provide a mask to apply to the token. This
500masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token is the token.
502This makes it trivial to decode tokens that use fewer than 32 bits.
503
504Masking functionality is provided through the ``*_MASK`` versions of the macros.
505For example, the following generates 16-bit tokens and packs them into an
506existing value.
507
508.. code-block:: cpp
509
510  constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
511  uint32_t packed_word = (other_bits << 16) | token;
512
513Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
514used for tokens, the more likely two strings are to hash to the same token. See
`Token collisions`_.
516
517Token collisions
518----------------
519Tokens are calculated with a hash function. It is possible for different
520strings to hash to the same token. When this happens, multiple strings will have
521the same token in the database, and it may not be possible to unambiguously
522decode a token.
523
524The detokenization tools attempt to resolve collisions automatically. Collisions
525are resolved based on two things:
526
  - whether the tokenized data matches the string's arguments (if any), and
528  - if / when the string was marked as having been removed from the database.
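
One way to picture these two criteria is as an ordering over the candidate
strings for a token. The sketch below is a guess at such an ordering, not the
detokenizer's actual algorithm: ``Candidate`` and ``Preferred`` are hypothetical
names, and the removal date reuses the packed form of the binary database
format, where ``0xFFFFFFFF`` means the string has not been removed.

.. code-block:: cpp

  #include <cstdint>

  // Assumed sentinel for strings that are still present in the database.
  constexpr uint32_t kNotRemoved = 0xFFFFFFFF;

  struct Candidate {
    bool arguments_decode;  // Did the encoded data match the string's arguments?
    uint32_t removal_date;  // Packed year/month/day, or kNotRemoved.
  };

  // Returns true if candidate a should be preferred over candidate b.
  constexpr bool Preferred(const Candidate& a, const Candidate& b) {
    // A string whose arguments decode successfully wins first.
    if (a.arguments_decode != b.arguments_decode) {
      return a.arguments_decode;
    }
    // Otherwise prefer the string that is still present or was removed later.
    return a.removal_date > b.removal_date;
  }

Because later dates pack to larger values and ``kNotRemoved`` is the largest
value, a single comparison covers both "not removed" and "removed more
recently".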
529
530Working with collisions
531^^^^^^^^^^^^^^^^^^^^^^^
532Collisions may occur occasionally. Run the command
533``python -m pw_tokenizer.database report <database>`` to see information about a
534token database, including any collisions.
535
536If there are collisions, take the following steps to resolve them.
537
538  - Change one of the colliding strings slightly to give it a new token.
539  - In C (not C++), artificial collisions may occur if strings longer than
540    ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening,
541    consider setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value.
542    See ``pw_tokenizer/public/pw_tokenizer/config.h``.
543  - Run the ``mark_removed`` command with the latest version of the build
544    artifacts to mark missing strings as removed. This deprioritizes them in
545    collision resolution.
546
547    .. code-block:: sh
548
549      python -m pw_tokenizer.database mark_removed --database <database> <ELF files>
550
551    The ``purge`` command may be used to delete these tokens from the database.
552
553Probability of collisions
554^^^^^^^^^^^^^^^^^^^^^^^^^
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
557(this is known as the `birthday problem
558<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
559used for tokens, the probability of collisions increases substantially.
560
561This table shows the approximate number of strings that can be hashed to have a
5621% or 50% probability of at least one collision (assuming a uniform, random
563hash).
564
565+-------+---------------------------------------+
566| Token | Collision probability by string count |
567| bits  +--------------------+------------------+
568|       |         50%        |          1%      |
569+=======+====================+==================+
570|   32  |       77000        |        9300      |
571+-------+--------------------+------------------+
572|   31  |       54000        |        6600      |
573+-------+--------------------+------------------+
574|   24  |        4800        |         580      |
575+-------+--------------------+------------------+
576|   16  |         300        |          36      |
577+-------+--------------------+------------------+
578|    8  |          19        |           3      |
579+-------+--------------------+------------------+
580
581Keep this table in mind when masking tokens (see `Smaller tokens with
582masking`_). 16 bits might be acceptable when tokenizing a small set of strings,
583such as module names, but won't be suitable for large sets of strings, like log
584messages.
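
The table values can be approximated with the standard birthday-problem
estimate ``n ≈ sqrt(2 * 2^bits * ln(1 / (1 - p)))``, where ``p`` is the target
collision probability. The sketch below evaluates that estimate;
``StringsForCollisionProbability`` is a hypothetical helper, and the result is
an approximation that assumes a uniform, random hash.

.. code-block:: cpp

  #include <cassert>
  #include <cmath>

  // Approximate number of strings at which the chance of at least one
  // collision reaches the given probability.
  inline double StringsForCollisionProbability(int token_bits,
                                               double probability) {
    assert(token_bits > 0 && token_bits <= 32);
    assert(probability > 0.0 && probability < 1.0);
    const double token_count = std::ldexp(1.0, token_bits);  // 2^token_bits
    return std::sqrt(2.0 * token_count * std::log(1.0 / (1.0 - probability)));
  }

``StringsForCollisionProbability(32, 0.5)`` is roughly 77,000 and
``StringsForCollisionProbability(16, 0.01)`` is roughly 36, in line with the
table.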
585
586Token databases
587===============
588Token databases store a mapping of tokens to the strings they represent. An ELF
589file can be used as a token database, but it only contains the strings for its
590exact build. A token database file aggregates tokens from multiple ELF files, so
591that a single database can decode tokenized strings from any known ELF.
592
593Token databases contain the token, removal date (if any), and string for each
594tokenized string. Two token database formats are supported: CSV and binary.
595
596CSV database format
597-------------------
598The CSV database format has three columns: the token in hexadecimal, the removal
599date (if any) in year-month-day format, and the string literal, surrounded by
600quotes. Quote characters within the string are represented as two quote
601characters.
602
603This example database contains six strings, three of which have removal dates.
604
605.. code-block::
606
607  141c35d5,          ,"The answer: ""%s"""
608  2e668cd6,2019-12-25,"Jello, world!"
609  7b940e2a,          ,"Hello %s! %hd %e"
610  851beeb6,          ,"%u %d"
611  881436a0,2020-01-01,"The answer is: %s"
612  e13b0f94,2020-04-01,"%llu"
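
A minimal sketch of splitting one such line into its three columns follows. It
assumes well-formed input with lowercase hexadecimal tokens, as in the example
above, and does no error handling. ``CsvEntry``, ``ParseHex``, ``Trim``, and
``ParseCsvLine`` are hypothetical names, not part of the ``pw_tokenizer``
tools.

.. code-block:: cpp

  #include <cstddef>
  #include <cstdint>
  #include <string_view>

  struct CsvEntry {
    uint32_t token;
    std::string_view removal_date;  // Empty if the string was never removed.
    std::string_view escaped_text;  // Between the outer quotes; "" stays doubled.
  };

  // Parses lowercase hexadecimal digits, e.g. "2e668cd6".
  constexpr uint32_t ParseHex(std::string_view digits) {
    uint32_t value = 0;
    for (const char c : digits) {
      uint32_t digit = 0;
      if (c >= '0' && c <= '9') {
        digit = static_cast<uint32_t>(c - '0');
      } else if (c >= 'a' && c <= 'f') {
        digit = static_cast<uint32_t>(c - 'a' + 10);
      }
      value = (value << 4) | digit;
    }
    return value;
  }

  // Strips the space padding used for empty removal dates.
  constexpr std::string_view Trim(std::string_view text) {
    while (!text.empty() && text.front() == ' ') text.remove_prefix(1);
    while (!text.empty() && text.back() == ' ') text.remove_suffix(1);
    return text;
  }

  // Only the first two commas separate columns. Any later commas belong to
  // the quoted string.
  constexpr CsvEntry ParseCsvLine(std::string_view line) {
    const size_t first = line.find(',');
    const size_t second = line.find(',', first + 1);
    std::string_view text = line.substr(second + 1);
    text.remove_prefix(1);  // Drop the opening quote.
    text.remove_suffix(1);  // Drop the closing quote.
    return CsvEntry{ParseHex(line.substr(0, first)),
                    Trim(line.substr(first + 1, second - first - 1)),
                    text};
  }

The returned ``escaped_text`` still contains doubled quote characters; a full
parser would collapse each ``""`` pair to a single quote.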
613
614Binary database format
615----------------------
The binary database format consists of a 16-byte header followed by a series
617of 8-byte entries. Each entry stores the token and the removal date, which is
6180xFFFFFFFF if there is none. The string literals are stored next in the same
619order as the entries. Strings are stored with null terminators. See
620`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
621for full details.
622
623The binary form of the CSV database is shown below. It contains the same
624information, but in a more compact and easily processed form. It takes 141 B
625compared with the CSV database's 211 B.
626
627.. code-block:: text
628
629  [header]
630  0x00: 454b4f54 0000534e  TOKENS..
631  0x08: 00000006 00000000  ........
632
633  [entries]
634  0x10: 141c35d5 ffffffff  .5......
635  0x18: 2e668cd6 07e30c19  ..f.....
636  0x20: 7b940e2a ffffffff  *..{....
637  0x28: 851beeb6 ffffffff  ........
638  0x30: 881436a0 07e40101  .6......
639  0x38: e13b0f94 07e40401  ..;.....
640
641  [string table]
642  0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
643  0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
644  0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
645  0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
646  0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
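
The layout arithmetic in this dump can be sketched as follows. The date packing
(year in the upper 16 bits, then month, then day) is inferred from the example
entries, such as ``07e30c19`` for 2019-12-25; see ``token_database.h`` for the
authoritative definition. ``RemovalDate``, ``DecodeRemovalDate``, and
``StringTableOffset`` are hypothetical names.

.. code-block:: cpp

  #include <cstddef>
  #include <cstdint>

  constexpr size_t kHeaderSizeBytes = 16;
  constexpr size_t kEntrySizeBytes = 8;
  constexpr uint32_t kDateNotRemoved = 0xFFFFFFFF;

  struct RemovalDate {
    uint32_t year;
    uint32_t month;
    uint32_t day;
  };

  // Unpacks a removal date, assuming year << 16 | month << 8 | day.
  // Callers should check for kDateNotRemoved before decoding.
  constexpr RemovalDate DecodeRemovalDate(uint32_t packed) {
    return RemovalDate{packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF};
  }

  // The string table starts immediately after the last entry.
  constexpr size_t StringTableOffset(uint32_t entry_count) {
    return kHeaderSizeBytes + kEntrySizeBytes * entry_count;
  }

For the six-entry database above, ``StringTableOffset(6)`` is ``0x40``, which
is where the string table begins in the dump.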
647
648Managing token databases
649------------------------
650Token databases are managed with the ``database.py`` script. This script can be
651used to extract tokens from compilation artifacts and manage database files.
652Invoke ``database.py`` with ``-h`` for full usage information.
653
654An example ELF file with tokenized logs is provided at
655``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
656file to experiment with the ``database.py`` commands.
657
658Create a database
659^^^^^^^^^^^^^^^^^
660The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
661etc.), archives (.a), or existing token databases (CSV or binary).
662
663.. code-block:: sh
664
665  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...
666
667Two database formats are supported: CSV and binary. Provide ``--type binary`` to
668``create`` to generate a binary database instead of the default CSV. CSV
databases are great for checking into source control or for human review.
670Binary databases are more compact and simpler to parse. The C++ detokenizer
671library only supports binary databases currently.
672
673Update a database
674^^^^^^^^^^^^^^^^^
675As new tokenized strings are added, update the database with the ``add``
676command.
677
678.. code-block:: sh
679
680  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...
681
682A CSV token database can be checked into a source repository and updated as code
683changes are made. The build system can invoke ``database.py`` to update the
684database after each build.
685
686GN integration
687^^^^^^^^^^^^^^
688Token databases may be updated or created as part of a GN build. The
689``pw_tokenizer_database`` template provided by
690``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
691strings database or creates a new database with artifacts from one or more GN
692targets or other database files.
693
694To create a new database, set the ``create`` variable to the desired database
695type (``"csv"`` or ``"binary"``). The database will be created in the output
696directory. To update an existing database, provide the path to the database with
697the ``database`` variable.
698
699.. code-block::
700
701  import("//build_overrides/pigweed.gni")
702
703  import("$dir_pw_tokenizer/database.gni")
704
705  pw_tokenizer_database("my_database") {
706    database = "database_in_the_source_tree.csv"
707    targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
708    input_databases = [ "other_database.csv" ]
709  }
710
Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` or ``optional_paths`` option.
713
714.. code-block::
715
716  pw_tokenizer_database("my_database") {
717    database = "database_in_the_source_tree.csv"
718    deps = [ ":apps" ]
719    optional_paths = [ "$root_build_dir/**/*.elf" ]
720  }
721
722.. note::
723
724  The ``paths`` and ``optional_targets`` arguments do not add anything to
725  ``deps``, so there is no guarantee that the referenced artifacts will exist
726  when the database is updated. Provide ``targets`` or ``deps`` or build other
727  GN targets first if this is a concern.
728
729Detokenization
730==============
731Detokenization is the process of expanding a token to the string it represents
732and decoding its arguments. This module provides Python and C++ detokenization
733libraries.
734
735**Example: decoding tokenized logs**
736
737A project might tokenize its log messages with the `Base64 format`_. Consider
738the following log file, which has four tokenized logs and one plain text log:
739
740.. code-block:: text
741
742  20200229 14:38:58 INF $HL2VHA==
743  20200229 14:39:00 DBG $5IhTKg==
744  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
745  20200229 14:39:21 INF $EgFj8lVVAUI=
746  20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=
747
748The project's log strings are stored in a database like the following:
749
750.. code-block::
751
752  1c95bd1c,          ,"Initiating retrieval process for recovery object"
753  2a5388e4,          ,"Determining optimal approach and coordinating vectors"
754  3743540c,          ,"Recovery object retrieval failed with status %s"
755  f2630112,          ,"Calculated acceptable probability of success (%.2f%%)"
756
757Using the detokenizing tools with the database, the logs can be decoded:
758
759.. code-block:: text
760
761  20200229 14:38:58 INF Initiating retrieval process for recovery object
  20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
763  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
764  20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
765  20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
766
767.. note::
768
769  This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
  much space as the default binary format when encoded. For projects that wish
  to interleave tokenized logs with plain text, using Base64 is a worthwhile
  tradeoff.
773
774Python
775------
776To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
777package, and instantiate it with paths to token databases or ELF files.
778
779.. code-block:: python
780
781  import pw_tokenizer
782
783  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')
784
  def process_log_message(log_message):
      # Look up the token and decode any arguments in the payload.
      result = detokenizer.detokenize(log_message.payload)
      print(str(result))
788
789The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
790class, which can be used in place of the standard ``Detokenizer``. This class
791monitors database files for changes and automatically reloads them when they
792change. This is helpful for long-running tools that use detokenization.
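
To make the lookup step concrete, the following sketch decodes two of the log
lines from the example above by hand, using only the standard library. It is not
how ``Detokenizer`` is implemented: ``load_database`` and ``detokenize_by_hand``
are hypothetical helpers, only a single ``%s`` argument is handled, and the
argument layout (a length byte followed by the characters) is an assumption
based on the example messages.

.. code-block:: python

  import base64
  import csv
  import struct

  # Two rows of the CSV database from the example: token, removal date, string.
  DATABASE_CSV = '''\
  1c95bd1c,          ,"Initiating retrieval process for recovery object"
  3743540c,          ,"Recovery object retrieval failed with status %s"
  '''


  def load_database(csv_text):
      """Maps each 32-bit token to its format string."""
      rows = csv.reader(csv_text.splitlines())
      return {int(token, 16): string for token, _, string in rows}


  def detokenize_by_hand(database, message):
      """Decodes a $-prefixed Base64 message with at most one %s argument."""
      data = base64.b64decode(message[1:])  # Drop the '$' prefix.
      (token,) = struct.unpack_from('<I', data)  # Tokens are little-endian.
      format_string = database[token]
      if '%s' not in format_string:
          return format_string
      # Assumed layout: a length byte (low 7 bits), then the characters.
      length = data[4] & 0x7F
      return format_string.replace('%s', data[5:5 + length].decode())


  database = load_database(DATABASE_CSV)
  print(detokenize_by_hand(database, '$HL2VHA=='))
  print(detokenize_by_hand(database, '$DFRDNwlOT1RfUkVBRFk='))

In real tools, prefer ``Detokenizer``, which decodes all printf-style arguments
and handles unknown tokens.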
793
794C++
795---
796The C++ detokenization libraries can be used in C++ or any language that can
797call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
798Java Native Interface (JNI) implementation is provided.
799
800The C++ detokenization library uses binary-format token databases (created with
801``database.py create --type binary``). Read a binary format database from a
802file or include it in the source code. Pass the database array to
803``TokenDatabase::Create``, and construct a detokenizer.
804
805.. code-block:: cpp
806
807  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));
808
809  std::string ProcessLog(span<uint8_t> log_data) {
810    return detokenizer.Detokenize(log_data).BestString();
811  }
812
The ``TokenDatabase`` class verifies that its data is valid before using it. If
the data is invalid, ``TokenDatabase::Create`` returns an empty database for
which ``ok()`` returns false. If the token database is included in the source
code, this check can be done at compile time.
817
818.. code-block:: cpp
819
820  // This line fails to compile with a static_assert if the database is invalid.
  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();
822
823  Detokenizer OpenDatabase(std::string_view path) {
824    std::vector<uint8_t> data = ReadWholeFile(path);
825
826    TokenDatabase database = TokenDatabase::Create(data);
827
828    // This checks if the file contained a valid database. It is safe to use a
829    // TokenDatabase that failed to load (it will be empty), but it may be
830    // desirable to provide a default database or otherwise handle the error.
831    if (database.ok()) {
832      return Detokenizer(database);
833    }
834    return Detokenizer(kDefaultDatabase);
835  }
836
837Base64 format
838=============
The tokenizer encodes messages to a compact binary representation. Some
applications need a textual representation of tokenized strings instead. The
Base64 format makes it easy to use tokenized messages alongside plain text
messages, at a small efficiency cost: Base64-encoded messages occupy about 4/3
(133%) as much memory as binary messages.
844
The Base64 format consists of a ``$`` character followed by the
846Base64-encoded contents of the tokenized message. For example, consider
847tokenizing the string ``This is an example: %d!`` with the argument -1. The
848string's token is 0x4b016e66.
849
850.. code-block:: text
851
852  Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);
853
854   Plain text: This is an example: -1! [23 bytes]
855
856       Binary: 66 6e 01 4b 01          [ 5 bytes]
857
858       Base64: $Zm4BSwE=               [ 9 bytes]
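
The bytes above can be reproduced with a few lines of Python. This is a minimal
sketch, not the module's encoder: ``zigzag_varint`` is a hypothetical helper,
and it assumes that integer arguments are ZigZag-encoded varints, which is
consistent with the ``01`` byte shown for -1.

.. code-block:: python

  import base64
  import struct


  def zigzag_varint(value):
      """Encodes a signed integer as a ZigZag varint (assumed argument format)."""
      zigzag = (value << 1) ^ (value >> 63)  # -1 -> 1, 1 -> 2, -2 -> 3, ...
      encoded = bytearray()
      while True:
          byte = zigzag & 0x7F
          zigzag >>= 7
          if zigzag:
              encoded.append(byte | 0x80)  # More bytes follow.
          else:
              encoded.append(byte)
              return bytes(encoded)


  # Little-endian token followed by the encoded argument.
  binary = struct.pack('<I', 0x4B016E66) + zigzag_varint(-1)
  text = '$' + base64.b64encode(binary).decode()

  print(binary.hex())  # 666e014b01
  print(text)          # $Zm4BSwE=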
859
860Encoding
861--------
862To encode with the Base64 format, add a call to
863``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example:
865
866.. code-block:: cpp
867
868  void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
869                                        size_t size_bytes) {
870    char base64_buffer[64];
871    size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
872        pw::span(encoded_message, size_bytes), base64_buffer);
873
874    TransmitLogMessage(base64_buffer, base64_size);
875  }
876
877Decoding
878--------
Base64 decoding and detokenizing are supported in the Python detokenizer through
880the ``detokenize_base64`` and related functions.
881
882.. tip::
883  The Python detokenization tools support recursive detokenization for prefixed
884  Base64 text. Tokenized strings found in detokenized text are detokenized, so
885  prefixed Base64 messages can be passed as ``%s`` arguments.
886
887  For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
888  passed as an argument to the printf-style string ``Nested message: %s``, which
889  encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
890  as follows:
891
892  ::
893
894   "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"
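
The recursion can be sketched as a loop that keeps substituting prefixed Base64
messages until the text stops changing. The code below is an illustration under
simplifying assumptions, not the logic used by the Python tools: ``DATABASE``
contains only the two tokens from the example, the regular expression is a
simplified pattern, and only one ``%s`` argument is decoded.

.. code-block:: python

  import base64
  import re
  import struct

  # Hypothetical database holding the two tokens from the example above.
  DATABASE = {0x99231646: 'Wow!', 0x615345A4: 'Nested message: %s'}

  # '$' followed by Base64 characters (a simplified pattern for illustration).
  _BASE64_MESSAGE = re.compile(r'\$[A-Za-z0-9+/]+=*')


  def _decode_one(match):
      try:
          data = base64.b64decode(match.group()[1:])
          (token,) = struct.unpack_from('<I', data)
      except (ValueError, struct.error):
          return match.group()  # Not a valid tokenized message.
      format_string = DATABASE.get(token)
      if format_string is None:
          return match.group()  # Unknown token: leave the text untouched.
      if '%s' in format_string:
          length = data[4] & 0x7F  # Assumed: length byte, then characters.
          return format_string.replace('%s', data[5:5 + length].decode())
      return format_string


  def detokenize_recursively(text, max_depth=8):
      for _ in range(max_depth):  # Bound the number of passes.
          decoded = _BASE64_MESSAGE.sub(_decode_one, text)
          if decoded == text:
              break
          text = decoded
      return text


  print(detokenize_recursively('$pEVTYQkkUmhZam1RPT0='))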
895
896Base64 decoding is supported in C++ or C with the
897``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
898functions.
899
900.. code-block:: cpp
901
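  // Sketch: the exact PrefixedBase64Decode signature is assumed here; see
  // pw_tokenizer/base64.h for the authoritative declaration.
  // HandleBase64Message and ProcessTokenizedMessage are placeholder names.
  void HandleBase64Message(std::string_view base64_message) {
    std::byte decoded[64];
    // Returns the number of decoded bytes written to the buffer.
    size_t decoded_size =
        pw::tokenizer::PrefixedBase64Decode(base64_message, decoded);

    ProcessTokenizedMessage(decoded, decoded_size);
  }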
910
911Command line utilities
912^^^^^^^^^^^^^^^^^^^^^^
913``pw_tokenizer`` provides two standalone command line utilities for detokenizing
914Base64-encoded tokenized strings.
915
916* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
917  stdin.
918* ``detokenize_serial.py`` -- Detokenizes Base64-encoded strings from a
919  connected serial device.
920
921If the ``pw_tokenizer`` Python package is installed, these tools may be executed
922as runnable modules. For example:
923
.. code-block:: sh
925
926  # Detokenize Base64-encoded strings in a file
927  python -m pw_tokenizer.detokenize -i input_file.txt
928
929  # Detokenize Base64-encoded strings in output from a serial device
930  python -m pw_tokenizer.detokenize_serial --device /dev/ttyACM0
931
932See the ``--help`` options for these tools for full usage information.
933
934Deployment war story
935====================
936The tokenizer module was developed to bring tokenized logging to an
937in-development product. The product already had an established text-based
938logging system. Deploying tokenization was straightforward and had substantial
939benefits.
940
941Results
942-------
  * Log contents shrank by over 50%, even with Base64 encoding.
944
945    * Significant size savings for encoded logs, even using the less-efficient
946      Base64 encoding required for compatibility with the existing log system.
947    * Freed valuable communication bandwidth.
948    * Allowed storing many more logs in crash dumps.
949
950  * Substantial flash savings.
951
    * Reduced the size of firmware images by up to 18%.
953
954  * Simpler logging code.
955
956    * Removed CPU-heavy ``snprintf`` calls.
957    * Removed complex code for forwarding log arguments to a low-priority task.
958
959This section describes the tokenizer deployment process and highlights key
960insights.
961
962Firmware deployment
963-------------------
964  * In the project's logging macro, calls to the underlying logging function
965    were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
966    invocation.
967  * The log level was passed as the payload argument to facilitate runtime log
968    level control.
969  * For this project, it was necessary to encode the log messages as text. In
970    ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
971    encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
972    messages.
973  * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.
974
975.. attention::
976  Do not encode line numbers in tokenized strings. This results in a huge
977  number of lines being added to the database, since every time code moves,
978  new strings are tokenized. If line numbers are desired in a tokenized
979  string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.
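
The effect is easy to demonstrate. In the sketch below, CRC-32 stands in for the
tokenizer's real hash function, and the file name and line numbers are made up.
The point is only that a string with an embedded line number yields a new token
each time the line moves, while a ``%d`` argument leaves the token unchanged.

.. code-block:: python

  import zlib


  def token(string):
      # CRC-32 is only a stand-in here; pw_tokenizer uses its own hash function.
      return zlib.crc32(string.encode())


  # The same log statement drifts from line 120 to 121 to 125 as code is edited.
  line_numbers = [120, 121, 125]

  # Line number embedded in the string: every move creates a new database entry.
  embedded = {token(f'sensor.cc:{line} Read failed') for line in line_numbers}

  # Line number passed as a %d argument: the string and its token never change.
  as_argument = {token('sensor.cc:%d Read failed') for _ in line_numbers}

  print(len(embedded), len(as_argument))  # 3 1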
980
981Database management
982-------------------
983  * The token database was stored as a CSV file in the project's Git repo.
984  * The token database was automatically updated as part of the build, and
985    developers were expected to check in the database changes alongside their
986    code changes.
987  * A presubmit check verified that all strings added by a change were added to
988    the token database.
989  * The token database included logs and asserts for all firmware images in the
990    project.
991  * No strings were purged from the token database.
992
993.. tip::
994  Merge conflicts may be a frequent occurrence with an in-source database. If
995  the database is in-source, make sure there is a simple script to resolve any
996  merge conflicts. The script could either keep both sets of lines or discard
997  local changes and regenerate the database.
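
A minimal sketch of the "keep both sets of lines" approach follows.
``keep_both_sides`` is a hypothetical helper; it assumes Git's default two-sided
conflict markers and a CSV database with one entry per line, sorted by token.

.. code-block:: python

  def keep_both_sides(conflicted_text):
      """Drops Git conflict markers, keeping the unique lines from both sides."""
      lines = set()
      for line in conflicted_text.splitlines():
          if line.startswith(('<<<<<<<', '=======', '>>>>>>>')):
              continue  # Conflict marker, not a database entry.
          if line:
              lines.add(line)
      # Entries begin with the token in hex, so sorting lines sorts by token.
      return '\n'.join(sorted(lines)) + '\n'


  conflict = '''\
  1c95bd1c,          ,"Initiating retrieval process for recovery object"
  <<<<<<< HEAD
  2a5388e4,          ,"Determining optimal approach and coordinating vectors"
  =======
  3743540c,          ,"Recovery object retrieval failed with status %s"
  >>>>>>> feature-branch
  '''

  print(keep_both_sides(conflict), end='')

Regenerating the database from build artifacts is the alternative when the
merged file cannot be trusted.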
998
999Decoding tooling deployment
1000---------------------------
1001  * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:
1002
1003      * Product-specific Python command line tools, using
1004        ``pw_tokenizer.Detokenizer``.
1005      * Standalone script for decoding prefixed Base64 tokens in files or
1006        live output (e.g. from ``adb``), using ``detokenize.py``'s command line
1007        interface.
1008
1009  * The C++ detokenizer library was deployed to two Android apps with a Java
1010    Native Interface (JNI) layer.
1011
1012      * The binary token database was included as a raw resource in the APK.
1013      * In one app, the built-in token database could be overridden by copying a
1014        file to the phone.
1015
1016.. tip::
1017  Make the tokenized logging tools simple to use for your project.
1018
1019  * Provide simple wrapper shell scripts that fill in arguments for the
1020    project. For example, point ``detokenize.py`` to the project's token
1021    databases.
  * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
1023    continuously-running tools, so that users don't have to restart the tool
1024    when the token database updates.
1025  * Integrate detokenization everywhere it is needed. Integrating the tools
1026    takes just a few lines of code, and token databases can be embedded in
1027    APKs or binaries.
1028
1029Limitations and future work
1030===========================
1031
1032GCC bug: tokenization in template functions
1033-------------------------------------------
1034GCC incorrectly ignores the section attribute for template
1035`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
1036`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
1037bug, tokenized strings in template functions may be emitted into ``.rodata``
1038instead of the special tokenized string section. This causes two problems:
1039
1040  1. Tokenized strings will not be discovered by the token database tools.
1041  2. Tokenized strings may not be removed from the final binary.
1042
1043clang does **not** have this issue! Use clang to avoid this.
1044
1045It is possible to work around this bug in GCC. One approach would be to tag
1046format strings so that the database tools can find them in ``.rodata``. Then, to
1047remove the strings, compile two binaries: one metadata binary with all tokenized
1048strings and a second, final binary that removes the strings. The strings could
1049be removed by providing the appropriate linker flags or by removing the ``used``
1050attribute from the tokenized string character array declaration.
1051
105264-bit tokenization
1053-------------------
1054The Python and C++ detokenizing libraries currently assume that strings were
1055tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
1056``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
1057device performed the tokenization.
1058
1059Supporting detokenization of strings tokenized on 64-bit targets would be
1060simple. This could be done by adding an option to switch the 32-bit types to
106164-bit. The tokenizer stores the sizes of these types in the
1062``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
1063by checking the ELF file, if necessary.
1064
1065Tokenization in headers
1066-----------------------
1067Tokenizing code in header files (inline functions or templates) may trigger
1068warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
1069is because tokenization requires declaring a character array for each tokenized
1070string. If the tokenized string includes macros that change value, the size of
1071this character array changes, which means the same static variable is defined
1072with different sizes. It should be safe to suppress these warnings, but, when
1073possible, code that tokenizes strings with macros that can change value should
1074be moved to source files rather than headers.
1075
1076Tokenized strings as ``%s`` arguments
1077-------------------------------------
1078Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
1079encoded 1:1, with no tokenization. It would be better to send a tokenized string
1080literal as an integer instead of a string argument, but this is not yet
1081supported.
1082
1083A string token could be sent by marking an integer % argument in a way
1084recognized by the detokenization tools. The detokenizer would expand the
1085argument to the string represented by the integer.
1086
1087.. code-block:: cpp
1088
1089  #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"
1090
1091  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");
1092
1093  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);
1094
1095Strings with arguments could be encoded to a buffer, but since printf strings
1096are null-terminated, a binary encoding would not work. These strings can be
1097prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.
1098
1099Another possibility: encode strings with arguments to a ``uint64_t`` and send
1100them as an integer. This would be efficient and simple, but only support a small
1101number of arguments.
1102
1103Legacy tokenized string ELF format
1104==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries. Strings
1107in different domains were stored in different linker sections. The Python script
1108that parsed the ELF file would re-calculate the tokens.
1109
1110In the current version of ``pw_tokenizer``, tokenized strings are stored in a
1111structured entry containing a token, domain, and length-delimited string. This
1112has several advantages over the legacy format:
1113
1114* The Python script does not have to recalculate the token, so any hash
1115  algorithm may be used in the firmware.
1116* In C++, the tokenization hash no longer has a length limitation.
1117* Strings with null terminators in them are properly handled.
1118* Only one linker section is required in the linker script, instead of a
1119  separate section for each domain.
1120
To migrate to the new format, update the linker sections to match those in
``pw_tokenizer_linker_sections.ld``. Replace all ``pw_tokenized.<DOMAIN>``
sections with one ``pw_tokenizer.entries`` section. The Python tooling continues
to support the legacy tokenized string ELF format.
1125
1126Compatibility
1127=============
1128  * C11
1129  * C++11
1130  * Python 3
1131
1132Dependencies
1133============
1134  * ``pw_varint`` module
1135  * ``pw_preprocessor`` module
1136  * ``pw_span`` module
1137