1====================================== 2XRay Flight Data Recorder Trace Format 3====================================== 4 5:Version: 1 as of 2017-07-20 6 7.. contents:: 8 :local: 9 10 11Introduction 12============ 13 14When gathering XRay traces in Flight Data Recorder mode, each thread of an 15application will claim buffers to fill with trace data, which at some point 16is finalized and flushed. 17 18A goal of the profiler is to minimize overhead, the flushed data directly 19corresponds to the buffer. 20 21This document describes the format of a trace file. 22 23 24General 25======= 26 27Each trace file corresponds to a sequence of events in a particular thread. 28 29The file has a header followed by a sequence of discriminated record types. 30 31The endianness of byte fields matches the endianess of the platform which 32produced the trace file. 33 34 35Header Section 36============== 37 38A trace file begins with a 32 byte header. 39 40+-------------------+-----------------+----------------------------------------+ 41| Field | Size (bytes) | Description | 42+===================+=================+========================================+ 43| version | ``2`` | Anticipates versioned readers. This | 44| | | document describes the format when | 45| | | version == 1 | 46+-------------------+-----------------+----------------------------------------+ 47| type | ``2`` | An enumeration encoding the type of | 48| | | trace. Flight Data Recorder mode | 49| | | traces have type == 1 | 50+-------------------+-----------------+----------------------------------------+ 51| bitfield | ``4`` | Holds parameters that are not aligned | 52| | | to bytes. Further described below. | 53+-------------------+-----------------+----------------------------------------+ 54| cycle_frequency | ``8`` | The frequency in hertz of the CPU | 55| | | oscillator used to measure duration of | 56| | | events in ticks. | 57+-------------------+-----------------+----------------------------------------+ 58| buffer_size | ``8`` | The size in bytes of the data portion | 59| | | of the trace following the header. | 60+-------------------+-----------------+----------------------------------------+ 61| reserved | ``8`` | Reserved for future use. | 62+-------------------+-----------------+----------------------------------------+ 63 64The bitfield parameter of the file header is composed of the following fields. 65 66+-------------------+----------------+-----------------------------------------+ 67| Field | Size (bits) | Description | 68+===================+================+=========================================+ 69| constant_tsc | ``1`` | Whether the platform's timestamp | 70| | | counter used to record ticks between | 71| | | events ticks at a constant frequency | 72| | | despite CPU frequency changes. | 73| | | 0 == non-constant. 1 == constant. | 74+-------------------+----------------+-----------------------------------------+ 75| nonstop_tsc | ``1`` | Whether the tsc continues to count | 76| | | despite whether the CPU is in a low | 77| | | power state. 0 == stop. 1 == non-stop. | 78+-------------------+----------------+-----------------------------------------+ 79| reserved | ``30`` | Not meaningful. | 80+-------------------+----------------+-----------------------------------------+ 81 82 83Data Section 84============ 85 86Following the header in a trace is a data section with size matching the 87buffer_size field in the header. 88 89The data section is a stream of elements of different types. 90 91There are a few categories of data in the sequence. 92 93- ``Function Records``: Function Records contain the timing of entry into and 94 exit from function execution. Function Records have 8 bytes each. 95 96- ``Metadata Records``: Metadata records serve many purposes. Mostly, they 97 capture information that may be too costly to record for each function, but 98 that is required to contextualize the fine-grained timings. They also are used 99 as markers for user-defined Event Data payloads. Metadata records have 16 100 bytes each. 101 102- ``Event Data``: Free form data may be associated with events that are traced 103 by the binary and encode data defined by a handler function. Event data is 104 always preceded with a marker record which indicates how large it is. 105 106- ``Function Arguments``: The arguments to some functions are included in the 107 trace. These are either pointer addresses or primitives that are read and 108 logged independently of their types in a high level language. To the tracer, 109 they are all numbers. Function Records that have attached arguments will 110 indicate their presence on the function entry record. We only support logging 111 contiguous function argument sequences starting with argument zero, which will 112 be the "this" pointer for member function invocations. For example, we don't 113 support logging the first and third argument. 114 115A reader of the memory format must maintain a state machine. The format makes no 116attempt to pad for alignment, and it is not seekable. 117 118 119Function Records 120---------------- 121 122Function Records have an 8 byte layout. This layout encodes information to 123reconstruct a call stack of instrumented function and their durations. 124 125+---------------+--------------+-----------------------------------------------+ 126| Field | Size (bits) | Description | 127+===============+==============+===============================================+ 128| discriminant | ``1`` | Indicates whether a reader should read a | 129| | | Function or Metadata record. Set to ``0`` for | 130| | | Function records. | 131+---------------+--------------+-----------------------------------------------+ 132| action | ``3`` | Specifies whether the function is being | 133| | | entered, exited, or is a non-standard entry | 134| | | or exit produced by optimizations. | 135+---------------+--------------+-----------------------------------------------+ 136| function_id | ``28`` | A numeric ID for the function. Resolved to a | 137| | | name via the xray instrumentation map. The | 138| | | instrumentation map is built by xray at | 139| | | compile time into an object file and pairs | 140| | | the function ids to addresses. It is used for | 141| | | patching and as a lookup into the binary's | 142| | | symbols to obtain names. | 143+---------------+--------------+-----------------------------------------------+ 144| tsc_delta | ``32`` | The number of ticks of the timestamp counter | 145| | | since a previous record recorded a delta or | 146| | | other TSC resetting event. | 147+---------------+--------------+-----------------------------------------------+ 148 149On little-endian machines, the bitfields are ordered from least significant bit 150bit to most significant bit. A reader can read an 8 bit value and apply the mask 151``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned 152shift right by ``0x04`` to obtain the function_id field. 153 154On big-endian machine, the bitfields are written in order from most significant 155bit to least significant bit. A reader would read an 8 bit value and unsigned 156shift right by 7 bits for the discriminant. The function_id field could be 157obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``. 158 159Function action types are as follows. 160 161+---------------+--------------+-----------------------------------------------+ 162| Type | Number | Description | 163+===============+==============+===============================================+ 164| Entry | ``0`` | Typical function entry. | 165+---------------+--------------+-----------------------------------------------+ 166| Exit | ``1`` | Typical function exit. | 167+---------------+--------------+-----------------------------------------------+ 168| Tail_Exit | ``2`` | An exit from a function due to tail call | 169| | | optimization. | 170+---------------+--------------+-----------------------------------------------+ 171| Entry_Args | ``3`` | A function entry that records arguments. | 172+---------------+--------------+-----------------------------------------------+ 173 174Entry_Args records do not contain the arguments themselves. Instead, metadata 175records for each of the logged args follow the function record in the stream. 176 177 178Metadata Records 179---------------- 180 181Interspersed throughout the buffer are 16 byte Metadata records. For typically 182instrumented binaries, they will be sparser than Function records, and they 183provide a fuller picture of the binary execution state. 184 185Metadata record layout is partially record dependent, but they share a common 186structure. 187 188The same bit field rules described for function records apply to the first byte 189of MetadataRecords. Within this byte, little endian machines use lsb to msb 190ordering and big endian machines use msb to lsb ordering. 191 192+---------------+--------------+-----------------------------------------------+ 193| Field | Size | Description | 194+===============+==============+===============================================+ 195| discriminant | ``1 bit`` | Indicates whether a reader should read a | 196| | | Function or Metadata record. Set to ``1`` for | 197| | | Metadata records. | 198+---------------+--------------+-----------------------------------------------+ 199| record_kind | ``7 bits`` | The type of Metadata record. | 200+---------------+--------------+-----------------------------------------------+ 201| data | ``15 bytes`` | A data field used differently for each record | 202| | | type. | 203+---------------+--------------+-----------------------------------------------+ 204 205Here is a table of the enumerated record kinds. 206 207+--------+---------------------------+ 208| Number | Type | 209+========+===========================+ 210| 0 | NewBuffer | 211+--------+---------------------------+ 212| 1 | EndOfBuffer | 213+--------+---------------------------+ 214| 2 | NewCPUId | 215+--------+---------------------------+ 216| 3 | TSCWrap | 217+--------+---------------------------+ 218| 4 | WallTimeMarker | 219+--------+---------------------------+ 220| 5 | CustomEventMarker | 221+--------+---------------------------+ 222| 6 | CallArgument | 223+--------+---------------------------+ 224 225 226NewBuffer Records 227----------------- 228 229Each buffer begins with a NewBuffer record immediately after the header. 230It records the thread ID of the thread that the trace belongs to. 231 232Its data segment is as follows. 233 234+---------------+--------------+-----------------------------------------------+ 235| Field | Size (bytes) | Description | 236+===============+==============+===============================================+ 237| thread_Id | ``2`` | Thread ID for buffer. | 238+---------------+--------------+-----------------------------------------------+ 239| reserved | ``13`` | Unused. | 240+---------------+--------------+-----------------------------------------------+ 241 242 243WallClockTime Records 244--------------------- 245 246Following the NewBuffer record, each buffer records an absolute time as a frame 247of reference for the durations recorded by timestamp counter deltas. 248 249Its data segment is as follows. 250 251+---------------+--------------+-----------------------------------------------+ 252| Field | Size (bytes) | Description | 253+===============+==============+===============================================+ 254| seconds | ``8`` | Seconds on absolute timescale. The starting | 255| | | point is unspecified and depends on the | 256| | | implementation and platform configured by the | 257| | | tracer. | 258+---------------+--------------+-----------------------------------------------+ 259| microseconds | ``4`` | The microsecond component of the time. | 260+---------------+--------------+-----------------------------------------------+ 261| reserved | ``3`` | Unused. | 262+---------------+--------------+-----------------------------------------------+ 263 264 265NewCpuId Records 266---------------- 267 268Each function entry invokes a routine to determine what CPU is executing. 269Typically, this is done with readtscp, which reads the timestamp counter at the 270same time. 271 272If the tracing detects that the execution has switched CPUs or if this is the 273first instrumented entry point, the tracer will output a NewCpuId record. 274 275Its data segment is as follows. 276 277+---------------+--------------+-----------------------------------------------+ 278| Field | Size (bytes) | Description | 279+===============+==============+===============================================+ 280| cpu_id | ``2`` | CPU Id. | 281+---------------+--------------+-----------------------------------------------+ 282| absolute_tsc | ``8`` | The absolute value of the timestamp counter. | 283+---------------+--------------+-----------------------------------------------+ 284| reserved | ``5`` | Unused. | 285+---------------+--------------+-----------------------------------------------+ 286 287 288TSCWrap Records 289--------------- 290 291Since each function record uses a 32 bit value to represent the number of ticks 292of the timestamp counter since the last reference, it is possible for this value 293to overflow, particularly for sparsely instrumented binaries. 294 295When this delta would not fit into a 32 bit representation, a reference absolute 296timestamp counter record is written in the form of a TSCWrap record. 297 298Its data segment is as follows. 299 300+---------------+--------------+-----------------------------------------------+ 301| Field | Size (bytes) | Description | 302+===============+==============+===============================================+ 303| absolute_tsc | ``8`` | Timestamp counter value. | 304+---------------+--------------+-----------------------------------------------+ 305| reserved | ``7`` | Unused. | 306+---------------+--------------+-----------------------------------------------+ 307 308 309CallArgument Records 310-------------------- 311 312Immediately following an Entry_Args type function record, there may be one or 313more CallArgument records that contain the traced function's parameter values. 314 315The order of the CallArgument Record sequency corresponds one to one with the 316order of the function parameters. 317 318CallArgument data segment: 319 320+---------------+--------------+-----------------------------------------------+ 321| Field | Size (bytes) | Description | 322+===============+==============+===============================================+ 323| argument | ``8`` | Numeric argument (may be pointer address). | 324+---------------+--------------+-----------------------------------------------+ 325| reserved | ``7`` | Unused. | 326+---------------+--------------+-----------------------------------------------+ 327 328 329CustomEventMarker Records 330------------------------- 331 332XRay provides the feature of logging custom events. This may be leveraged to 333record tracing info for RPCs or similarly trace data that is application 334specific. 335 336Custom Events themselves are an unstructured (application defined) segment of 337memory with arbitrary size within the buffer. They are preceded by 338CustomEventMarkers to indicate their presence and size. 339 340CustomEventMarker data segment: 341 342+---------------+--------------+-----------------------------------------------+ 343| Field | Size (bytes) | Description | 344+===============+==============+===============================================+ 345| event_size | ``4`` | Size of preceded event. | 346+---------------+--------------+-----------------------------------------------+ 347| absolute_tsc | ``8`` | A timestamp counter of the event. | 348+---------------+--------------+-----------------------------------------------+ 349| reserved | ``3`` | Unused. | 350+---------------+--------------+-----------------------------------------------+ 351 352 353EndOfBuffer Records 354------------------- 355 356An EndOfBuffer record type indicates that there is no more trace data in this 357buffer. The reader is expected to seek past the remaining buffer_size expressed 358before the start of buffer and look for either another header or EOF. 359 360 361Format Grammar and Invariants 362============================= 363 364Not all sequences of Metadata records and Function records are valid data. A 365sequence should be parsed as a state machine. The expectations for a valid 366format can be expressed as a context free grammar. 367 368This is an attempt to explain the format with statements in EBNF format. 369 370- Format := Header ThreadBuffer* EOF 371 372- ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End 373 374- BodySequence := NewCPUId | TSCWrap | Function | CustomEvent 375 376- Function := (Function_Entry_Args CallArgument*) | Function_Other_Type 377 378- CustomEvent := CustomEventMarker CustomEventUnstructuredMemory 379 380- End := EndOfBuffer RemainingBufferSizeToSkip 381 382 383Function Record Order 384--------------------- 385 386There are a few clarifications that may help understand what is expected of 387Function records. 388 389- Functions with an Exit are expected to have a corresponding Entry or 390 Entry_Args function record precede them in the trace. 391 392- Tail_Exit Function records record the Function ID of the function whose return 393 address the program counter will take. In other words, the final function that 394 would be popped off of the call stack if tail call optimization was not used. 395 396- Not all functions marked for instrumentation are necessarily in the trace. The 397 tracer uses heuristics to preserve the trace for non-trivial functions. 398 399- Not every entry must have a traced Exit or Tail Exit. The buffer may run out 400 of space or the program may request for the tracer to finalize toreturn the 401 buffer before an instrumented function exits. 402