AutoFDO and ARM Trace   {#AutoFDO}
=====================

@brief Using CoreSight trace and perf with OpenCSD for AutoFDO.

## Introduction

Feedback directed optimization (FDO, also known as profile guided
optimization - PGO) uses a profile of a program's execution to guide the
optimizations performed by the compiler.  Traditionally, this involves
building an instrumented version of the program, which records a profile of
execution as it runs.  The instrumentation adds significant runtime
overhead, possibly changing the behaviour of the program, and it may not be
possible to run the instrumented program in a production environment
(e.g. where performance criteria must be met).

AutoFDO uses facilities in the hardware to sample the behaviour of the
program in the production environment and generate the execution profile.
An improved profile can be obtained by including the branch history
(i.e. a record of the last branches taken) when generating instruction
samples.  On Arm systems, the ETM can be used to generate such records.

The process can be broken down into the following steps:

* Record execution trace of the program
* Convert the execution trace to instruction samples with branch histories
* Convert the instruction samples to source level profiles
* Use the source level profile with the compiler

This article describes how to enable ETM trace on Arm targets running Linux
and use the ETM trace to generate AutoFDO profiles and compile an optimized
program.


## Execution trace on Arm targets

Debug and trace of Arm targets is provided by CoreSight.  This consists of
a set of components that allow access to debug logic, record (trace) the
execution of a processor and route this data through the system, collecting
it into a store.

To record the execution of a processor, we require the following
components:

* A trace source.  The core contains a trace unit, called an ETM, that emits
  data describing the instructions executed by the core.
* Trace links.  The trace data generated by the ETM must be moved through
  the system to the component that collects the data (sink).  Links
  include:
    * Funnels: merge multiple streams of data
    * FIFOs: buffer data to smooth out bursts
    * Replicators: send a stream of data to multiple components
* Sinks.  These receive the trace data and store it or send it to an
  external device:
    * ETB: A small circular buffer (64-128 kilobytes) that stores the most
      recent data
    * ETR: A larger (several megabytes) buffer that uses system RAM to
      store data
    * TPIU: Sends data to an off-chip capture device (e.g. Arm DSTREAM)

Each Arm SoC design may have a different layout (topology) of components.
This topology is described to the OS drivers by the platform's devicetree
or (in future) ACPI firmware.

For application profiling, we need to store several megabytes of data
within the system, so we will use the ETR, with the capture tool (perf)
periodically draining the buffer to a file.

Even though we have a large capture buffer, the ETM can still generate a
lot of data very quickly - typically an ETM will generate ~1 bit of data
per instruction (depending on the workload), which results in roughly
250Mbytes per second for a core running at 2GHz and executing about one
instruction per cycle.  This leads to problems storing and
decoding such large volumes of data.  AutoFDO uses samples of program
execution, so we can avoid this problem by using the ETM's features to
only record small slices of execution - e.g. collect ~5000 cycles of data
every 50M cycles.  This reduces the data rate to a manageable level - a few
megabytes per minute.  This technique is known as 'strobing'.


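The data rates quoted above can be checked with a few lines of shell
arithmetic.  This is only a rough sketch: the variable names are
illustrative, and it assumes the core executes about one instruction per
cycle.

```shell
# Back-of-the-envelope ETM data rates, using the figures quoted above.
# Assumption: the core executes roughly one instruction per cycle.
freq_hz=2000000000      # 2GHz core clock
bits_per_insn=1         # ~1 bit of trace per instruction
window=5000             # cycles traced in each slice
interval=50000000       # cycles from the start of one slice to the next

# Continuous trace: bytes generated per second.
full_rate=$(( freq_hz * bits_per_insn / 8 ))

# Strobed trace: scale by the fraction of cycles that are actually traced.
strobed_per_min=$(( full_rate * 60 * window / interval ))

echo "continuous: $(( full_rate / 1000000 )) MB/s"
echo "strobed:    $(( strobed_per_min / 1000 )) KB/min"
```

With these inputs the continuous rate comes out at 250 MB/s and the strobed
rate at 1500 KB/min, in line with the approximate figures in the text.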
## Enabling trace

### Driver support

To collect ETM trace, the CoreSight drivers must be included in the
kernel.  Some of the driver support is not yet included in the mainline
kernel and many targets are using older kernels.  To enable CoreSight trace
on these targets, Arm have provided backports of the latest CoreSight
drivers and ETM strobing patch at:

  <https://gitlab.arm.com/linux-arm/linux-coresight-backports>

This repository can be cloned with:

```
git clone https://git.gitlab.arm.com/linux-arm/linux-coresight-backports.git
```

You can include these backports in your kernel by either merging the
appropriate branch using git or generating patches (using `git
format-patch`).

For 5.x based kernels onwards, the only patch that needs to be applied is
the one enabling strobing - etm4x: `Enable strobing of ETM`.

For 4.9 based kernels, use the `coresight-4.9-etr-etm_strobe` branch:

```
git merge coresight-4.9-etr-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.9..coresight-4.9-etr-etm_strobe
cd my_kernel
git am /output/dir/*.patch
# or, if not using git:
#   for p in /output/dir/*.patch; do patch -p1 < "$p"; done
```

For 4.14 based kernels, use the `coresight-4.14-etm_strobe` branch:

```
git merge coresight-4.14-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.14..coresight-4.14-etm_strobe
cd my_kernel
git am /output/dir/*.patch
# or, if not using git:
#   for p in /output/dir/*.patch; do patch -p1 < "$p"; done
```

The CoreSight trace drivers must also be enabled in the kernel
configuration.  This can be done using the configuration menu (`make
menuconfig`), selecting `Kernel hacking` / `arm64 Debugging` /
`CoreSight Tracing Support` and
enabling all options, or by setting the following in the configuration
file:

```
CONFIG_CORESIGHT=y
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
CONFIG_CORESIGHT_SINK_TPIU=y
CONFIG_CORESIGHT_SOURCE_ETM4X=y
CONFIG_CORESIGHT_DYNAMIC_REPLICATOR=y
CONFIG_CORESIGHT_STM=y
CONFIG_CORESIGHT_CATU=y
```

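Before building, it can be useful to confirm that a configuration file
really contains these settings.  The helper below is a minimal sketch;
`check_coresight_config` is an illustrative name and not part of the kernel
tree, and it only accepts built-in (`=y`) options, matching the list above.

```shell
# Report which of the CoreSight options listed above are not enabled in a
# kernel configuration file.  Returns non-zero if any option is missing.
# check_coresight_config is an illustrative helper, not a kernel script.
check_coresight_config() {
    config_file=$1
    rc=0
    for opt in CORESIGHT CORESIGHT_LINK_AND_SINK_TMC CORESIGHT_SINK_TPIU \
               CORESIGHT_SOURCE_ETM4X CORESIGHT_DYNAMIC_REPLICATOR \
               CORESIGHT_STM CORESIGHT_CATU; do
        # Only built-in (=y) counts, as in the configuration shown above.
        if ! grep -q "^CONFIG_${opt}=y\$" "$config_file"; then
            echo "not enabled: CONFIG_${opt}"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check the configuration in the current kernel build directory.
# check_coresight_config .config
```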
Compile the kernel for your target in the usual way, e.g.

```
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

Each target may have a different layout of CoreSight components.  To
collect trace into a sink, the kernel drivers need to know which other
devices need to be configured to route data from the source to the sink.
This is described in the devicetree (and in future, the ACPI tables).  The
devicetree defines which CoreSight devices are present in the system,
where they are located and how they are connected together.  The devicetree
for some platforms includes a description of the platform's CoreSight
components, but in other cases you may have to ask the platform/SoC vendor
to supply it or create it yourself (see Appendix: Describing CoreSight in
Devicetree).

Once the target has been booted with the devicetree describing the
CoreSight devices, you should find the devices in sysfs:

```
# ls /sys/bus/coresight/devices/
etm0  etm2  etm4  etm6  funnel0  funnel2  funnel4      stm0      tmc_etr0
etm1  etm3  etm5  etm7  funnel1  funnel3  replicator0  tmc_etf0
```

The naming convention for ETM devices differs between kernel versions.
For more information about the naming scheme, see the
[Linux Kernel Documentation](https://www.kernel.org/doc/html/latest/trace/coresight/coresight.html#device-naming-scheme).

If `/sys/bus/coresight/devices/` is empty, check that your kernel
configuration (the `.config` file) enables the CoreSight drivers and their
dependencies, such as the clock.

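A quick check of the sysfs directory can be scripted.  The sketch below
counts the ETM sources and ETR sinks it finds; `coresight_summary` is an
illustrative name, the optional directory argument exists only so the helper
can be tried on a scratch directory, and the name patterns are a guess that
covers both the `etm0` and the older address-based naming styles.

```shell
# Summarise the CoreSight devices the kernel has registered.
# Returns non-zero unless at least one ETM and one ETR are present.
coresight_summary() {
    dir=${1:-/sys/bus/coresight/devices}
    etms=0
    etrs=0
    for dev in "$dir"/*; do
        [ -e "$dev" ] || continue        # directory is empty or missing
        case ${dev##*/} in
            *etm*) etms=$(( etms + 1 )) ;;
            *etr*) etrs=$(( etrs + 1 )) ;;
        esac
    done
    echo "ETM sources: $etms, ETR sinks: $etrs"
    # The capture flow described in this article needs at least one of each.
    [ "$etms" -gt 0 ] && [ "$etrs" -gt 0 ]
}

# Example:
# coresight_summary || echo "check kernel configuration and devicetree"
```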
### Perf tools

The perf tool is used to capture execution trace, configuring the trace
sources to generate trace, routing the data to the sink and collecting the
data from the sink.

Arm recommends using the perf version corresponding to the kernel running
on the target.  This can be built from the same kernel sources with:

```
make -C tools/perf CORESIGHT=1 VF=1 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

When `CORESIGHT=1` is specified, perf is built using the installed OpenCSD
library.  If you are cross compiling, additional setup is required to
ensure the build process links against the correct version of the library.

If the post-processing (`perf inject`) of the captured data is not being
done on the target, then the OpenCSD library is not required for this build
of perf.

Trace is captured by collecting the `cs_etm` event from perf.  The sink
to collect data into is specified as a parameter of this event.  Trace can
also be restricted to user space or kernel space with the `u` or `k`
parameters.  For example:

```
perf record -e cs_etm/@tmc_etr0/u --per-thread -- /bin/ls
```

This records the userspace execution of `/bin/ls` using `tmc_etr0` as the
sink.

## Capturing modes

You can trace a single-threaded program in two different ways:

1. By specifying `--per-thread`.  In this case the CoreSight subsystem
records only the trace of the given program.

2. By NOT specifying `--per-thread`.  In this case CPU-wide tracing is
enabled, and the trace contains both the target program and any other
workloads that were executing on the same CPU.



## Processing trace and profiles

perf is also used to convert the execution trace to an instruction profile.
This requires a different build of perf, using the version of perf from
Linux v4.17 or later, as the trace processing code isn't included in the
driver backports.  Trace decode is provided by the OpenCSD library
(<https://github.com/Linaro/OpenCSD>), v0.9.1 or later.  This is packaged
for Debian testing (install the libopencsd0, libopencsd-dev packages) or
can be compiled from source and installed.

The AutoFDO tool <https://github.com/google/autofdo> is used to convert the
instruction profiles to source profiles for the GCC and clang/llvm
compilers.


## Recording and profiling

Once trace collection using perf is working, we can use it to profile
an application.

The application must be compiled to include sufficient debug information to
map instructions back to source lines.  For GCC, use the `-g1` or `-gmlt`
options.  For clang/llvm, also add the `-fdebug-info-for-profiling` option.

perf identifies the active program or library using the build identifier
stored in the ELF file.  This should be added at link time with the compiler
flag `-Wl,--build-id=sha1`.

The next step is to record the execution trace of the application using the
perf tool.  The ETM strobing should be configured before running the perf
tool.  There are two parameters:

  * window size: A number of CPU cycles (W)
  * period: Trace is enabled for W cycles every _period_ * W cycles.

For example, a typical configuration is to use a window size of 5000 cycles
and a period of 10000 - this will collect 5000 cycles of trace every 50M
cycles.  With these proof-of-concept patches, the strobe parameters are
configured via sysfs - each ETM will have `strobe_window` and
`strobe_period` parameters in `/sys/bus/coresight/devices/<etm>` and
these values will have to be written to each (in a future version, this
will be integrated into the drivers and perf tool).
The `set_strobing.sh` script in this directory [`<opencsd>/decoder/tests/auto-fdo`] automates this process.

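As a rough illustration of what configuring strobing involves, the sketch
below writes the two values to every device that exposes both attributes.
It is not the contents of `set_strobing.sh`; the `configure_strobing` name
and the optional directory argument are assumptions, added so the logic can
be tried against a scratch directory.

```shell
# Write the strobe window and period to every CoreSight device that exposes
# the strobe_window and strobe_period attributes (needs root on a target).
configure_strobing() {
    window=$1
    period=$2
    dir=${3:-/sys/bus/coresight/devices}
    count=0
    for dev in "$dir"/*; do
        # Skip devices (funnels, sinks, ...) without the strobing attributes.
        [ -w "$dev/strobe_window" ] && [ -w "$dev/strobe_period" ] || continue
        echo "$window" > "$dev/strobe_window"
        echo "$period" > "$dev/strobe_period"
        count=$(( count + 1 ))
    done
    echo "configured $count device(s)"
    [ "$count" -gt 0 ]
}

# Example: 5000-cycle windows, one window every 10000 * 5000 cycles.
# configure_strobing 5000 10000
```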
To collect trace from an application using ETM strobing, run:

```
sudo ./set_strobing.sh 5000 10000
perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>
```

The raw trace can be examined using the `perf report` command:

```
perf report -D -i perf.data --stdio
```

Perf must be built from the source tree of your Linux kernel version,
against the OpenCSD library, in order to read and post-process
ETM-gathered samples.
If running `perf report` produces an error like:

```
0x1f8 [0x268]: failed to process type: 70 [Operation not permitted]
Error:
failed to process sample
```
or

```
"file uses a more recent and unsupported ABI (8 bytes extra). incompatible file format".
```

then your perf build is probably not using the OpenCSD library.  Either
compile OpenCSD (v0.9.1 or later) from
[source](https://github.com/Linaro/OpenCSD) and install it, or install the
Debian packages (libopencsd0, libopencsd-dev), then build perf against the
library.


When decoding works, the output looks like this:

```
0x1d370 [0x30]: PERF_RECORD_AUXTRACE size: 0x2003c0  offset: 0  ref: 0x39ba881d145f8639  idx: 0  tid: 4551  cpu: -1

. ... CoreSight ETM Trace data: size 2098112 bytes
        Idx:0; ID:12;   I_ASYNC : Alignment Synchronisation.
        Idx:12; ID:12;  I_TRACE_INFO : Trace Info.; INFO=0x0
        Idx:17; ID:12;  I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
        Idx:48; ID:14;  I_ASYNC : Alignment Synchronisation.
        Idx:60; ID:14;  I_TRACE_INFO : Trace Info.; INFO=0x0
        Idx:65; ID:14;  I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
        Idx:96; ID:14;  I_ASYNC : Alignment Synchronisation.
        Idx:108; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
        Idx:113; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
        Idx:122; ID:14; I_TRACE_ON : Trace On.
        Idx:123; ID:14; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x0000000000407B00; Ctxt: AArch64,EL0, NS;
        Idx:134; ID:14; I_ATOM_F3 : Atom format 3.; ENN
        Idx:135; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
        Idx:136; ID:14; I_ATOM_F5 : Atom format 5.; ENENE
        Idx:137; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
        Idx:138; ID:14; I_ATOM_F3 : Atom format 3.; ENN
        Idx:139; ID:14; I_ATOM_F3 : Atom format 3.; NNE
        Idx:140; ID:14; I_ATOM_F1 : Atom format 1.; E
.....
```

The execution trace is then converted to an instruction profile using
the perf build with trace decode support.  This may be done on a different
machine than that which collected the trace (e.g. when cross compiling for
an embedded target).  In that case, first run `perf buildid-cache` as
described below.  The `perf inject` command
decodes the execution trace and generates periodic instruction samples,
with branch histories:

```
perf inject -i perf.data -o inj.data --itrace=i100000il
```

The `--itrace` option configures the instruction sample behaviour:

* `i100000i` generates an instruction sample every 100000 instructions
  (only instruction count periods are currently supported, future versions
  may support time or cycle count periods)
* `l` includes the branch histories on each sample
* `b` generates a sample on each branch (not used here)

Perf requires the original program binaries to decode the execution trace.
If running the `inject` command on a different system than the trace was
captured on, then the binary and any shared libraries must be added to
perf's cache with:

```
perf buildid-cache -a /path/to/binary_or_library
```

`perf report` can also be used to show the instruction samples:

```
perf report -D -i inj.data --stdio
.......
0x1528 [0x630]: PERF_RECORD_SAMPLE(IP, 0x2): 4551/4551: 0x434b98 period: 3093 addr: 0
... branch stack: nr:64
.....  0: 0000000000434b58 -> 0000000000434b68 0 cycles  P   0
.....  1: 0000000000436a88 -> 0000000000434b4c 0 cycles  P   0
.....  2: 0000000000436a64 -> 0000000000436a78 0 cycles  P   0
.....  3: 00000000004369d0 -> 0000000000436a60 0 cycles  P   0
.....  4: 000000000043693c -> 00000000004369cc 0 cycles  P   0
.....  5: 00000000004368a8 -> 0000000000436928 0 cycles  P   0
.....  6: 000000000042d070 -> 00000000004368a8 0 cycles  P   0
.....  7: 000000000042d108 -> 000000000042d070 0 cycles  P   0
.......
..... 57: 0000000000448ee0 -> 0000000000448f24 0 cycles  P   0
..... 58: 0000000000448ea4 -> 0000000000448ebc 0 cycles  P   0
..... 59: 0000000000448e20 -> 0000000000448e94 0 cycles  P   0
..... 60: 0000000000448da8 -> 0000000000448ddc 0 cycles  P   0
..... 61: 00000000004486f4 -> 0000000000448da8 0 cycles  P   0
..... 62: 00000000004480fc -> 00000000004486d4 0 cycles  P   0
..... 63: 0000000000448658 -> 00000000004480ec 0 cycles  P   0
 ... thread: program1:4551
 ...... dso: /home/root/program1
.......
```

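A dump like the one above can be long, so a quick sanity check is to count
how many samples and branch records it contains.  The following is a sketch
only: `summarise_samples` is an illustrative name, and the patterns are
derived from the excerpt above, so they may need adjusting for other perf
versions.

```shell
# Count instruction samples and branch-stack entries in `perf report -D`
# output read from standard input.
summarise_samples() {
    awk '
        /PERF_RECORD_SAMPLE/ { samples++ }
        # Branch-stack lines look like: ".....  0: <from> -> <to> 0 cycles"
        /^\.+ +[0-9]+: [0-9a-f]+ -> [0-9a-f]+/ { branches++ }
        END { printf "samples: %d, branch entries: %d\n", samples, branches }
    '
}

# Example:
# perf report -D -i inj.data --stdio | summarise_samples
```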
The instruction samples produced by `perf inject` are then passed to the
AutoFDO tool to generate source level profiles for the compiler.  For
clang/LLVM:

```
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
```

And for GCC:

```
create_gcov -binary=/path/to/binary -profile=inj.data -gcov_version=1 -gcov=program.gcov
```

The profiles can be viewed with:

```
llvm-profdata show -sample program.llvmprof
```

Or, for GCC:

```
dump_gcov -gcov_version=1 program.gcov
```

## Using profile in the compiler

The profile produced by the above steps can then be passed to the compiler
to optimize the next build of the program.

For GCC, use the `-fauto-profile` option:

```
gcc -O2 -fauto-profile=program.gcov -o program program.c
```

For Clang, use the `-fprofile-sample-use` option:

```
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
```


## Summary

The basic commands to run an application and create a compiler profile are:

```
sudo ./set_strobing.sh 5000 10000
perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>
perf inject -i perf.data -o inj.data --itrace=i100000il
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
```

For GCC, use `create_gcov` in place of `create_llvm_prof` and
`-fauto-profile` in place of `-fprofile-sample-use`.

## High Level Summary for recording on Arm board and decoding on different host

1. (on Arm board)

        sudo ./set_strobing.sh 5000 10000
        perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>

   If you specify `-N` (`--no-buildid-cache`), perf will just take care of recording the target binary and nothing will be copied.<br>  If you don't specify it, any recorded dynamic library will be copied to ~/.debug on the board.

2. (on Arm board) Run `perf archive`, which saves all the found libraries in a tar file (internally, it looks into the perf.data file and performs a lookup using `perf buildid-list --with-hits`).
3. (on host) Use `scp` to copy perf.data and the .tar file generated by `perf archive`.
4. (on host) Run `tar xvf perf.data.tar.bz2 -C ~/.debug` to populate the buildid-cache.
5. (on host) Double check the setup is correct:

   * (a) `perf buildid-list -i perf.data` lists the build-ids of the binaries and dynamic libraries whose trace has been recorded and saved in perf.data.
   * (b) `perf buildid-cache --list` lists the dynamic libraries in the buildid-cache that will be used by `perf inject`.

   Make sure that the build-ids of the binaries you want to optimize with AutoFDO appear in the output of both (a) and (b).

6. (on host) `perf inject -i perf.data -o inj.data --itrace=i100000il` will check for the dynamic libraries using the buildid inside the buildid-cache and post-process the trace.<br>  buildids have to be the same, otherwise it won't be possible to post-process the trace.

7. (on host) `create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof` takes the output from `perf inject` and transforms it into a format that the compiler can read.
8. (on host) `clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c` to make clang use the produced profile.<br>
   If you are confident enough that your profile is accurate, you can add the `-fprofile-sample-accurate` flag, which will penalize all the callsites without corresponding profile, marking them as cold.

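The comparison in step 5 can be scripted.  The sketch below takes the saved
output of the two commands and prints every recorded entry whose build-id is
not in the cache.  `missing_from_cache` is an illustrative name, and the
helper assumes that both commands print the build-id as the first field of
each line; entries you do not intend to optimize (such as the kernel) can be
ignored.

```shell
# List entries recorded in perf.data whose build-id is absent from the cache.
# Returns non-zero if anything is missing.
missing_from_cache() {
    recorded=$1     # saved output of: perf buildid-list -i perf.data
    cached=$2       # saved output of: perf buildid-cache --list
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen) { print; missing = 1 }
         END { exit missing + 0 }' "$cached" "$recorded"
}

# Example:
# perf buildid-list -i perf.data > recorded.txt
# perf buildid-cache --list > cached.txt
# missing_from_cache recorded.txt cached.txt
```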
If you are using the same host for both building the binary to be traced and re-building it with AutoFDO:

1. You won't need to copy back any dynamic libraries from the board (since you already have them), and can use `--no-buildid-cache` when recording.
2. You have to make sure the relevant dynamic libraries to be optimized are present in the buildid-cache.

You can add a dynamic library manually to the buildid-cache by running:

`perf buildid-cache --add <path/to/library/or/binary> -vvv`

You can check what is currently contained in your buildid-cache by running:

`perf buildid-cache --list`

You can check the buildid of a given binary/dynamic library with:

`file <path/to/dynamic/library>`

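To compare against the build-ids printed by perf, the identifier can be
pulled out of the `file` output.  This is a small sketch, assuming `file`
reports the build-id in its usual `BuildID[sha1]=<hex>` form; the
`extract_build_id` name is illustrative.

```shell
# Print the build-id embedded in `file` output read from standard input.
# Prints nothing if the input carries no BuildID field.
extract_build_id() {
    sed -n 's/.*BuildID\[[a-z0-9]*]=\([0-9a-f]*\).*/\1/p'
}

# Example:
# file /path/to/binary | extract_build_id
```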
## References

* AutoFDO tool: <https://github.com/google/autofdo>
* GCC's wiki on autofdo: <https://gcc.gnu.org/wiki/AutoFDO>, <https://gcc.gnu.org/wiki/AutoFDO/Tutorial>
* Google paper: <https://ai.google/research/pubs/pub45290>
* CoreSight kernel docs: Documentation/trace/coresight.txt


## Appendix: Describing CoreSight in Devicetree


Each component has an entry in the devicetree that describes its:

* type: The `compatible` field defines which driver to use
* location: The `reg` field defines the component's address and size on the bus
* clocks: The `clocks` and `clock-names` fields state which clock provides
  the `apb_pclk` clock.
* connections to other components: The `port` and `ports` fields link the
  component to ports of other components

To create the devicetree, some information about the platform is required:

* The memory address of the CoreSight components.  This is the address in
  the CPU's address space where the CPU can access each CoreSight
  component.
* The connections between the components.

This information can be found in the SoC's reference manual or you may need
to ask the platform/SoC vendor to supply it.

An ETMv4 source is declared with a section like this:

```
	etm0: etm@22040000 {
		compatible = "arm,coresight-etm4x", "arm,primecell";
		reg = <0 0x22040000 0 0x1000>;

		cpu = <&A72_0>;
		clocks = <&soc_smc50mhz>;
		clock-names = "apb_pclk";
		port {
			cluster0_etm0_out_port: endpoint {
				remote-endpoint = <&cluster0_funnel_in_port0>;
			};
		};
	};
```

This describes an ETMv4 attached to core A72_0, located at 0x22040000, with
its output linked to port 0 of a funnel.  The funnel is described with:

```
	funnel@220c0000 { /* cluster0 funnel */
		compatible = "arm,coresight-funnel", "arm,primecell";
		reg = <0 0x220c0000 0 0x1000>;

		clocks = <&soc_smc50mhz>;
		clock-names = "apb_pclk";
		power-domains = <&scpi_devpd 0>;
		ports {
			#address-cells = <1>;
			#size-cells = <0>;

			port@0 {
				reg = <0>;
				cluster0_funnel_out_port: endpoint {
					remote-endpoint = <&main_funnel_in_port0>;
				};
			};

			port@1 {
				reg = <0>;
				cluster0_funnel_in_port0: endpoint {
					slave-mode;
					remote-endpoint = <&cluster0_etm0_out_port>;
				};
			};

			port@2 {
				reg = <1>;
				cluster0_funnel_in_port1: endpoint {
					slave-mode;
					remote-endpoint = <&cluster0_etm1_out_port>;
				};
			};
		};
	};
```

This describes a funnel located at 0x220c0000, receiving data from 2 ETMs
and sending the merged data to another funnel.  We continue describing
components with similar blocks until we reach the sink (an ETR):

```
	etr@20070000 {
		compatible = "arm,coresight-tmc", "arm,primecell";
		reg = <0 0x20070000 0 0x1000>;
		iommus = <&smmu_etr 0>;

		clocks = <&soc_smc50mhz>;
		clock-names = "apb_pclk";
		power-domains = <&scpi_devpd 0>;
		port {
			etr_in_port: endpoint {
				slave-mode;
				remote-endpoint = <&replicator_out_port1>;
			};
		};
	};
```

Full descriptions of the properties of each component can be found in the
Linux source at Documentation/devicetree/bindings/arm/coresight.txt.
The Arm Juno platform's devicetree (arch/arm64/boot/dts/arm) provides an
example CoreSight description.

Many systems include a TPIU for off-chip trace.  While this isn't required
for self-hosted trace, it should still be included in the devicetree.  This
allows the drivers to access it to ensure it is put into a disabled state,
otherwise it may limit the trace bandwidth, causing data loss.
