1Demonstrations of trace.
2
3
4trace probes functions you specify and displays trace messages if a particular
5condition is met. You can control the message format to display function
6arguments and return values.
7
8For example, suppose you want to trace all commands being exec'd across the
9system:
10
11# trace 'sys_execve "%s", arg1'
12PID    COMM         FUNC             -
134402   bash         sys_execve       /usr/bin/man
144411   man          sys_execve       /usr/local/bin/less
154411   man          sys_execve       /usr/bin/less
164410   man          sys_execve       /usr/local/bin/nroff
174410   man          sys_execve       /usr/bin/nroff
184409   man          sys_execve       /usr/local/bin/tbl
194409   man          sys_execve       /usr/bin/tbl
204408   man          sys_execve       /usr/local/bin/preconv
214408   man          sys_execve       /usr/bin/preconv
224415   nroff        sys_execve       /usr/bin/locale
234416   nroff        sys_execve       /usr/bin/groff
244418   groff        sys_execve       /usr/bin/grotty
254417   groff        sys_execve       /usr/bin/troff
26^C
27
28The ::sys_execve syntax specifies that you want an entry probe (which is the
29default), in a kernel function (which is the default) called sys_execve. Next,
30the format string to print is simply "%s", which prints a string. Finally, the
31value to print is the first argument to the sys_execve function, which happens
32to be the command that is exec'd. The above trace was generated by executing
33"man ls" in a separate shell. As you see, man executes a number of additional
34programs to finally display the man page.
35
36Next, suppose you are looking for large reads across the system. Let's trace
37the read system call and inspect the third argument, which is the number of
38bytes to be read:
39
40# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
41PID    COMM         FUNC             -
424490   dd           sys_read         read 1048576 bytes
434490   dd           sys_read         read 1048576 bytes
444490   dd           sys_read         read 1048576 bytes
454490   dd           sys_read         read 1048576 bytes
46^C
47
48During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4".
49The individual reads are visible, with the custom format message printed for
50each read. The parenthesized expression "(arg3 > 20000)" is a filter that is
51evaluated for each invocation of the probe before printing anything.
52
53You can also trace user functions. For example, let's simulate the bashreadline
54script, which attaches to the readline function in bash and prints its return
55value, effectively snooping all bash shell input across the system:
56
57# trace 'r:bash:readline "%s", retval'
58PID    COMM         FUNC             -
592740   bash         readline         echo hi!
602740   bash         readline         man ls
61^C
62
63The special retval keyword stands for the function's return value, and can
64be used only in a retprobe, specified by the 'r' prefix. The next component
65of the probe is the library that contains the desired function. It's OK to
66specify executables too, as long as they can be found in the PATH. Or, you
67can specify the full path to the executable (e.g. "/usr/bin/bash").
68
69Sometimes it can be useful to see where in code the events happen. There are
70flags to print the kernel stack (-K), the user stack (-U) and optionally
71include the virtual address in the stacks as well (-a):
72
73# trace.py -U -a 'r::sys_futex "%d", retval'
74PID     TID     COMM            FUNC             -
75793922  793951  poller          sys_futex        0
76        7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so]
77              627fef folly::FunctionScheduler::run()+0x46f [router]
78        7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21]
79        7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so]
80        7f6c7223fa7d clone+0x6d [libc-2.23.so]
81
82Multiple probes can be combined on the same command line. For example, let's
83trace failed read and write calls on the libc level, and include a time column:
84
85# trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \
86        'r:c:write ((int)retval < 0) "write failed: %d", retval' -T
87TIME     PID    COMM         FUNC             -
8805:31:57 3388   bash         write            write failed: -1
8905:32:00 3388   bash         write            write failed: -1
90^C
91
92Note that the retval variable must be cast to int before comparing to zero.
93The reason is that the default type for argN and retval is an unsigned 64-bit
94integer, which can never be smaller than 0.
95
96trace has also some basic support for kernel tracepoints. For example, let's
97trace the block:block_rq_complete tracepoint and print out the number of sectors
98transferred:
99
100# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T
101TIME     PID    COMM         FUNC             -
10201:23:51 0      swapper/0    block_rq_complete sectors=8
10301:23:55 10017  kworker/u64: block_rq_complete sectors=1
10401:23:55 0      swapper/0    block_rq_complete sectors=8
105^C
106
107To discover the tracepoint structure format (which you can refer to as the "args"
108pointer variable), use the tplist tool. For example:
109
110# tplist -v block:block_rq_complete
111block:block_rq_complete
112    dev_t dev;
113    sector_t sector;
114    unsigned int nr_sector;
115    int errors;
116    char rwbs[8];
117
118This output tells you that you can use "args->dev", "args->sector", etc. in your
119predicate and trace arguments.
120
121
122More and more high-level libraries are instrumented with USDT probe support.
123These probes can be traced by trace just like kernel tracepoints. For example,
124trace new threads being created and their function name, include time column
125and on which CPU it happened:
126
127# trace 'u:pthread:pthread_create "%U", arg3' -T -C
128TIME     CPU PID     TID     COMM            FUNC             -
12913:22:01 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
13013:22:01 5   21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
13113:22:03 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
13213:22:04 15  21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
13313:22:07 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
13413:22:07 4   21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
135^C
136
137The "%U" format specifier tells trace to resolve arg3 as a user-space symbol,
138if possible. Similarly, use "%K" for kernel symbols.
139
140Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's
141trace Ruby methods being called (this requires a version of Ruby built with
142the --enable-dtrace configure flag):
143
144# trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T
145TIME     PID    COMM         FUNC             -
14612:08:43 18420  irb          method__entry    IRB::Context.verbose?
14712:08:43 18420  irb          method__entry    RubyLex.ungetc
14812:08:43 18420  irb          method__entry    RuxyLex.debug?
149^C
150
151In the previous invocation, arg1 and arg2 are the class name and method name
152for the Ruby method being invoked.
153
154You can also trace exported functions from shared libraries, or an imported
155function on the actual executable:
156
157# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
158# tput -V
159
160PID    TID    COMM         FUNC             -
16121720  21720  tput         curses_version   Version=ncurses 6.0.20160709
162^C
163
164
165Occasionally, it can be useful to filter specific strings. For example, you
166might be interested in open() calls that open a specific file:
167
168# trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T
169TIME     PID    COMM         FUNC             -
17001:43:15 10938  cat          open             opening test.txt
17101:43:20 10939  cat          open             opening test.txt
172^C
173
174
175In the preceding example, as well as in many others, readability may be
176improved by providing the function's signature, which names the arguments and
177lets you access structure sub-fields, which is hard with the "arg1", "arg2"
178convention. For example:
179
180# trace 'p:c:open(char *filename) "opening %s", filename'
181PID    TID    COMM         FUNC             -
18217507  17507  cat          open             opening FAQ.txt
183^C
184
185# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
186PID    TID    COMM         FUNC             -
187777    785    automount    SyS_nanosleep    sleep for 500000000 ns
188777    785    automount    SyS_nanosleep    sleep for 500000000 ns
189777    785    automount    SyS_nanosleep    sleep for 500000000 ns
190777    785    automount    SyS_nanosleep    sleep for 500000000 ns
191^C
192
193Remember to use the -I argument include the appropriate header file. We didn't
194need to do that here because `struct timespec` is used internally by the tool,
195so it always includes this header file.
196
197
198As a final example, let's trace open syscalls for a specific process. By
199default, tracing is system-wide, but the -p switch overrides this:
200
201# trace -p 2740 'do_sys_open "%s", arg2' -T
202TIME     PID    COMM         FUNC             -
20305:36:16 15872  ls           do_sys_open      /etc/ld.so.cache
20405:36:16 15872  ls           do_sys_open      /lib64/libselinux.so.1
20505:36:16 15872  ls           do_sys_open      /lib64/libcap.so.2
20605:36:16 15872  ls           do_sys_open      /lib64/libacl.so.1
20705:36:16 15872  ls           do_sys_open      /lib64/libc.so.6
20805:36:16 15872  ls           do_sys_open      /lib64/libpcre.so.1
20905:36:16 15872  ls           do_sys_open      /lib64/libdl.so.2
21005:36:16 15872  ls           do_sys_open      /lib64/libattr.so.1
21105:36:16 15872  ls           do_sys_open      /lib64/libpthread.so.0
21205:36:16 15872  ls           do_sys_open      /usr/lib/locale/locale-archive
21305:36:16 15872  ls           do_sys_open      /home/vagrant
214^C
215
216In this example, we traced the "ls ~" command as it was opening its shared
217libraries and then accessing the /home/vagrant directory listing.
218
219
220Lastly, if a high-frequency event is traced you may overflow the perf ring
221buffer. This shows as "Lost N samples":
222
223# trace sys_open
2245087   5087   pgrep        sys_open
2255087   5087   pgrep        sys_open
2265087   5087   pgrep        sys_open
2275087   5087   pgrep        sys_open
2285087   5087   pgrep        sys_open
229Lost 764896 samples
230Lost 764896 samples
231Lost 764896 samples
232
233The perf ring buffer size can be changed with -b. The unit is size per-CPU buffer
234size and is measured in pages. The value must be a power of two and defaults to
23564 pages.
236
237# trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp
238PID     TID     COMM            FUNC             -
2391855611 1863183 worker          sys_setsockopt   found
240
241In this example we are catching setsockopt syscall to change IPv4 IP_TOS
242value only for the cases where new TOS value is equal to 108. we are using
243STRCMP helper in binary mode (--bin_cmp flag) to compare optval array
244against int value of 108 (parametr of setsockopt call) in hex representation
245(little endian format)
246
247
248
249
250USAGE message:
251
252usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE]
253             [-S] [-M MAX_EVENTS] [-t] [-T] [-K] [-U] [-a] [-I header]
254             probe [probe ...]
255
256Attach to functions and print trace messages.
257
258positional arguments:
259  probe                 probe specifier (see examples)
260
261optional arguments:
262  -h, --help            show this help message and exit
263  -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
264                        number of pages to use for perf_events ring buffer
265                        (default: 64)
266  -p PID, --pid PID     id of the process to trace (optional)
267  -L TID, --tid TID     id of the thread to trace (optional)
268  -v, --verbose         print resulting BPF program code before executing
269  -Z STRING_SIZE, --string-size STRING_SIZE
270                        maximum size to read from strings
271  -S, --include-self    do not filter trace's own pid from the trace
272  -M MAX_EVENTS, --max-events MAX_EVENTS
273                        number of events to print before quitting
274  -t, --timestamp       print timestamp column (offset from trace start)
275  -T, --time            print time column
276  -C, --print_cpu       print CPU id
277  -B, --bin_cmp         allow to use STRCMP with binary values
278  -K, --kernel-stack    output kernel stack trace
279  -U, --user-stack      output user stack trace
280  -a, --address         print virtual address in stacks
281  -I header, --include header
282                        additional header files to include in the BPF program
283                        as either full path, or relative to current working directory,
284                        or relative to default kernel header search path
285
286EXAMPLES:
287
288trace do_sys_open
289        Trace the open syscall and print a default trace message when entered
290trace 'do_sys_open "%s", arg2'
291        Trace the open syscall and print the filename being opened
292trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
293        Trace the read syscall and print a message for reads >20000 bytes
294trace 'r::do_sys_open "%llx", retval'
295        Trace the return from the open syscall and print the return value
296trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
297        Trace the open() call from libc only if the flags (arg2) argument is 42
298trace 'c:malloc "size = %d", arg1'
299        Trace malloc calls and print the size being allocated
300trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
301        Trace the write() call from libc to monitor writes to STDOUT
302trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
303        Trace returns from __kmalloc which returned a null pointer
304trace 'r:c:malloc (retval) "allocated = %x", retval'
305        Trace returns from malloc and print non-NULL allocated buffers
306trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
307        Trace the block_rq_complete kernel tracepoint and print # of tx sectors
308trace 'u:pthread:pthread_create (arg4 != 0)'
309        Trace the USDT probe pthread_create when its 4th argument is non-zero
310trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
311        Trace the nanosleep syscall and print the sleep duration in ns
312trace -I 'linux/fs.h' \
313      'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
314        Trace the uprobe_register inode mapping ops, and the symbol can be found
315        in /proc/kallsyms
316trace -I 'kernel/sched/sched.h' \
317      'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
318        Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
319        in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel
320        package.  So this command needs to run at the kernel source tree root directory
321        so that the added header file can be found by the compiler.
322trace -I 'net/sock.h' \\
323      'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
324        Trace udpv6 sendmsg calls only if socket's destination port is equal
325        to 53 (DNS; 13568 in big endian order)
326trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
327        Trace the number of users accessing the file system of the current task
328"
329