1Demonstrations of trace. 2 3 4trace probes functions you specify and displays trace messages if a particular 5condition is met. You can control the message format to display function 6arguments and return values. 7 8For example, suppose you want to trace all commands being exec'd across the 9system: 10 11# trace 'sys_execve "%s", arg1' 12PID COMM FUNC - 134402 bash sys_execve /usr/bin/man 144411 man sys_execve /usr/local/bin/less 154411 man sys_execve /usr/bin/less 164410 man sys_execve /usr/local/bin/nroff 174410 man sys_execve /usr/bin/nroff 184409 man sys_execve /usr/local/bin/tbl 194409 man sys_execve /usr/bin/tbl 204408 man sys_execve /usr/local/bin/preconv 214408 man sys_execve /usr/bin/preconv 224415 nroff sys_execve /usr/bin/locale 234416 nroff sys_execve /usr/bin/groff 244418 groff sys_execve /usr/bin/grotty 254417 groff sys_execve /usr/bin/troff 26^C 27 28The ::sys_execve syntax specifies that you want an entry probe (which is the 29default), in a kernel function (which is the default) called sys_execve. Next, 30the format string to print is simply "%s", which prints a string. Finally, the 31value to print is the first argument to the sys_execve function, which happens 32to be the command that is exec'd. The above trace was generated by executing 33"man ls" in a separate shell. As you see, man executes a number of additional 34programs to finally display the man page. 35 36Next, suppose you are looking for large reads across the system. Let's trace 37the read system call and inspect the third argument, which is the number of 38bytes to be read: 39 40# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' 41PID COMM FUNC - 424490 dd sys_read read 1048576 bytes 434490 dd sys_read read 1048576 bytes 444490 dd sys_read read 1048576 bytes 454490 dd sys_read read 1048576 bytes 46^C 47 48During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4". 49The individual reads are visible, with the custom format message printed for 50each read. The parenthesized expression "(arg3 > 20000)" is a filter that is 51evaluated for each invocation of the probe before printing anything. 52 53You can also trace user functions. For example, let's simulate the bashreadline 54script, which attaches to the readline function in bash and prints its return 55value, effectively snooping all bash shell input across the system: 56 57# trace 'r:bash:readline "%s", retval' 58PID COMM FUNC - 592740 bash readline echo hi! 602740 bash readline man ls 61^C 62 63The special retval keyword stands for the function's return value, and can 64be used only in a retprobe, specified by the 'r' prefix. The next component 65of the probe is the library that contains the desired function. It's OK to 66specify executables too, as long as they can be found in the PATH. Or, you 67can specify the full path to the executable (e.g. "/usr/bin/bash"). 68 69Sometimes it can be useful to see where in code the events happen. There are 70flags to print the kernel stack (-K), the user stack (-U) and optionally 71include the virtual address in the stacks as well (-a): 72 73# trace.py -U -a 'r::sys_futex "%d", retval' 74PID TID COMM FUNC - 75793922 793951 poller sys_futex 0 76 7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so] 77 627fef folly::FunctionScheduler::run()+0x46f [router] 78 7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21] 79 7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so] 80 7f6c7223fa7d clone+0x6d [libc-2.23.so] 81 82Multiple probes can be combined on the same command line. For example, let's 83trace failed read and write calls on the libc level, and include a time column: 84 85# trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \ 86 'r:c:write ((int)retval < 0) "write failed: %d", retval' -T 87TIME PID COMM FUNC - 8805:31:57 3388 bash write write failed: -1 8905:32:00 3388 bash write write failed: -1 90^C 91 92Note that the retval variable must be cast to int before comparing to zero. 93The reason is that the default type for argN and retval is an unsigned 64-bit 94integer, which can never be smaller than 0. 95 96trace has also some basic support for kernel tracepoints. For example, let's 97trace the block:block_rq_complete tracepoint and print out the number of sectors 98transferred: 99 100# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T 101TIME PID COMM FUNC - 10201:23:51 0 swapper/0 block_rq_complete sectors=8 10301:23:55 10017 kworker/u64: block_rq_complete sectors=1 10401:23:55 0 swapper/0 block_rq_complete sectors=8 105^C 106 107To discover the tracepoint structure format (which you can refer to as the "args" 108pointer variable), use the tplist tool. For example: 109 110# tplist -v block:block_rq_complete 111block:block_rq_complete 112 dev_t dev; 113 sector_t sector; 114 unsigned int nr_sector; 115 int errors; 116 char rwbs[8]; 117 118This output tells you that you can use "args->dev", "args->sector", etc. in your 119predicate and trace arguments. 120 121 122More and more high-level libraries are instrumented with USDT probe support. 123These probes can be traced by trace just like kernel tracepoints. For example, 124trace new threads being created and their function name, include time column 125and on which CPU it happened: 126 127# trace 'u:pthread:pthread_create "%U", arg3' -T -C 128TIME CPU PID TID COMM FUNC - 12913:22:01 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] 13013:22:01 5 21360 21414 osqueryd pthread_create [unknown] [osqueryd] 13113:22:03 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] 13213:22:04 15 21360 21414 osqueryd pthread_create [unknown] [osqueryd] 13313:22:07 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] 13413:22:07 4 21360 21414 osqueryd pthread_create [unknown] [osqueryd] 135^C 136 137The "%U" format specifier tells trace to resolve arg3 as a user-space symbol, 138if possible. Similarly, use "%K" for kernel symbols. 139 140Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's 141trace Ruby methods being called (this requires a version of Ruby built with 142the --enable-dtrace configure flag): 143 144# trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T 145TIME PID COMM FUNC - 14612:08:43 18420 irb method__entry IRB::Context.verbose? 14712:08:43 18420 irb method__entry RubyLex.ungetc 14812:08:43 18420 irb method__entry RuxyLex.debug? 149^C 150 151In the previous invocation, arg1 and arg2 are the class name and method name 152for the Ruby method being invoked. 153 154You can also trace exported functions from shared libraries, or an imported 155function on the actual executable: 156 157# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval' 158# tput -V 159 160PID TID COMM FUNC - 16121720 21720 tput curses_version Version=ncurses 6.0.20160709 162^C 163 164 165Occasionally, it can be useful to filter specific strings. For example, you 166might be interested in open() calls that open a specific file: 167 168# trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T 169TIME PID COMM FUNC - 17001:43:15 10938 cat open opening test.txt 17101:43:20 10939 cat open opening test.txt 172^C 173 174 175In the preceding example, as well as in many others, readability may be 176improved by providing the function's signature, which names the arguments and 177lets you access structure sub-fields, which is hard with the "arg1", "arg2" 178convention. For example: 179 180# trace 'p:c:open(char *filename) "opening %s", filename' 181PID TID COMM FUNC - 18217507 17507 cat open opening FAQ.txt 183^C 184 185# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' 186PID TID COMM FUNC - 187777 785 automount SyS_nanosleep sleep for 500000000 ns 188777 785 automount SyS_nanosleep sleep for 500000000 ns 189777 785 automount SyS_nanosleep sleep for 500000000 ns 190777 785 automount SyS_nanosleep sleep for 500000000 ns 191^C 192 193Remember to use the -I argument include the appropriate header file. We didn't 194need to do that here because `struct timespec` is used internally by the tool, 195so it always includes this header file. 196 197 198As a final example, let's trace open syscalls for a specific process. By 199default, tracing is system-wide, but the -p switch overrides this: 200 201# trace -p 2740 'do_sys_open "%s", arg2' -T 202TIME PID COMM FUNC - 20305:36:16 15872 ls do_sys_open /etc/ld.so.cache 20405:36:16 15872 ls do_sys_open /lib64/libselinux.so.1 20505:36:16 15872 ls do_sys_open /lib64/libcap.so.2 20605:36:16 15872 ls do_sys_open /lib64/libacl.so.1 20705:36:16 15872 ls do_sys_open /lib64/libc.so.6 20805:36:16 15872 ls do_sys_open /lib64/libpcre.so.1 20905:36:16 15872 ls do_sys_open /lib64/libdl.so.2 21005:36:16 15872 ls do_sys_open /lib64/libattr.so.1 21105:36:16 15872 ls do_sys_open /lib64/libpthread.so.0 21205:36:16 15872 ls do_sys_open /usr/lib/locale/locale-archive 21305:36:16 15872 ls do_sys_open /home/vagrant 214^C 215 216In this example, we traced the "ls ~" command as it was opening its shared 217libraries and then accessing the /home/vagrant directory listing. 218 219 220Lastly, if a high-frequency event is traced you may overflow the perf ring 221buffer. This shows as "Lost N samples": 222 223# trace sys_open 2245087 5087 pgrep sys_open 2255087 5087 pgrep sys_open 2265087 5087 pgrep sys_open 2275087 5087 pgrep sys_open 2285087 5087 pgrep sys_open 229Lost 764896 samples 230Lost 764896 samples 231Lost 764896 samples 232 233The perf ring buffer size can be changed with -b. The unit is size per-CPU buffer 234size and is measured in pages. The value must be a power of two and defaults to 23564 pages. 236 237# trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp 238PID TID COMM FUNC - 2391855611 1863183 worker sys_setsockopt found 240 241In this example we are catching setsockopt syscall to change IPv4 IP_TOS 242value only for the cases where new TOS value is equal to 108. we are using 243STRCMP helper in binary mode (--bin_cmp flag) to compare optval array 244against int value of 108 (parametr of setsockopt call) in hex representation 245(little endian format) 246 247 248 249 250USAGE message: 251 252usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE] 253 [-S] [-M MAX_EVENTS] [-t] [-T] [-K] [-U] [-a] [-I header] 254 probe [probe ...] 255 256Attach to functions and print trace messages. 257 258positional arguments: 259 probe probe specifier (see examples) 260 261optional arguments: 262 -h, --help show this help message and exit 263 -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES 264 number of pages to use for perf_events ring buffer 265 (default: 64) 266 -p PID, --pid PID id of the process to trace (optional) 267 -L TID, --tid TID id of the thread to trace (optional) 268 -v, --verbose print resulting BPF program code before executing 269 -Z STRING_SIZE, --string-size STRING_SIZE 270 maximum size to read from strings 271 -S, --include-self do not filter trace's own pid from the trace 272 -M MAX_EVENTS, --max-events MAX_EVENTS 273 number of events to print before quitting 274 -t, --timestamp print timestamp column (offset from trace start) 275 -T, --time print time column 276 -C, --print_cpu print CPU id 277 -B, --bin_cmp allow to use STRCMP with binary values 278 -K, --kernel-stack output kernel stack trace 279 -U, --user-stack output user stack trace 280 -a, --address print virtual address in stacks 281 -I header, --include header 282 additional header files to include in the BPF program 283 as either full path, or relative to current working directory, 284 or relative to default kernel header search path 285 286EXAMPLES: 287 288trace do_sys_open 289 Trace the open syscall and print a default trace message when entered 290trace 'do_sys_open "%s", arg2' 291 Trace the open syscall and print the filename being opened 292trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' 293 Trace the read syscall and print a message for reads >20000 bytes 294trace 'r::do_sys_open "%llx", retval' 295 Trace the return from the open syscall and print the return value 296trace 'c:open (arg2 == 42) "%s %d", arg1, arg2' 297 Trace the open() call from libc only if the flags (arg2) argument is 42 298trace 'c:malloc "size = %d", arg1' 299 Trace malloc calls and print the size being allocated 300trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3' 301 Trace the write() call from libc to monitor writes to STDOUT 302trace 'r::__kmalloc (retval == 0) "kmalloc failed!"' 303 Trace returns from __kmalloc which returned a null pointer 304trace 'r:c:malloc (retval) "allocated = %x", retval' 305 Trace returns from malloc and print non-NULL allocated buffers 306trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' 307 Trace the block_rq_complete kernel tracepoint and print # of tx sectors 308trace 'u:pthread:pthread_create (arg4 != 0)' 309 Trace the USDT probe pthread_create when its 4th argument is non-zero 310trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' 311 Trace the nanosleep syscall and print the sleep duration in ns 312trace -I 'linux/fs.h' \ 313 'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops' 314 Trace the uprobe_register inode mapping ops, and the symbol can be found 315 in /proc/kallsyms 316trace -I 'kernel/sched/sched.h' \ 317 'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining' 318 Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined 319 in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel 320 package. So this command needs to run at the kernel source tree root directory 321 so that the added header file can be found by the compiler. 322trace -I 'net/sock.h' \\ 323 'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)' 324 Trace udpv6 sendmsg calls only if socket's destination port is equal 325 to 53 (DNS; 13568 in big endian order) 326trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users' 327 Trace the number of users accessing the file system of the current task 328" 329