Skip to content

Latest commit

 

History

History
637 lines (459 loc) · 24.8 KB

File metadata and controls

637 lines (459 loc) · 24.8 KB

BPF Tracing - Observability of System Performance

1. Overview

eBPF (The Extended Berkeley Packet Filter) is an in-kernel virtual machine that allows code execution in the kernel space, in a restricted sandbox environment with access to a limited set of functions. There are numerous components shipped by Red Hat that utilize the eBPF virtual machine. Each component is in a different development phase, and thus not all components are currently fully supported. Those that are not supported are available as a Technology Preview and are not intended for production use.For information on Red Hat scope of support for Technology Preview features, see: Technology Preview Features Support Scope

In this unit, we will be focusing on the kernel tracing offering shipped as bcc-tools (BPF Compiler Collection Tools). There are just over 100 tools shipped in this package. A few things of note:

  • All of these tools live in /usr/share/bcc/tools.

  • These tools must run as the root user as any eBPF program can read kernel data. As such, injecting eBPF bytecode as a regular user is not allowed in RHEL 8.

  • Each tool has a man page to view the man page, run man <tool name>. These man pages include descriptions of the tools, provide the options that can be called and have information on the expected overhead of the specific tool.

2. Getting Started

For these exercises, you will primarily be using the host node2 as user root. However, these exercises do REQUIRE a second terminal session to run commands on other hosts. Please pay careful attention to what commands to run on different hosts.

💡
Now is a great time to use the multi session capability of tmux. Use Ctrl-b " to create another session with a split screen. Cycle the active session back and forth with CTRL-b n (next) and CTRL-b p (previous).

From host bastion, ssh to node2.

ssh node2

Use sudo to elevate your priviledges.

sudo -i

Verify that you are on the right host for these exercises.

workshop-ebpf-checkhost.sh

You are now ready to proceed with these exercises.

3. Installation

Start by installing the BPF Compiler Collection (bcc-tools) and kernel-devel packages for the installed kernel:

yum install bcc-tools kernel-devel-$(uname -r) -y

Now, we have a lot of interesting tools installed in /usr/share/bcc/tools along with accompanying man pages.

4. bcc-tools: tcplife

The tcplife tool monitors tcp connections on the system and measures the amount of time that they’ve been open, the local and remote ports involved in the connection and the amount of data transferred and received in kilobytes.

Primary Terminal

To run this tool, execute the following and give it a couple of seconds to get set up:

/usr/share/bcc/tools/tcplife
PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS

Secondary Terminal

In your second terminal, establish yourself as described below and then run the following command:

  • host: bastion

  • user: root

ssh node2 "sudo workshop-ebpf-rootkit.sh"

This command should run for about 10 seconds and then exit.

Primary Terminal

In the tcplife terminal, you should now see output similar to:

PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
28478 sshd       10.0.0.12       22    10.0.0.10       33514    14     2 9091.95

Doing the math with my numbers above, my session was open for about 9 seconds and was initiated by IP 10.0.0.10, the bastion.

In the tcplife terminal, issue a Ctrl-C and this will return you to a prompt.

5. bcc-tools: execsnoop

The execsnoop script monitors all calls to execve() and catches all processes that follow the fork→exec sequence, as well as processes that re-exec() themselves. Processes that fork() but do not exec() won’t be caught by this script.

Primary Terminal

Continuing on host node2, to run execsnoop as follows:

/usr/share/bcc/tools/execsnoop
PCOMM            PID    PPID   RET ARGS

Secondary Terminal

In your second terminal, establish yourself as described below and then run the following command:

  • host: bastion

  • user: root

ssh node2 "sudo workshop-ebpf-rootkit.sh"

Primary Terminal

In the execsnoop terminal, you should see output similar to:

PCOMM            PID    PPID   RET ARGS
sshd             28512  749      0 /usr/sbin/sshd -D [email protected],[email protected],aes256-ctr,aes256-cbc,[email protected],aes128-ctr,aes128-cb [email protected],[email protected],[email protected],[email protected],hmac-sha2- -oGSSAPIKexAlgorithms=gss-gex-sha1-,gss-group14-sha1- [email protected],ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-excha -oHostKeyAlgorithms=rsa-sha2-256,ecdsa-sha2-nistp256,[email protected],ecdsa-sha2-nistp384,ecdsa-sha2-nis -oPubkeyAcceptedKeyTypes=rsa-sha2-256,ecdsa-sha2-nistp256,[email protected],ecdsa-sha2-nistp384,ecdsa-sha -R
unix_chkpwd      28514  28512    0 /usr/sbin/unix_chkpwd root chkexpiry
bash             28516  28515    0 /bin/bash -c workshop-ebpf-rootkit.sh
grepconf.sh      28517  28516    0 /usr/libexec/grepconf.sh -c
grep             28518  28517    0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS
grepconf.sh      28519  28516    0 /usr/libexec/grepconf.sh -c
grep             28520  28519    0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS
grepconf.sh      28521  28516    0 /usr/libexec/grepconf.sh -c
grep             28522  28521    0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS
sed              28524  28523    0 /usr/bin/sed -r -e s/^[[:blank:]]*([[:upper:]_]+)=([[:print:][:digit:]\._-]+|"[[:print:][:digit:]\._-]+")/export \1=\2/;t;d /etc/locale.conf
uname            28525  28516    0 /usr/bin/uname -a
sleep            28526  28516    0 /usr/bin/sleep 1
who              28527  28516    0 /usr/bin/who
sleep            28528  28516    0 /usr/bin/sleep 1
grep             28530  28516    0 /usr/bin/grep root /etc/passwd
sleep            28531  28516    0 /usr/bin/sleep 1
grep             28532  28516    0 /usr/bin/grep root /etc/shadow
sleep            28533  28516    0 /usr/bin/sleep 1
cat              28534  28516    0 /usr/bin/cat /etc/fstab
sleep            28535  28516    0 /usr/bin/sleep 1
ps               28536  28516    0 /usr/bin/ps -ef
sleep            28537  28516    0 /usr/bin/sleep 1
netstat          28538  28516    0 /usr/bin/netstat -tulpn
sleep            28539  28516    0 /usr/bin/sleep 1
getenforce       28540  28516    0 /usr/sbin/getenforce
sleep            28541  28516    0 /usr/bin/sleep 1
firewall-cmd     28542  28516    0 /usr/bin/firewall-cmd --state

This shows you all the processes that ran exec() during that ssh login, their PID, their parent PID, their return code, and the arguments that were sent to the process. You could keep monitoring this for quite some time to catch potential bad actors on the system.

In the execsnoop terminal, issue a Ctrl-C and this will return you to a prompt.

6. bcc-tools: mountsnoop

Similar in nature to execsnoop, mountsnoop traces the mount() and umount() syscalls which show processes that are attempting to mount (or unmount) filesystems.

Primary Terminal

To run this tool, execute the following and give it a couple of seconds to get set up:

/usr/share/bcc/tools/mountsnoop
COMM             PID     TID     MNT_NS      CALL

Secondary Terminal

PAY ATTENTION!!! USING A DIFFERENT HOST

In your second terminal, establish yourself as described below and then run the following command:

  • host: node2

  • user: root

Let’s try to unmount a mount that we know cannot be unmounted. For this, we’ll pick the root filesystem '/'.

umount /
umount: /: target is busy.

Primary Terminal

Taking a look at the terminal running mountsnoop, we see:

umount           20001   20001   4026531840  umount("/", 0x0) = -EBUSY

This shows us that the mount is busy and cannot be unmounted.

Secondary Terminal

Now let’s try to unmount a filesystem that we should be able to unmount. But before doing so, look at the mount options to ensure we can restore it correctly. On node2 run the following:

grep /dev/shm /proc/mounts
tmpfs /dev/shm tmpfs *rw,seclabel,nosuid,nodev,relatime* 0 0

Now proceed to umount /dev/shm on node2

umount /dev/shm

Primary Terminal

Back to the mountsnoop terminal and you should see the following:

umount           20003   20003   4026531840  umount("/dev/shm", 0x0) = 0

The umount command succeeded.

Secondary Terminal

Proceed to restore the /dev/shm mount as follows:

mount -t tmpfs -o rw,nosuid,nodev,relatime,seclabel tmpfs /dev/shm

Primary Terminal

Finally, back to the mountsnoop terminal and you should see the following:

mount            20004   20004   4026531840  mount("tmpfs", "/dev/shm", "tmpfs", MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_SYNCHRONOUS|MS_DIRSYNC|MS_NOATIME|MS_NODIRATIME|MS_MOVE|MS_REC|MS_UNBINDABLE|MS_SLAVE|MS_SHARED|MS_I_VERSION|MS_STRICTATIME|MS_LAZYTIME|MS_NOUSER|0x7f2b30000000, "") = 0

This shows us that the mount succeeded and all the options that were passed into the system call.

As you can see, the mountsnoop tool is very useful for seeing what processes are calling the mount and umount system calls and what the results of those calls are.

In the mountsnoop terminal, issue a Ctrl-C and this will return you to a prompt.

7. bcc-tools: xfsslower

⚠️
Please verify the filesystem your host is using with the command df -T /. If your host is configured with ext4, then substitute the command ext4slower in place of xfsslower.

The purpose of the xfsslower tool (also ext4slower and nfsslower) is to show you filesystem operations slower than a particular threshold, that defaults to 10ms. It traces reads, writes, opens, and syncs and then prints out the timestamp of the operation, the process name, the ID, the type of operation, the file offset in kilobytes, the latency of the I/O measured from when it was issued by VFS to the filesystem to when it was completed, and finally, the filename being operated on.

Primary Terminal

To run this tool, execute the following and give it a couple of seconds to get set up:

/usr/share/bcc/tools/xfsslower
Tracing xfs operations slower than 10 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME

Secondary Terminal

In your second terminal, establish yourself as described below and then run the following command:

  • host: node2

  • user: root

dd if=/dev/urandom of=bigfile bs=1024 count=20000

This writes out a 20M file called bigfile and should not register on your xfsslower window.

Now, let’s execute the above command in a for loop so that we get more I/O going in parallel:

for i in $(seq 1 10); do dd if=/dev/urandom of=bigfile$i bs=1024 count=20000 & done

Primary Terminal

Now you should see similar output in your xfsslower window:

TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
20:44:43 b'dd'          32446  W 1024    778        44.11 b'bigfile1'
20:44:43 b'dd'          32455  W 1024    818        55.11 b'bigfile10'
20:44:43 b'dd'          32452  W 1024    1712       44.11 b'bigfile7'
20:44:43 b'dd'          32455  W 1024    1778       55.02 b'bigfile10'
20:44:43 b'dd'          32451  W 1024    2850       44.11 b'bigfile6'
20:44:43 b'dd'          32447  W 1024    3598       44.10 b'bigfile2'
20:44:43 b'dd'          32451  W 1024    3805       55.11 b'bigfile6'
20:44:43 b'dd'          32446  W 1024    4612       44.28 b'bigfile1'
20:44:43 b'dd'          32446  W 1024    5529       33.01 b'bigfile1'
20:44:43 b'dd'          32454  W 1024    4504       55.11 b'bigfile9'
20:44:43 b'dd'          32447  W 1024    7335       44.10 b'bigfile2'
20:44:43 b'dd'          32455  W 1024    7545       44.02 b'bigfile10'
20:44:43 b'dd'          32446  W 1024    8344       49.16 b'bigfile1'
20:44:43 b'dd'          32448  W 1024    8183       44.18 b'bigfile3'
20:44:43 b'dd'          32447  W 1024    9168       55.10 b'bigfile2'
20:44:43 b'dd'          32449  W 1024    9728       54.10 b'bigfile4'
20:44:43 b'dd'          32454  W 1024    10244      33.11 b'bigfile9'
20:44:43 b'dd'          32447  W 1024    10989      55.02 b'bigfile2'
20:44:43 b'dd'          32453  W 1024    11276      54.10 b'bigfile8'
20:44:43 b'dd'          32453  W 1024    12169      33.10 b'bigfile8'
20:44:43 b'dd'          32451  W 1024    13292      91.11 b'bigfile6'
20:44:43 b'dd'          32453  W 1024    13108      47.24 b'bigfile8'
20:44:43 b'dd'          32448  W 1024    13788      44.01 b'bigfile3'
20:44:43 b'dd'          32454  W 1024    14137      44.23 b'bigfile9'
20:44:43 b'dd'          32446  W 1024    16076      44.02 b'bigfile1'
20:44:43 b'dd'          32447  W 1024    15796      44.26 b'bigfile2'
20:44:44 b'dd'          32446  W 1024    17004      44.10 b'bigfile1'
20:44:44 b'dd'          32455  W 1024    16697      44.16 b'bigfile10'
20:44:44 b'dd'          32450  W 1024    18505      44.01 b'bigfile5'
20:44:44 b'dd'          32451  W 1024    19056      44.17 b'bigfile6'
20:44:44 b'dd'          32446  W 1024    19868      44.38 b'bigfile1'
20:44:44 b'dd'          32452  W 1024    19272      44.14 b'bigfile7'
20:44:44 b'dd'          32455  W 1024    19168      30.75 b'bigfile10'
20:44:44 b'dd'          32453  W 1024    19612      31.16 b'bigfile8'
20:44:44 b'dd'          32454  W 1024    19460      24.59 b'bigfile9'
20:44:44 b'dd'          32447  W 1024    19508      36.20 b'bigfile2'

So we can see that when writing these files in parallel, we have xfs operations taking longer than 10ms to complete.

In the xfsslower terminal, issue a Ctrl-C and this will return you to a prompt.

8. bcc-tools: cachestat

The cachestat tool traces kernel page cache functions and prints every five second summaries to aid you in workload characterization.

Primary Terminal

To run this tool, execute the following and give it a couple of seconds to get set up:

/usr/share/bcc/tools/cachestat
   TOTAL   MISSES     HITS  DIRTIES   BUFFERS_MB  CACHED_MB

Secondary Terminal

In your second terminal, establish yourself as described below and then run the following command:

  • host: node2

  • user: root

echo 3 > /proc/sys/vm/drop_caches
for i in $(seq 1 10); do dd if=/dev/urandom of=bigfile$i bs=1024 count=20000 & done

This flushes the cache and then runs a series of dd commands to create some I/O.

Primary Terminal

In the cachestat window, you should output similar to:

   TOTAL   MISSES     HITS  DIRTIES   BUFFERS_MB  CACHED_MB
       0        0        0        0            0        154
   14773      901    13872    44133            1        200

This shows that we had 901 page cache misses during a five second period while running the above loop, but during that same second, there were 9,821 hits, indicating great performance from the page cache.

In the cachestat terminal, issue a Ctrl-C and this will return you to a prompt.

9. bcc-tools: trace

This tool is a swiss army knife allowing you to specify functions to trace and messages to be printed when certain conditions are met. You can read more about this by running:

man 8 trace

Let’s do a simple trace in which we will dynamically trace the do_sys_open() kernel function and print the names of the files opened.

Primary Terminal

To run this tool, execute the following and give it a couple of seconds to get set up:

/usr/share/bcc/tools/trace 'p::do_sys_open "%s", arg2'
PID     TID     COMM            FUNC             -

Secondary Terminal

In your second terminal, establish yourself as described below and then run the following command:

  • host: node2

  • user: root

cat /etc/fstab

Primary Terminal

In the trace terminal, you will see something similar to:

32559   32559   cat             do_sys_open      b'/etc/ld.so.cache'
32559   32559   cat             do_sys_open      b'/lib64/libc.so.6'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/locale-archive'
32559   32559   cat             do_sys_open      b'/usr/share/locale/locale.alias'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_IDENTIFICATION'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION'
32559   32559   cat             do_sys_open      b'/usr/lib64/gconv/gconv-modules.cache'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_MEASUREMENT'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_MEASUREMENT'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_TELEPHONE'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_TELEPHONE'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_ADDRESS'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_ADDRESS'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_NAME'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_NAME'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_PAPER'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_PAPER'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_MESSAGES'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_MESSAGES'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_MONETARY'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_MONETARY'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_COLLATE'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_COLLATE'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_TIME'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_TIME'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_NUMERIC'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_NUMERIC'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.UTF-8/LC_CTYPE'
32559   32559   cat             do_sys_open      b'/usr/lib/locale/en_US.utf8/LC_CTYPE'
32559   32559   cat             do_sys_open      b'/etc/fstab'

Go ahead and Ctrl-C trace and then let’s do one more trace, this time, tracing the return values out of trace:

/usr/share/bcc/tools/trace 'r::do_sys_open "ret: %d", retval'

Secondary Terminal

cat /etc/fstab
PID     TID     COMM            FUNC             -

Primary Terminal

In the trace terminal, you will see something similar to:

PID     TID     COMM            FUNC             -
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: -2
32576   32576   cat             do_sys_open      ret: 3
32576   32576   cat             do_sys_open      ret: 3

Go to the terminal with trace running and issue a Ctrl-C. This will end the process and return you to the command line.

You may also now type exit in one of your tmux panes to return to having just one pane visible.

There is a lot more that you can do with this tool when you actually need to start tracing what is getting passed into kernel functions and what is being returned by those kernel functions.