
eBPF: Tracing Kernel Events for IRIS Workloads

 

I attended Cloud Native Security Con in Seattle with every intention of crushing OTel Day, then perusing security as applied to Cloud Native workloads over the following days leading up to the CTF, all as a professional exercise. That plan was happily upended by a new understanding of eBPF, which gave my screens, career, workloads, and attitude a much-needed upgrade, with new approaches to solving workload problems.

So I made it to the eBPF party and have been attending clinic after clinic on the subject ever since. Here I would like to "unbox" eBPF as a technical solution, map it directly to what we do in practice (even if it's a bit off), and step through my experimentation with eBPF in support of InterSystems IRIS workloads, particularly on Kubernetes, though much of it applies to standalone workloads as well.

eBee Steps with eBPF and InterSystems IRIS Workloads


 

eBPF

eBPF (extended Berkeley Packet Filter) is a killer Linux kernel feature that implements a VM within kernel space, making it possible to run sandboxed programs safely, with guardrails. These programs can "map" data into user space for observability, tracing, security, and networking. I think of it as a "sniffer" of the OS: where classic BPF was associated with networking, the extended version "sniffs" tracepoints, processes, scheduling, execution, and block device access. If you didn't buy my analogy of eBPF, here is one from the pros:

"What JavaScript is to the browser, eBPF is to the Linux Kernel"

JavaScript lets you attach callbacks to events in the DOM in order to bring dynamic features to your web page. In a similar fashion, eBPF allows you to hook into kernel events and extend their logic when those events are triggered!
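To make the analogy concrete, here is about the smallest "callback" I can write with bcc (a sketch of my own, assuming the bcc toolkit we install later in this article): hook a kernel event and fire some logic every time it triggers, the way you would attach an onclick handler in the DOM.

# Minimal bcc sketch: attach a "callback" to a kernel event.
# Here the event is entry into the clone() syscall (a process fork).
# Run with: sudo python3 hello_event.py
from bcc import BPF

PROG = r"""
int hello(void *ctx) {
    bpf_trace_printk("event fired\n");
    return 0;
}
"""

b = BPF(text=PROG)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()  # CTRL-C to stop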

Immediately Applicable

IF the following Prometheus metric seems impossible to you, employ eBPF to watch the processes that are supposed to be there and monitor in band through the kernel, as sketched just below the metric.

# HELP iris_instance_status The thing that's down telling us it's down.
# TYPE iris_instance_status gauge
iris_instance_status 0
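Here is a sketch of what that could look like (my own illustration, not a shipped exporter; it assumes bcc plus the prometheus_client package, and the port number is arbitrary): watch process exits in-kernel and flip the gauge the moment an iris process dies, instead of asking the thing that's down to tell you it's down.

# Sketch: in-band liveness via the kernel. Trace sched_process_exit
# and drop the iris_instance_status gauge when an iris process exits.
# Assumes bcc and prometheus_client are installed; run with sudo.
from bcc import BPF
from prometheus_client import Gauge, start_http_server

BPF_SOURCE_CODE = r"""
TRACEPOINT_PROBE(sched, sched_process_exit) {
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("exit: %s\n", comm);
    return 0;
}
"""

bpf = BPF(text=BPF_SOURCE_CODE)

status = Gauge("iris_instance_status", "IRIS liveness observed via eBPF")
status.set(1)
start_http_server(9101)  # arbitrary scrape port for this sketch

while True:
    try:
        (task, pid, cpu, flags, ts, msg) = bpf.trace_fields()
        if b"iris" in msg:
            status.set(0)  # an iris process just exited
    except ValueError:
        continue
    except KeyboardInterrupt:
        break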
    

IF you are tired of begging for the following resources for a sidecar just to get needed observability: goodbye, sidecars.

  iris-sidecar:    
    resources:
      requests:
        memory: "2Mi"
        cpu: "125m"

Where

One of the most satisfying things about how eBPF is applied is where it runs... in a VM, inside the kernel. And thanks to Linux namespacing, you can guess how powerful that is in a cloud native environment, let alone on a kernel sitting under some sort of virtualization, or on a big iron machine with admirable hardware.

 

Obligatory Hello World

For those of you who like to try things for yourselves, and from the "beginning" so to speak, I salute you with an obligatory Hello World, twisted to be a tad bit "irisy." That said, it's generally understood that programming eBPF directly won't become a frequently exercised skill; it stays concentrated among those doing Linux kernel development or building next-generation monitoring tools.

I run Pop!_OS/Ubuntu, and here is my cheat code for getting into the eBPF world quickly on 23.04:

sudo apt install -y zip bison build-essential cmake flex git libedit-dev \
  libllvm15 llvm-15-dev libclang-15-dev python3 zlib1g-dev libelf-dev libfl-dev python3-setuptools \
  liblzma-dev libdebuginfod-dev arping netperf iperf libpolly-15-dev
git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake ..
make
sudo make install
cmake -DPYTHON_CMD=python3 .. # build python3 binding
pushd src/python/
make
sudo make install
popd
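With the build done, a quick smoke test of my own (not part of the upstream instructions) confirms the Python binding landed and the toolchain can compile a program:

# Smoke test: compile a trivial eBPF program to confirm bcc and its
# Python binding are installed. Run with: sudo python3 smoketest.py
from bcc import BPF

BPF(text="int noop(void *ctx) { return 0; }")
print("bcc is ready")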

Next, ensure the target kernel was built with the required eBPF support...

cat /boot/config-$(uname -r) | grep 'CONFIG_BPF'
CONFIG_BPF=y

If `CONFIG_BPF=y` shows up in the output, we are good to go.
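CONFIG_BPF is not the only option in play; if you would like to check the related ones in one shot, here is a small convenience sketch of mine (same information as the grep above):

# Check the kernel config for the eBPF options the examples rely on:
# CONFIG_BPF_SYSCALL is the bpf() syscall itself, and CONFIG_BPF_EVENTS
# allows attaching programs to tracepoints.
import os

config_path = f"/boot/config-{os.uname().release}"
wanted = ("CONFIG_BPF=", "CONFIG_BPF_SYSCALL=", "CONFIG_BPF_EVENTS=")

with open(config_path) as f:
    for line in f:
        if line.startswith(wanted):
            print(line, end="")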

What we want to accomplish with this simple Hello World is to get visibility into when IRIS makes Linux system calls, using nothing but eBPF tooling and the kernel itself.

Here is a good way to go about exploration:

1️⃣ Find a Linux System Call of Interest

sudo ls /sys/kernel/debug/tracing/events/syscalls

For this example, we are going to trap directory creation through the tracepoint `sys_enter_mkdir`, then filter it down to IRIS.
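How do we know the probe will have a pathname argument to print? Every tracepoint publishes its argument layout in a format file; a quick peek (root required, like everything else here) shows the fields that land in args-> inside a TRACEPOINT_PROBE:

# Print the argument layout of the sys_enter_mkdir tracepoint.
# The "pathname" field listed here is what the probe below reads
# as args->pathname. Run with sudo.
FORMAT = "/sys/kernel/debug/tracing/events/syscalls/sys_enter_mkdir/format"

with open(FORMAT) as f:
    print(f.read())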



2️⃣ Insert it into the Following Hello World

Your BPF program to load and run lives in the variable BPF_SOURCE_CODE; modify it to include the syscall you want to trap.

# Example eBPF program attached to a Linux kernel tracepoint
# Modified to trap irisdb
# Requires bpfcc-tools
# To run: sudo python3 irismadeadir.py
from bcc import BPF
from bcc.utils import printb

BPF_SOURCE_CODE = r"""
TRACEPOINT_PROBE(syscalls, sys_enter_mkdir) {
    bpf_trace_printk("Directory was created by IRIS: %s\n", args->pathname);
    return 0;
}
"""
bpf = BPF(text=BPF_SOURCE_CODE)

print("Go create a dir with IRIS...")
print("CTRL-C to exit")

while True:
    try:
        (task, pid, cpu, flags, ts, msg) = bpf.trace_fields()
        #print(task.decode("utf-8"))
        if "iris" in task.decode("utf-8"):
            printb(b"%s-%-6d %s" % (task, pid, msg))
    except ValueError:
        continue
    except KeyboardInterrupt:
        break

 

3️⃣ Load into the Kernel, Run

Create a dir in IRIS

Inspect the trace!
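If you would like to sanity-check the probe before involving IRIS, you can trigger the same syscall yourself; a minimal sketch follows (note that the tracer above filters for task names containing "iris", so relax that filter while testing):

# Trigger the mkdir syscall outside of IRIS to verify the probe fires.
# The task name will be "python3", not "iris", so loosen the filter
# in the tracer while testing.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "ebpf-hello-world")
os.mkdir(path)   # fires sys_enter_mkdir
os.rmdir(path)   # clean up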

eBPF Powered Binaries

It doesn't take long going through the bcc repository to realize there are plenty of examples, tools, and binaries out there that take advantage of eBPF to do fun tracing, and "grep" in this case will suffice to derive some value.

 

Let's do just that on a start and stop of IRIS with some supplied examples.

execsnoop: Trace new processes via exec() syscalls.

This one tells the tale of the arguments passed to irisdb on start/stop.

 
sudo python3 execsnoop.py | grep iris

statsnoop: Trace stat() syscalls... stat() returns file attributes for an inode, so tracing it reveals file/dir access.

This one is informative at the directory and file level during a start/stop... a bit chatty, but revealing as to what IRIS is doing during startup, including CPF access, journal and WIJ activity, and the use of system tooling to get the job done.

 
sudo python3 statsnoop.py | grep iris

Flamegraphs

One of the coolest things I stumbled upon in the eBPF tooling world was Brendan Gregg's implementation of flame graphs on top of perf/BPF output, used to understand performance and stack traces.

Given the following perf recording during a start/stop of IRIS:

sudo perf record -F 99 -a -g -- sleep 60
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 3.701 MB perf.data (15013 samples) ]

Generate the flame graph with the following (stackcollapse-perf.pl and flamegraph.pl come from Brendan Gregg's FlameGraph repository):

sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > /tmp/gar.thing
./flamegraph.pl /tmp/gar.thing > flamegraph.svg

I gave uploading the SVG the college try, but it did not work out with this editor, and for some reason I was unable to attach it. Understand, though, that it is interactive and clickable to drill down into stack traces, beyond just looking cool.

  1. The function at the top is the function on-CPU; everything beneath it is its ancestry. The higher up the y-axis, the more deeply nested the function.
  2. The width of each function on the graph represents how often that function appeared in the samples relative to its parent; wider means more time.
  3. Finding functions that are both high on the y-axis (deeply nested) and wide on the x-axis (time-intensive) is a great way to narrow down performance and optimization issues.

"high and wide" <--- 👀

red == user-level

orange == kernel

yellow == C++

green == JIT, Java, etc.

I really liked the explanation of flamegraph interpretation laid out here (credit for the above), from which I derived a baseline understanding of how to read flame graphs. They are especially powerful for those running Python in IRIS productions with userland code and looking for optimizations.

Onward and upward. I hope this piqued your interest; now let's move on to the world of eBPF apps, where the pros have put together phenomenal solutions that put eBPF to work on fleets of systems safely and in a lightweight manner.
