
eBPF: Tracing Kernel Events for IRIS Workloads

 

I attended Cloud Native Security Con in Seattle with every intention of crushing OTel Day, then perusing security as applied to Cloud Native workloads over the following days leading up to the CTF, all as a professional exercise. That plan was happily upended by a new understanding of eBPF, which gave my screens, career, workloads, and attitude a much-needed upgrade, with new approaches to solving workload problems.

So I made it to the eBPF party and have been attending clinic after clinic on the subject ever since. Here I would like to "unbox" eBPF as a technical solution, map it directly to what we do in practice (even if it's a bit off), and step through my experimentation with eBPF in support of InterSystems IRIS workloads, particularly on Kubernetes, though much of it applies to standalone workloads as well.

eBee Steps with eBPF and InterSystems IRIS Workloads


 

eBPF

eBPF (extended Berkeley Packet Filter) is a killer Linux kernel feature that implements a VM within kernel space, making it possible to run sandboxed programs safely, with guardrails. These programs can "map" data into user space for observability, tracing, security, and networking. I think of it as a "sniffer" of the OS: where classic BPF was associated with networking, the extended version "sniffs" tracepoints, processes, scheduling, execution, and block device access. If you didn't buy my analogy of eBPF, here is one from the pros:

"What JavaScript is to the browser, eBPF is to the Linux Kernel"

JavaScript lets you attach callbacks to events in the DOM in order to bring dynamic features to your web page. In a similar fashion, eBPF allows you to hook into kernel events and extend their logic when those events are triggered!
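To make the analogy concrete, here is about the smallest "callback" I can write with bcc (a sketch of my own, assuming the bcc toolkit we install later in this article): hook a kernel event and fire some logic every time it triggers, the way you would attach an onclick handler in the DOM.

# Minimal bcc sketch: attach a "callback" to a kernel event.
# Here the event is entry into the clone() syscall (a process fork).
# Run with: sudo python3 hello_event.py
from bcc import BPF

PROG = r"""
int hello(void *ctx) {
    bpf_trace_printk("event fired\n");
    return 0;
}
"""

b = BPF(text=PROG)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()  # CTRL-C to stop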

Immediately Applicable

IF the following Prometheus metric seems impossible to you, employ eBPF to watch the processes that are supposed to be there and monitor in band through the kernel, as sketched just below the metric.

# HELP iris_instance_status The thing that's down telling us it's down.
# TYPE iris_instance_status gauge
iris_instance_status 0
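Here is a sketch of what that could look like (my own illustration, not a shipped exporter; it assumes bcc plus the prometheus_client package, and the port number is arbitrary): watch process exits in-kernel and flip the gauge the moment an iris process dies, instead of asking the thing that's down to tell you it's down.

# Sketch: in-band liveness via the kernel. Trace sched_process_exit
# and drop the iris_instance_status gauge when an iris process exits.
# Assumes bcc and prometheus_client are installed; run with sudo.
from bcc import BPF
from prometheus_client import Gauge, start_http_server

BPF_SOURCE_CODE = r"""
TRACEPOINT_PROBE(sched, sched_process_exit) {
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("exit: %s\n", comm);
    return 0;
}
"""

bpf = BPF(text=BPF_SOURCE_CODE)

status = Gauge("iris_instance_status", "IRIS liveness observed via eBPF")
status.set(1)
start_http_server(9101)  # arbitrary scrape port for this sketch

while True:
    try:
        (task, pid, cpu, flags, ts, msg) = bpf.trace_fields()
        if b"iris" in msg:
            status.set(0)  # an iris process just exited
    except ValueError:
        continue
    except KeyboardInterrupt:
        break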
    

IF you are tired of begging for the following resources for a sidecar just to get needed observability: goodbye, sidecars.

  iris-sidecar:    
    resources:
      requests:
        memory: "2Mi"
        cpu: "125m"

Where

One of the most satisfying things about how eBPF is applied is where it runs... in a VM, inside the kernel. And thanks to Linux namespacing, you can guess how powerful that is in a cloud native environment, let alone on a kernel sitting under some sort of virtualization, or on a big iron machine with admirable hardware.

 

Obligatory Hello World

For those of you who like to try things for yourselves, and from the "beginning" so to speak, I salute you with an obligatory Hello World, twisted to be a tad bit "irisy." That said, it's generally understood that programming eBPF directly won't become a frequently exercised skill; it stays concentrated among those doing Linux kernel development or building next-generation monitoring tools.

I run Pop!_OS/Ubuntu, and here is my cheat code for getting into the eBPF world quickly on 23.04:

sudo apt install -y zip bison build-essential cmake flex git libedit-dev \
  libllvm15 llvm-15-dev libclang-15-dev python3 zlib1g-dev libelf-dev libfl-dev python3-setuptools \
  liblzma-dev libdebuginfod-dev arping netperf iperf libpolly-15-dev
git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake ..
make
sudo make install
cmake -DPYTHON_CMD=python3 .. # build python3 binding
pushd src/python/
make
sudo make install
popd
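With the build done, a quick smoke test of my own (not part of the upstream instructions) confirms the Python binding landed and the toolchain can compile a program:

# Smoke test: compile a trivial eBPF program to confirm bcc and its
# Python binding are installed. Run with: sudo python3 smoketest.py
from bcc import BPF

BPF(text="int noop(void *ctx) { return 0; }")
print("bcc is ready")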

Next, ensure the target kernel was built with the required eBPF support...

cat /boot/config-$(uname -r) | grep 'CONFIG_BPF'
CONFIG_BPF=y

If `CONFIG_BPF=y` shows up in the output, we are good to go.
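CONFIG_BPF is not the only option in play; if you would like to check the related ones in one shot, here is a small convenience sketch of mine (same information as the grep above):

# Check the kernel config for the eBPF options the examples rely on:
# CONFIG_BPF_SYSCALL is the bpf() syscall itself, and CONFIG_BPF_EVENTS
# allows attaching programs to tracepoints.
import os

config_path = f"/boot/config-{os.uname().release}"
wanted = ("CONFIG_BPF=", "CONFIG_BPF_SYSCALL=", "CONFIG_BPF_EVENTS=")

with open(config_path) as f:
    for line in f:
        if line.startswith(wanted):
            print(line, end="")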

What we want to accomplish with this simple Hello World is to get visibility into when IRIS makes Linux system calls, using nothing but eBPF tooling and the kernel itself.

Here is a good way to go about exploration:

1️⃣ Find a Linux System Call of Interest

sudo ls /sys/kernel/debug/tracing/events/syscalls

For this example, we are going to trap directory creation through the tracepoint `sys_enter_mkdir`, then filter it down to IRIS.
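How do we know the probe will have a pathname argument to print? Every tracepoint publishes its argument layout in a format file; a quick peek (root required, like everything else here) shows the fields that land in args-> inside a TRACEPOINT_PROBE:

# Print the argument layout of the sys_enter_mkdir tracepoint.
# The "pathname" field listed here is what the probe below reads
# as args->pathname. Run with sudo.
FORMAT = "/sys/kernel/debug/tracing/events/syscalls/sys_enter_mkdir/format"

with open(FORMAT) as f:
    print(f.read())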



2️⃣ Insert it into the Following Hello World

Your BPF program to load and run lives in the variable BPF_SOURCE_CODE; modify it to include the syscall you want to trap.

# Example eBPF program attached to a Linux kernel tracepoint
# Modified to trap irisdb
# Requires bpfcc-tools
# To run: sudo python3 irismadeadir.py
from bcc import BPF
from bcc.utils import printb

BPF_SOURCE_CODE = r"""
TRACEPOINT_PROBE(syscalls, sys_enter_mkdir) {
    bpf_trace_printk("Directory was created by IRIS: %s\n", args->pathname);
    return 0;
}
"""
bpf = BPF(text=BPF_SOURCE_CODE)

print("Go create a dir with IRIS...")
print("CTRL-C to exit")

while True:
    try:
        (task, pid, cpu, flags, ts, msg) = bpf.trace_fields()
        #print(task.decode("utf-8"))
        if "iris" in task.decode("utf-8"):
            printb(b"%s-%-6d %s" % (task, pid, msg))
    except ValueError:
        continue
    except KeyboardInterrupt:
        break

 

3️⃣ Load into the Kernel, Run

Create a dir in IRIS

Inspect the trace!
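If you would like to sanity-check the probe before involving IRIS, you can trigger the same syscall yourself; a minimal sketch follows (note that the tracer above filters for task names containing "iris", so relax that filter while testing):

# Trigger the mkdir syscall outside of IRIS to verify the probe fires.
# The task name will be "python3", not "iris", so loosen the filter
# in the tracer while testing.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "ebpf-hello-world")
os.mkdir(path)   # fires sys_enter_mkdir
os.rmdir(path)   # clean up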

eBPF Powered Binaries

It doesn't take long going through the bcc repository to realize there are plenty of examples, tools, and binaries out there that take advantage of eBPF to do fun tracing, and "grep" in this case will suffice to derive some value.

 

Let's do just that on a start and stop of IRIS with some supplied examples.

execsnoop: Trace new processes via exec() syscalls.

This one tells the tale of the arguments passed to irisdb on start/stop.

 
sudo python3 execsnoop.py | grep iris

statsnoop: Trace stat() syscalls... stat() returns file attributes for an inode, so tracing it reveals file/dir access.

This one is informative at the directory and file level during a start/stop... a bit chatty, but revealing as to what IRIS is doing during startup, including CPF access, journal and WIJ activity, and the use of system tooling to get the job done.

 
sudo python3 statsnoop.py | grep iris

Flamegraphs

One of the coolest things I stumbled upon in the eBPF tooling world was Brendan Gregg's implementation of flame graphs on top of perf/BPF output, used to understand performance and stack traces.

Given the following perf recording during a start/stop of IRIS:

sudo perf record -F 99 -a -g -- sleep 60
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 3.701 MB perf.data (15013 samples) ]

Generate the flame graph with the following (stackcollapse-perf.pl and flamegraph.pl come from Brendan Gregg's FlameGraph repository):

sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > /tmp/gar.thing
./flamegraph.pl /tmp/gar.thing > flamegraph.svg

I gave uploading the SVG the college try, but it did not work out with this editor, and for some reason I was unable to attach it. Understand, though, that it is interactive and clickable to drill down into stack traces, beyond just looking cool.

  1. The function at the top is the function on-CPU; everything beneath it is its ancestry. The higher up the y-axis, the more deeply nested the function.
  2. The width of each function on the graph represents how often that function appeared in the samples relative to its parent; wider means more time.
  3. Finding functions that are both high on the y-axis (deeply nested) and wide on the x-axis (time-intensive) is a great way to narrow down performance and optimization issues.

"high and wide" <--- 👀

red == user-level

orange == kernel

yellow == C++

green == JIT, Java, etc.

I really liked the explanation of flamegraph interpretation laid out here (credit for the above), from which I derived a baseline understanding of how to read flame graphs. They are especially powerful for those running Python in IRIS productions with userland code and looking for optimizations.

Onward and upward. I hope this piqued your interest; now let's move on to the world of eBPF apps, where the pros have put together phenomenal solutions that put eBPF to work on fleets of systems safely and in a lightweight manner.
