Article
sween · Oct 4, 2024
Runtime Enforcement
So far in the eBPF Journey applied to InterSystems Workloads, we've been pretty much read-only when it comes to system calls, binary execution, and file monitoring. But much like the Network Security Policies from the last post that enforce connectivity, what if we could enforce system calls, file access, and processes in the same manner across an entire cluster?
Enter Tetragon, a flexible, Kubernetes-aware security observability and runtime enforcement tool that applies policy and filtering directly with eBPF, allowing for reduced observation overhead, tracking of any process, and real-time enforcement of policies.
Enforcement when your application can't provide it.
Where it Runs
Observability and Enforcement Cluster Wide
Up and Running
The obligatory steps to get up and running if you choose to do so, performed in the style of an Isovalent lab.
Cluster
Kind cluster, 3 worker nodes wide, without a default CNI.
kind.sh
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
EOF
Cilium
Install Cilium, if for nothing else, as a CNI.
cilium.sh
cilium install --version 1.16.2
Tetragon
Here we install the star of our show, Tetragon, as a DaemonSet.
tetragon.sh
EXTRA_HELM_FLAGS=(--set tetragon.hostProcPath=/proc) # flags for helm install
helm repo add cilium https://helm.cilium.io
helm repo update
helm install tetragon ${EXTRA_HELM_FLAGS[@]} cilium/tetragon -n kube-system
kubectl rollout status -n kube-system ds/tetragon -w
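A quick sanity check before moving on never hurts; this is a hedged aside, and the label selector below assumes the chart's default labels, so adjust if yours differ:
# confirm the agents are up and the tetra CLI inside them answers
kubectl get pods -n kube-system -l app.kubernetes.io/name=tetragon
kubectl exec -n kube-system ds/tetragon -c tetragon -- tetra version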
IRIS Workload
Quick IRIS pod, not privileged, but easily modified to be so. This is the pod we will be executing things on to explain some of the tracing policy behavior.
iris.sh
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: iris-nopriv
  labels:
    app: iris
spec:
  imagePullSecrets:
  - name: isc-pull
  #hostPID: true      # priv flag
  #hostNetwork: true  # priv flag
  containers:
  - name: iris-priv
    image: containers.intersystems.com/intersystems/iris-community:2024.2
    ports:
    - containerPort: 80
    #securityContext:
    #  privileged: true # priv flag
EOF
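Before poking at it, give the pod a moment to come up (plain kubectl, nothing Tetragon-specific):
# wait for the IRIS container to be ready, then note which node it landed on
kubectl wait --for=condition=Ready pod/iris-nopriv --timeout=300s
kubectl get pod iris-nopriv -o wide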
Tracing Policies
TracingPolicies are custom resources that make it easy to setup real-time filters for kernel events. A TracingPolicy matches and filters system calls for observability and also triggers an action on these matches.
Right out of the box though, you get process_exec and process_exit events without having to load any tracing policies.
In one terminal, execute your $ZF callout; in the other, examine the Tetragon events:
kubectl exec -ti -n kube-system tetragon-sw9k4 -c tetragon -- tetra getevents -o compact --pods iris-nopriv
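We use the compact output here; the same tetra call emits full JSON by default, which is roughly where the direct.json and shell.json samples further down come from. A hedged one-liner, assuming jq is on your path:
# full JSON events for the pod, keeping only the exec events
kubectl exec -ti -n kube-system tetragon-sw9k4 -c tetragon -- \
  tetra getevents --pods iris-nopriv | jq 'select(.process_exec != null)'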
Let's take a look at the process execution Tetragon records for the following callout, which prints the current working directory.
This may be obvious to you, but running $ZF with the "/SHELL" argument invokes bash, which then runs the command, whereas when it is omitted, the binary is called directly. We used the compact output above, but if you observe the events in JSON format, you can see how they are called differently: the /SHELL option shows /bin/bash as the executed binary with irisdb as its parent, while the direct call shows /usr/bin/pwd.
ie:
"cwd": "/usr/irissys/bin",
"binary": "/usr/irissys/bin/irisdb",
"arguments": "-w /home/irisowner -s /usr/irissys/mgr",
"flags": "execve",
direct.json
{
"process_exec": {
"process": {
"exec_id": "a2luZC13b3JrZXIyOjI3OTI5NDExNjY1NzMzMzozMDA4MzAw",
"pid": 3008300,
"uid": 51773,
"cwd": "/usr/irissys/mgr/user",
"binary": "/usr/bin/pwd",
"flags": "execve clone",
"start_time": "2024-10-01T02:39:42.761250745Z",
"auid": 4294967295,
"pod": {
"namespace": "default",
"name": "iris-priv",
"container": {
"id": "containerd://5f5df2f9ff01fc737c88f83254ca37c0c214ff7a648c9eabd5f01edcc0804e56",
"name": "iris-priv",
"image": {
"id": "containers.intersystems.com/intersystems/iris-community@sha256:493c073cc968f1053511e6cf56301767ab304f21a60afdf65543e56ef1217cb4",
"name": "containers.intersystems.com/intersystems/iris-community:2024.2"
},
"start_time": "2024-10-01T02:12:16Z",
"pid": 68242
},
"pod_labels": {
"app": "iris"
},
"workload": "iris-priv",
"workload_kind": "Pod"
},
"docker": "5f5df2f9ff01fc737c88f83254ca37c",
"parent_exec_id": "a2luZC13b3JrZXIyOjI3NzgzMjYyNzQ1NzA1ODoyOTc4MzQ0",
"tid": 3008300
},
"parent": {
"exec_id": "a2luZC13b3JrZXIyOjI3NzgzMjYyNzQ1NzA1ODoyOTc4MzQ0",
"pid": 2978344,
"uid": 51773,
"cwd": "/usr/irissys/bin",
"binary": "/usr/irissys/bin/irisdb",
"arguments": "-w /home/irisowner -s /usr/irissys/mgr",
"flags": "execve",
"start_time": "2024-10-01T02:15:21.272075833Z",
"auid": 4294967295,
"pod": {
"namespace": "default",
"name": "iris-priv",
"container": {
"id": "containerd://5f5df2f9ff01fc737c88f83254ca37c0c214ff7a648c9eabd5f01edcc0804e56",
"name": "iris-priv",
"image": {
"id": "containers.intersystems.com/intersystems/iris-community@sha256:493c073cc968f1053511e6cf56301767ab304f21a60afdf65543e56ef1217cb4",
"name": "containers.intersystems.com/intersystems/iris-community:2024.2"
},
"start_time": "2024-10-01T02:12:16Z",
"pid": 67506
},
"pod_labels": {
"app": "iris"
},
"workload": "iris-priv",
"workload_kind": "Pod"
},
"docker": "5f5df2f9ff01fc737c88f83254ca37c",
"parent_exec_id": "a2luZC13b3JrZXIyOjI3NzgzMjYyNjUzMzA5NzoyOTc4MzQ0",
"tid": 2978344
}
},
"node_name": "kind-worker2",
"time": "2024-10-01T02:39:42.761250372Z"
}
shell.json
{
"process_exec": {
"process": {
"exec_id": "a2luZC13b3JrZXIyOjI3OTAxMjk3OTY2NTcxMTozMDA1Mzcy",
"pid": 3005372,
"uid": 51773,
"cwd": "/usr/irissys/mgr/user",
"binary": "/bin/bash",
"arguments": "-p -c pwd",
"flags": "execve clone",
"start_time": "2024-10-01T02:35:01.624258293Z",
"auid": 4294967295,
"pod": {
"namespace": "default",
"name": "iris-priv",
"container": {
"id": "containerd://5f5df2f9ff01fc737c88f83254ca37c0c214ff7a648c9eabd5f01edcc0804e56",
"name": "iris-priv",
"image": {
"id": "containers.intersystems.com/intersystems/iris-community@sha256:493c073cc968f1053511e6cf56301767ab304f21a60afdf65543e56ef1217cb4",
"name": "containers.intersystems.com/intersystems/iris-community:2024.2"
},
"start_time": "2024-10-01T02:12:16Z",
"pid": 68097
},
"pod_labels": {
"app": "iris"
},
"workload": "iris-priv",
"workload_kind": "Pod"
},
"docker": "5f5df2f9ff01fc737c88f83254ca37c",
"parent_exec_id": "a2luZC13b3JrZXIyOjI3NzgzMjYyNzQ1NzA1ODoyOTc4MzQ0",
"tid": 3005372
},
"parent": {
"exec_id": "a2luZC13b3JrZXIyOjI3NzgzMjYyNzQ1NzA1ODoyOTc4MzQ0",
"pid": 2978344,
"uid": 51773,
"cwd": "/usr/irissys/bin",
"binary": "/usr/irissys/bin/irisdb",
"arguments": "-w /home/irisowner -s /usr/irissys/mgr",
"flags": "execve",
"start_time": "2024-10-01T02:15:21.272075833Z",
"auid": 4294967295,
"pod": {
"namespace": "default",
"name": "iris-priv",
"container": {
"id": "containerd://5f5df2f9ff01fc737c88f83254ca37c0c214ff7a648c9eabd5f01edcc0804e56",
"name": "iris-priv",
"image": {
"id": "containers.intersystems.com/intersystems/iris-community@sha256:493c073cc968f1053511e6cf56301767ab304f21a60afdf65543e56ef1217cb4",
"name": "containers.intersystems.com/intersystems/iris-community:2024.2"
},
"start_time": "2024-10-01T02:12:16Z",
"pid": 67506
},
"pod_labels": {
"app": "iris"
},
"workload": "iris-priv",
"workload_kind": "Pod"
},
"docker": "5f5df2f9ff01fc737c88f83254ca37c",
"parent_exec_id": "a2luZC13b3JrZXIyOjI3NzgzMjYyNjUzMzA5NzoyOTc4MzQ0",
"tid": 2978344
}
},
"node_name": "kind-worker2",
"time": "2024-10-01T02:35:01.624257883Z"
}
The JSON events are written to the Tetragon log and can be shipped to a SIEM or observability system for actionable insights.
Runtime Enforcement
This may lack a little bit of imagination as a use case, but what if we wanted to forbid anybody from calling out and "catting" the license file? For this, we need to apply a TracingPolicy that enforces a matchAction. These policies are a little involved, but this one is the long way of saying "Hey, if you run `cat /usr/irissys/mgr/iris.key`, I am going to kill you (SIGKILL you)."
kubectl apply -f - <<EOF
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "iris-read-file-sigkill"
spec:
  kprobes:
  - call: "fd_install"
    syscall: false
    return: false
    args:
    - index: 0
      type: int
    - index: 1
      type: "file"
    selectors:
    - matchPIDs:
      - operator: NotIn
        followForks: true
        isNamespacePID: true
        values:
        - 1
      matchArgs:
      - index: 1
        operator: "Prefix"
        values:
        - "/usr/irissys/secrets/"
        - "/usr/irissys/mgr/iris.key"
        #- "/tmp/" # wreaks havoc
      matchActions:
      - action: FollowFD
        argFd: 0
        argName: 1
  - call: "__x64_sys_close"
    syscall: true
    args:
    - index: 0
      type: "int"
    selectors:
    - matchActions:
      - action: UnfollowFD
        argFd: 0
        argName: 0
  - call: "__x64_sys_read"
    syscall: true
    args:
    - index: 0
      type: "fd"
    - index: 1
      type: "char_buf"
      returnCopy: true
    - index: 2
      type: "size_t"
    selectors:
    - matchActions:
      - action: Sigkill
EOF
Once deployed, you should see it loaded as a TracingPolicy resource:
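A couple of ways to confirm that from the command line, sketched below (the tetra subcommand assumes a reasonably recent Tetragon release):
# the CRD view from the API server
kubectl get tracingpolicies.cilium.io
# ...and what the agent itself thinks is loaded
kubectl exec -n kube-system ds/tetragon -c tetragon -- tetra tracingpolicy list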
So let's see it enforce the policy:
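If you would rather trigger it from outside the pod instead of through a $ZF callout, something along these lines does the trick (a hypothetical session; the exact error you see depends on where the SIGKILL lands):
# try to read the license key from the pod; Tetragon kills the reader mid-read
kubectl exec -it iris-nopriv -- cat /usr/irissys/mgr/iris.key
# typically surfaces as something like:
#   command terminated with exit code 137   (128 + SIGKILL)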
The -1 tells us something is awry, and the command was unsuccessful.
Unbeknownst to the fellow brogrammer, though, we administratively blocked it and sent a SIGKILL to the process!
That is going to be a long call to the WRC for the unsuspecting end user (or WRC specialist).
Experiments
I found a couple of interesting ones among the hundreds I stole, applied, and played around with; notable was one that gave up the system calls per binary. If you really wanted to nerd out, you could literally block by syscall.
Another one that was mesmerizing was the file access TracingPolicy, which showed all processes accessing all the files.
These and other policies can be found in the examples repo @ tetragon:
System calls
Process attributes
Command-line arguments
Network activity
File system operations
Indeed eBPF is powerful, and Tetragon looks like a great solution to secure or manage the security of your K8s cluster, like Sysdig Falco and similar. Great work Ron! And thanks for sharing your work! Love it!
Q: When I used Sysdig in the past it was amazing, however, not very light on system resources. Were you able to quantify the effect of Tetragon on a node? Thanks
A: Hi Luca, admittedly I am swimming in a pretty good amount of Kool-Aid at the moment (by design), as I am attracted to solutions that check a lot of boxes in one fell swoop (much like IRIS!), but one of the things that attracted me to the eBPF space is the promise that it is more "lightweight" on resources and the promised "Death of the Sidecar." My pre-answer, with a little hand waving: more efficient with the network stuff. Layer 5 and downward, less resource intensive. Layer 7, more resource intensive... the sidecar did not die here. eBPF is less intensive than kube-proxy (iptables), which uses sequential processing (Isovalent benched this and stands by it). Wish I could have looked over your shoulder in the Sysdig eval to see what you were experiencing; if you could paraphrase it and send it my way, I'll take it to the booth. Will circle back to this after KubeCon next month and see if I can get you a real answer backed by my own evaluation. I'm a meeting or two away from a Tetragon Enterprise eval too, and a setup on bare metal, so I'll add it to the takeaways to get my money's worth.
Article
sween · Sep 19, 2024
Anakin Skywalker challenged the high ground and has been terribly injured on Mustafar.
He is a relatively new employee of the Galactic Empire, covered by a Large Group Planetary Plan, and now has an active encounter in progress for emergent medical services in the Grand Medical Facility on Coruscant. The EMR deployed for the Galactic Health System is powered by InterSystems FHIR Server running on Kubernetes, protected by Cilium.
Let's recreate the technical landscape, to be performed in the style of Isovalent Labs...
Kind Cluster
Let's fire up a 3 node cluster, and disable the CNI so we can replace it with Cilium.
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
EOF
This will provision the kind cluster, 3 nodes wide with a single control plane.
Cilium
Cilium is an open-source project that provides networking, security, and observability for containerized environments like Kubernetes clusters. It uses a Linux kernel technology called eBPF (extended Berkeley Packet Filter) to inject security, networking, and observability logic into the kernel. In other words, it wields the Force.
cilium install --version v1.16.0
cilium status --wait
Hubble
Hubble is a clown suit for Cilium, providing ridiculous visibility into which of Cilium's powers are in play in real time.
cilium hubble enable
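If you also want the Hubble UI referenced later, the CLI can enable it and open a local port-forward for you (a hedged aside, using the current cilium CLI syntax):
cilium hubble enable --ui
cilium hubble ui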
InterSystems FHIR Workload
InterSystems is the GOAT of interoperability, and transforms Healthcare Data like a protocol Droid.
kubectl apply -f https://raw.githubusercontent.com/sween/basenube/main/scenarios/ciliumfhir/deploy/cilium-fhir-starwars.yaml
The resulting workload has 4 deployments:
GrandMedicalFacility
Integrated Delivery Network based in Coruscant, with facilities as far as the Outer Rim, runs Epic and utilizes InterSystems I4H as a FHIR Server.
MedicalDroid FX-6
This 1.83-meter-tall droid supplied Vader with a blood transfusion and trained in cybernetic legs procedures.
MedicalDroid DD-13
Also known as the DD-13 tripedal medical droid, this droid has three legs for stability and was designed to install cybernetic implants.
MedicalDroid 2-1B
2-1B droids have hypodermic injectors and precision-crafted servogrip pincers, and can be upgraded to specialize in areas like cybernetic limb replacement, neurosurgery, and alien biology.
Since we will need it anyway for upcoming interviews, let's tell the story in true STAR (Sith-uation, Task, Action, Result) methodology.
Sith-uation
Palpatine accompanied the fallen jedi to the facility, and upon arrival helped registration admit him as Darth Vader.
cat > vader.json << 'EOF'
{
  "name": [
    {
      "use": "official",
      "family": "Vader",
      "given": [
        "Darth"
      ]
    }
  ],
  "gender": "male",
  "id": "DarthVader",
  "birthDate": "1977-05-25",
  "resourceType": "Patient"
}
EOF
curl -v -X PUT \
-H "Content-Type: application/fhir+json" \
-d @vader.json \
"http://coruscanthealth:52773/intersystems/fhir/r5/Patient/DarthVader"
Darth Vader is now registered, and can be seen throughout the Health System...
Galactic IT Outage
There is a problem though!
Shortly after registration, a Galactic IT Outage occurred, making the Identity Provider for the Health System unavailable. The InterSystems FHIR Resource Server is SMART enabled, and with the IDP casters up, EMR launches are impossible in the absence of a JWT carrying the applicable scopes to protect the routes.
Sure as Bantha fodder, we definitely have a problem... the care team cannot access the patient record, nothing but 401s and 403s, and we're not talking about your Galactic Retirement Plan. Although the Hubble UI only provides a hint of what is going on, inspecting the Hubble flows with Layer 7 information reveals the sithuation.
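One hedged way to see those Layer 7 verdicts from the CLI rather than the UI (flag names per the hubble CLI; adjust the namespace to wherever the FHIR pods live):
# follow HTTP flows in the FHIR namespace and keep only the 401s
hubble observe --namespace galactic-fhir --protocol http --http-status 401 --follow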
...and adding some debugs bunny to the InterSystems FHIR endpoint confirms it.
FHIR Debug
zn "USER"
Set ^FSLogChannel("all")=1
zn "%SYS"
Set ^%ISCLOG=5
Set ^%ISCLOG("Category","HSFHIR")=5
Set ^%ISCLOG("Category","HSFHIRServer")=5
Set ^%ISCLOG("Category","OAuth2")=5
Set ^%ISCLOG("Category","OAuth2Server")=5
zw ^FSLOG
...
^FSLOG(379555)="DispatchRequest^HS.FHIRServer.Service^944|Msg|Dispatch interaction read for Patient/DarthVader|09/19/2024 10:48:20.833339AM"
^FSLOG(379556)="DispatchRequest^HS.FHIRServer.Service^944|Msg|Request Completed in .000186 secs: Patient/DarthVader|09/19/2024 10:48:20.833450AM"
^FSLOG(379557)="processRequest^HS.FHIRServer.RestHandler^944|Msg|Response Status: 401, Json: Patient|09/19/2024 10:48:20.833454AM"
...
Task
Action
With the route enforcement from SMART not applicable, let's do this our way and use Cilium to protect the endpoints while Vader gets the immediate attention the Emperor demands. We will go Rogue One here on the cluster and hand off the endpoint/route protection to Cilium while the Galaxy figures itself out from the outage. Let's institute a deny-all, from everywhere, with a CiliumClusterwideNetworkPolicy, and work backwards zero-trust style.
cat <<EOF | kubectl apply -n galactic-fhir -f-
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "denyall-coruscanthealth"
spec:
  description: "Block all the traffic (except DNS) by default"
  egress:
  - toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: '53'
        protocol: UDP
      rules:
        dns:
        - matchPattern: '*'
  endpointSelector:
    matchExpressions:
    - key: io.kubernetes.pod.namespace
      operator: NotIn
      values:
      - kube-system
EOF
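To watch the deny-all do its thing, the verdicts are easy to tail (again a sketch using the hubble CLI):
# everything that is not DNS should now show up as DROPPED
hubble observe --namespace galactic-fhir --verdict DROPPED --follow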
Looking good, Cilium dropping it like it's hot! Now, let's open up the FHIR endpoint on the InterSystems pod, disabling the OAuth2 client.
set app = "/intersystems/fhir/r5"
Set strategy = ##class(HS.FHIRServer.API.InteractionsStrategy).GetStrategyForEndpoint(app)
Set configData = strategy.GetServiceConfigData()
// 7 = Mass Openness
Set configData.DebugMode = 7
Do strategy.SaveServiceConfigData(configData)
Lastly, let's create a CiliumNetworkPolicy to allow anybody from org: empire access to the route for Darth Vader's record in the galactic-fhir namespace.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "l7-visibility"
spec:
endpointSelector:
matchLabels:
org: empire
egress:
- toPorts:
- ports:
- port: "53"
protocol: ANY
rules:
dns:
- matchPattern: "*"
- toEndpoints:
- matchLabels:
"k8s:io.kubernetes.pod.namespace": galactic-fhir
toPorts:
- ports:
- port: "52773"
protocol: TCP
rules:
http:
- method: "GET"
path: "/intersystems/fhir/r5/Patient/DarthVader"
- method: "HEAD"
path: "/intersystems/fhir/r5/Patient/DarthVader"
EOF
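A hedged smoke test from an empire-labeled pod; the deployment name below is hypothetical, so substitute whatever the GrandMedicalFacility workload is actually called, and it assumes curl exists in that image:
# should come back 200 for Vader's record; other paths get a 403 from the Cilium proxy
kubectl exec -n galactic-fhir deploy/grandmedicalfacility -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  "http://coruscanthealth:52773/intersystems/fhir/r5/Patient/DarthVader"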
Looks like we may be able to get back to iRacing, I think we are good.
...except
Yeah, looks like the payer is getting dropped... Policy verdict = DROPPED
Let's add another policy, allowing org: payer access to Vader's route:
cat <<EOF | kubectl apply -n galactic-fhir -f-
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "l7-visibility-payer"
spec:
endpointSelector:
matchLabels:
org: payer
egress:
- toPorts:
- ports:
- port: "53"
protocol: ANY
rules:
dns:
- matchPattern: "*"
- toEndpoints:
- matchLabels:
"k8s:io.kubernetes.pod.namespace": galactic-fhir
toPorts:
- ports:
- port: "52773"
protocol: TCP
rules:
http:
- method: "GET"
path: "/intersystems/fhir/r5/Patient/DarthVader"
- method: "HEAD"
path: "/intersystems/fhir/r5/Patient/DarthVader"
EOF
Welp, that did not quite cut it, and we can see why. So we gave the payer a call and told them to access the "correct" patient record, and Anakin Vader gets his legs.
Rant time...
Result
Great article, Ron!
ICD, DSM, SNOMED, and other classifiers on Coruscant must be crazy.
Article
sween · Sep 9, 2024
I attended Cloud Native Security Con in Seattle with the full intention of crushing OTel Day, then perusing the subject of security applied to Cloud Native workloads in the days leading up to the CTF as a professional exercise. This was happily upended by a new understanding of eBPF, which gave my screens, career, workloads, and attitude a much needed upgrade with new approaches to solving workload problems.
So I made it to the eBPF party and have been attending clinic after clinic on the subject ever since. Here I would like to "unbox" eBPF as a technical solution, mapped directly to what we do in practice (even if it's a bit off), and step through eBPF via my experimentation in supporting InterSystems IRIS workloads, particularly on Kubernetes, but not necessarily irrelevant to standalone workloads.
eBee Steps with eBPF and InterSystems IRIS Workloads
eBPF
eBPF (extended Berkeley Packet Filter) is a killer Linux kernel feature that implements a VM within kernel space and makes it possible to run sandboxed apps safely, with guardrails. These apps can "map" data into user land for observability, tracing, security, and networking. I think of it as a "sniffer" of the OS: traditionally it was associated with BPF and networking, while the extended version "sniffs" tracepoints, processes, scheduling, execution, and block device access. If you didn't buy my analogy of eBPF, here is one from the pros:
"What JavaScript is to the browser, eBPF is to the Linux Kernel"
JavaScript lets you attach callbacks to events in the DOM in order to bring dynamic features to your web page. In a similar fashion, eBPF allows you to hook into kernel events and extend their logic when these events are triggered!
Immediately Applicable
If the following Prometheus metric seems impossible to you, employ eBPF to watch the processes that are supposed to be there and monitor them in band, through the kernel.
# HELP iris_instance_status The thing that's down telling us it's down.
# TYPE iris_instance_status gauge
iris_instance_status 0
If you are tired of begging for the following resources for a sidecar just to get needed observability: goodbye, sidecars.
iris-sidecar:
  resources:
    requests:
      memory: "2Mi"
      cpu: "125m"
Where
One of the most satisfying things about how eBPF is applied is where it runs... in a VM, inside the kernel. And thanks to Linux namespacing, you can guess how powerful that is in a cloud native environment, let alone a kernel sitting on some sort of virtualization or a big iron ghetto blaster machine with admirable hardware.
Obligatory Hello World
For those of you who like to try things for themselves and from the "beginning" so to speak, I salute you with an obligatory Hello World, twisted to be a tad bit "irisy." However, it's mostly understood that programming in eBPF won't become a frequently exercised skill; it stays concentrated among individuals who do Linux kernel development or build next generation monitoring tools.
I run Pop OS/Ubuntu, and here is my cheat code to getting into the eBPF world quickly on 23.04:
sudo apt install -y zip bison build-essential cmake flex git libedit-dev \
libllvm15 llvm-15-dev libclang-15-dev python3 zlib1g-dev libelf-dev libfl-dev python3-setuptools \
liblzma-dev libdebuginfod-dev arping netperf iperf libpolly-15-dev
git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake ..
make
sudo make install
cmake -DPYTHON_CMD=python3 .. # build python3 binding
pushd ../src/python/
make
sudo make install
popd
cd bcc
make install
First ensure the target kernel has the required stuff...
cat /boot/config-$(uname -r) | grep 'CONFIG_BPF'
CONFIG_BPF=y
If `CONFIG_BPF=y` is in your window somewhere, we are good to go.
What we want to accomplish with this simple hello world is visibility into when IRIS is making Linux system calls, without the use of anything but eBPF tooling and the kernel itself.
Here is a good way to go about exploration:
1️⃣ Find a Linux System Call of Interest
sudo ls /sys/kernel/debug/tracing/events/syscalls
For this example, we are going to trap when somebody (modified to trap IRIS), creates a directory through the syscall `sys_enter_mkdir`.
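If you would rather grep than scroll, narrowing the list down is a one-liner:
# mkdir-related tracepoints available on this kernel
sudo ls /sys/kernel/debug/tracing/events/syscalls | grep mkdir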
2️⃣ Insert it into the Following Hello World
Your BPF program to load and run is in the variable BPF_SOURCE_CODE, modify it to include the syscall you want to trap.
# Example eBPF program to a Linux kernel tracepoint
# Modified to trap irisdb
# requires bpfcc-tools
# To run: sudo python3 irismadeadir.py
from bcc import BPF
from bcc.utils import printb
BPF_SOURCE_CODE = r"""
TRACEPOINT_PROBE(syscalls, sys_enter_mkdir) {
bpf_trace_printk("Directory was created by IRIS: %s\n", args->pathname);
return 0;
}
"""
bpf = BPF(text = BPF_SOURCE_CODE)
print("Go create a dir with IRIS...")
print("CTRL-C to exit")
while True:
try:
(task, pid, cpu, flags, ts, msg) = bpf.trace_fields()
#print(task.decode("utf-8"))
if "iris" in task.decode("utf-8"):
printb(b"%s-%-6d %s" % (task, pid, msg))
except ValueError:
continue
except KeyboardInterrupt:
break
3️⃣ Load into the Kernel, Run
Create a dir in IRIS
Inspect the trace!
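One hedged way to make IRIS trip the probe without hand-typing in a terminal session is to pipe a command into iris session (the instance name, namespace, and class used here are assumptions for a default container install):
# have irisdb call mkdir() so the tracepoint fires
echo 'Do ##class(%Library.File).CreateDirectoryChain("/tmp/ebpf-hello") Halt' | iris session IRIS -U USER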
eBPF Powered Binaries
It doesn't take long going through the bcc repository to realize that there are plenty of examples, tools, and binaries out there that take advantage of eBPF to do fun tracing, and "grep" in this case will suffice to derive some value.
Let's do just that on a start and stop of IRIS with some supplied examples.
execsnoop: Trace new processes via exec() syscalls.
This one here tells a tale of the arguments to irisdb on start/stop.
sudo python3 execsnoopy.py | grep iris
iris 3014275 COMM PID PPID RET ARGS
3011645 0 /usr/bin/iris stop IRIS quietly restart
irisstop 3014275 3011645 0 /usr/irissys/bin/irisstop quietly restart
irisdb 3014276 3014275 0 ./irisdb -s/data/IRIS/mgr/ -cV
irisdb 3014277 3014275 0 ./irisdb -s/data/IRIS/mgr/ -U -B OPT^SHUTDOWN(1)
irisdb 3014279 3014275 0 ./irisdb -s/data/IRIS/mgr/ -cV
irisdb 3014280 3014275 0 ./irisdb -s/data/IRIS/mgr/ -cV
sh 3014281 3014275 0 /bin/sh -c -- /usr/irissys/bin/irisdb -s /data/IRIS/mgr/ -cL
irisdb 3014282 3014281 0 /usr/irissys/bin/irisdb -s /data/IRIS/mgr/ -cL
irisdb 3014283 3014275 0 ./irisdb -s/data/IRIS/mgr/ -cV
irisrecov 3014284 3014275 0 ./irisrecov /data/IRIS/mgr/ quietly
iriswdimj 3014678 3014284 0 /usr/irissys/bin/iriswdimj -t
iriswdimj 3014679 3014284 0 /usr/irissys/bin/iriswdimj -j /data/IRIS/mgr/
rm 3014680 3014284 0 /usr/bin/rm -f iris.use
irisdb 3014684 3014275 0 ./irisdb -s/data/IRIS/mgr/ -w/data/IRIS/mgr/ -cd -B -V CLONE^STU
sh 3014685 3014275 0 /bin/sh -c -- /usr/irissys/bin/irisdb -s /data/IRIS/mgr/ -cL
irisdb 3014686 3014685 0 /usr/irissys/bin/irisdb -s /data/IRIS/mgr/ -cL
irisdb 3014687 3014275 0 ./irisdb -s/data/IRIS/mgr/ -cV
irisrecov 3014688 3014275 0 ./irisrecov /data/IRIS/mgr/ quietly
iriswdimj 3015082 3014688 0 /usr/irissys/bin/iriswdimj -t
iriswdimj 3015083 3014688 0 /usr/irissys/bin/iriswdimj -j /data/IRIS/mgr/
rm 3015084 3014688 0 /usr/bin/rm -f iris.use
irisdb 3015088 3014275 0 ./irisdb -s/data/IRIS/mgr/ -w/data/IRIS/mgr/ -cc -B -C/data/IRIS/iris.cpf*IRIS
irisdb 3015140 3014275 0 ./irisdb -s/data/IRIS/mgr/ -w/data/IRIS/mgr/ -U -B -b1024 -Erunlevel=sys/complete QUIET^STU
irisdb 3015142 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 START^MONITOR
irisdb 3015143 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 START^CLNDMN
irisdb 3015144 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 ErrorPurge^Config.Startup
irisdb 3015145 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 START^LMFMON
irisdb 3015146 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 ^RECEIVE
irisdb 3015147 3015146 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p16 SCAN^JRNZIP
irisdb 3015148 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 OneServerJob^STU
irisdb 3015149 3015148 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p19 Master^%SYS.SERVER
irisdb 3015150 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 systemRestart^%SYS.cspServer2
irisdb 3015151 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 SERVERS^STU1
requirements_ch 3015152 3015140 0 /usr/irissys/bin/requirements_check
dirname 3015153 3015152 0 /usr/bin/dirname /usr/irissys/bin/requirements_check
httpd 3015215 3015151 0 /usr/irissys/httpd/bin/httpd -f /data/IRIS/httpd/conf/httpd.conf -d /usr/irissys/httpd -c Listen 52773
irisdb 3015362 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 OnSystemStartup^HS.FHIRServer.Util.SystemStartup
irisdb 3015363 3015140 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p13 OnSystemStartup^HS.HC.Util.SystemStartup
irisdb 3015364 3015151 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p21 RunManager^%SYS.Task
irisdb 3015365 3015151 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p21 Start^%SYS.Monitor.Control
irisdb 3015366 3015151 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p21 Daemon^LOGDMN
irisdb 3015367 3015151 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p21 RunDaemon^%SYS.WorkQueueMgr
irisdb 3015368 3015151 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p21 RunRemoteQueueDaemon^%SYS.WorkQueueMgr
irisdb 3015369 3015362 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p19 RunAll^HS.HC.Util.Installer.Upgrade.BackgroundItem
irisdb 3015370 3014275 0 ./irisdb -s/data/IRIS/mgr/ -cV
irisdb 3015436 3015367 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p25 startWork^%SYS.WorkQueueMgr
irisdb 3015437 3015367 0 /usr/irissys/bin/irisdb -s/data/IRIS/mgr -cj -p25 startWork^%SYS.WorkQueueMgr
statsnoop: Trace stat() syscalls; stat() returns file attributes about an inode, i.e. file/dir access.
This one is informative about dir and file level access during a start/stop... a bit chatty, but it shows what IRIS is doing during startup, including cpf access, journals, WIJ activity, and the use of system tooling to get the job done.
sudo python3 statsnoop.py | grep iris
3016831 irisdb 0 0 /data/IRIS/mgr/
3016831 irisdb 0 0 /data/IRIS/mgr/
3016825 irisstop 0 0 /data/IRIS/mgr
3016825 irisstop 0 0 /usr/irissys/bin/irisuxsusr
3016825 irisstop 0 0 ./irisdb
3016825 irisstop 0 0 ../bin
3016832 sh -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v3/
3016832 sh -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v2/
3016832 sh 0 0 /usr/irissys/bin/
3016832 sh 0 0 /home/irisowner
3016833 irisdb -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v3/
3016833 irisdb -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v2/
3016833 irisdb 0 0 /usr/irissys/bin/
3016833 irisdb 0 0 /data/IRIS/mgr/
3016833 irisdb 0 0 /data/IRIS/mgr/
3016833 irisdb 0 0 /data/IRIS/mgr/
3016834 irisstop 0 0 ./irisdb
3016834 irisstop 0 0 ../bin
3016834 irisdb -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v3/
3016834 irisdb -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v2/
3016834 irisdb 0 0 /usr/irissys/bin/
3016834 irisdb 0 0 /data/IRIS/mgr/
3016834 irisdb 0 0 /data/IRIS/mgr/
3016835 irisstop 0 0 ./irisrecov
3016835 irisstop 0 0 ../bin
3016835 irisrecov -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v3/
3016835 irisrecov -1 2 /usr/irissys/bin/glibc-hwcaps/x86-64-v2/
3016835 irisrecov 0 0 /usr/irissys/bin/
3016835 irisrecov 0 0 /home/irisowner
3016835 irisrecov 0 0 .
3016835 irisrecov 0 0 iris.cpf
3016841 irisrecov 0 0 /usr/bin/cut
3016841 irisrecov 0 0 /usr/bin/tr
3016841 irisrecov 0 0 /usr/bin/sed
3017761 requirements_ch 0 0 /home/irisowner
3017761 requirements_ch -1 2 /usr/irissys/bin/requirements.isc
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb 0 0 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf_5275
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf_5275
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf_5275
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /usr/lib64/libcrypto.so.1.1
3017691 irisdb -1 2 /usr/lib64/libcrypto.so.3
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/iris.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf_20240908
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/_LastGood_.cpf_5275
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/_LastGood_.cpf_5275
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb -1 2 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /etc/localtime
3017691 irisdb 0 0 /data/IRIS/_LastGood_.cpf
3017691 irisdb 0 0 /data/IRIS/mgr/irisaudit/
3017691 irisdb 0 0 /data/IRIS/mgr/irisaudit/
3017691 irisdb 0 0 /data/IRIS/mgr/irisaudit/
3017691 irisdb 0 0 /data/IRIS/mgr/irisaudit/
3017691 irisdb 0 0 /data/IRIS/mgr/irisaudit/
3017691 irisdb 0 0 /data/IRIS/mgr/irisaudit/
3017756 irisdb -1 2 /data/IRIS/mgr/journal/20240908.002
3017756 irisdb -1 2 /data/IRIS/mgr/journal/20240908.002
3017756 irisdb 0 0 /data/IRIS/mgr/journal/20240908.002z
3017756 irisdb -1 2 /data/IRIS/mgr/journal/20240908.002
3017756 irisdb 0 0 /data/IRIS/mgr/journal/20240908.002z
3017756 irisdb -1 2 /data/IRIS/mgr/journal/20240908.001
Flamegraphs
One of the coolest things I stumbled upon in the eBPF tooling was Brendan Gregg's implementation of flamegraphs on top of BPF output to understand performance and stack traces. Given the following perf recording during a start/stop of IRIS:
sudo perf record -F 99 -a -g -- sleep 60
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 3.701 MB perf.data (15013 samples) ]
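The two helper scripts used below live in Brendan Gregg's FlameGraph repo, if you don't already have them checked out:
git clone https://github.com/brendangregg/FlameGraph.git
cd FlameGraph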
Generate the following flame graph with the below:
sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > /tmp/gar.thing
./flamegraph.pl /tmp/gar.thing > flamegraph.svg
I gave it the college try uploading the svg, but it did not work out with this editor, and for some reason was unable to attach it. Understand though it is interactive and clickable to drill down into stack traces, outside of just looking cool.
The function on the bottom is the function on-CPU. The higher up the y-axis, the further nested the function.
The width of each function on the graph represents the amount of time that function took to execute as a percentage of the total time of its parent function.
Finding functions that are both high on the y-axis (deeply nested) and wide on the x-axis (time-intensive) is a great way to narrow down performance and optimization issues.
"high and wide" <--- 👀
red == user-level
orange == kernel
yellow == c++
green == JIT, java etc.
I really liked the explanation of flamegraph interpretation laid out here (credit for the above), where I derived a baseline understanding of how to read flamegraphs. It is especially powerful for those who are running Python in IRIS on productions with userland code and looking for optimization. Onward and upward, I hope this piqued your interest; now let's move on to the world of eBPF apps, where the pros have put together phenomenal solutions to put eBPF to work on fleets of systems safely and in a lightweight manner.
Article
sween · Sep 10, 2024
So if you are following from the previous post or dropping in now, let's segue to the world of eBPF applications and take a look at Parca, which builds on our brief investigation of performance bottlenecks using eBPF, but puts a killer app on top of your cluster to monitor all your IRIS workloads, continually, cluster wide!
Continous Profiling with Parca, IRIS Workloads Cluster Wide
Parca
Parca is named after the Program for Arctic Regional Climate Assessment (PARCA) and the practice of ice core profiling that has been done as part of it to study climate change. This open source eBPF project aims to reduce some of the carbon emissions produced by unnecessary resource usage in data centers; we can use it to get "more for less" out of resource consumption and optimize our cloud native workloads running IRIS.
Parca is a continuous profiling project. Continuous profiling is the act of taking profiles (such as CPU, memory, I/O, and more) of programs in a systematic way. Parca collects, stores, and makes profiles available to be queried over time, and thanks to its low overhead using eBPF it can do this without degrading the target workloads.
Where
If you thought monitoring a kernel that runs multiple linux kernel namespaces was cool on the last post, Parca manages to bring all of that together in one spot, with a single pane of glass across all nodes (kernels) in a cluster.
Parca has two main components:
Parca: The server that stores profiling data and allows it to be queried and analyzed over time.
Parca Agent: An eBPF-based whole-system profiler that runs on the nodes.
To hop right into "Parca applied", I configured Parca on my cluster with the following:
kubectl create namespace parca
kubectl apply -f https://github.com/parca-dev/parca/releases/download/v0.21.0/kubernetes-manifest.yaml
kubectl apply -f https://github.com/parca-dev/parca-agent/releases/download/v0.31.1/kubernetes-manifest.yaml
This results in a DaemonSet running the agent on all 10 nodes, with about 3-4 IRIS workloads scattered throughout the cluster.
Note: Parca runs standalone too, no k8s reqd!
Let's Profile
Now, I know I have a couple of workloads of interest on this cluster. One of them is a FHIR workload servicing a GET on the /metadata endpoint across 3 pods on an interval, for friends I am trying to impress at an eBPF party; the other is a straight up 2024.2 pod running the following as a JOB:
Class EBPF.ParcaIRISPythonProfiling Extends %RegisteredObject
{

/// Do ##class(EBPF.ParcaIRISPythonProfiling).Run()
ClassMethod Run()
{
    While 1 {
        HANG 10
        Do ..TerribleCode()
        Do ..WorserCode()
        Do ..OkCode()
        zn "%SYS"
        do ##class(%SYS.System).WriteToConsoleLog("Parca Demo Fired")
        zn "PARCA"
    }
}

ClassMethod TerribleCode() [ Language = python ]
{
import time
def terrible_code():
    time.sleep(30)
    print("TerribleCode Fired...")
terrible_code()
}

ClassMethod WorserCode() [ Language = python ]
{
import time
def worser_code():
    time.sleep(60)
    print("WorserCode Fired...")
worser_code()
}

ClassMethod OkCode() [ Language = python ]
{
import time
def ok_code():
    time.sleep(1)
    print("OkCode Fired....")
ok_code()
}

}
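To kick it off the way described above, a sketch (it assumes the class is already loaded and that a PARCA namespace exists, per the zn in the Run loop):
# start the profiling loop as a background job
echo 'Job ##class(EBPF.ParcaIRISPythonProfiling).Run() Halt' | iris session IRIS -U PARCA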
Now, I popped a MetalLB service in front of the Parca service and dove right into the console; let's take a peek at what we can observe in the two workloads.
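No MetalLB handy? A plain port-forward to the Parca service gets you the same console (service name and port as shipped in the upstream manifest, to the best of my knowledge):
# Parca's API/UI listens on 7070 in the default manifest
kubectl -n parca port-forward svc/parca 7070:7070
# then browse to http://localhost:7070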
Python Execution
So I didn't get what I wanted out of the results here, but I did get some hints on how IRIS is doing the whole Python integration thing.
In Parca, I constrained on the particular pod, summed it by the same thing and selected a sane timeframe:
And here was the resulting pprof:
I can see irisdb doing the Python execution, traces with ISCAgent, and on the right I can see basically IRIS init stuff in the container. Full transparency, I was expecting to see the Python methods, so I have to work on that, but I did learn that pythoninit.so is the star of the Python callout show.
FHIR Thinger
Now this one does show some traces from a kernel perspective relevant to a FHIR workload. On the left, you can see the apache threads for the web server standing up the api, and you can also see in the irisdb traces the unmarshalling of JSON.
All spawning from a thread by what is known as a `zu210fun` party!
Now, let's take a look at the same workload in Grafana, as Parca exports to observability:
Not earth shattering, I know, but the point is distributed profiling of an IRIS app with eBPF, in lightweight fashion, across an entire cluster... with the sole goal of never having to ask a customer for a pButtons report again!