Memory leak and IRIS container freeze... or just my curvy hands

Question

Rostislav Dublin · Apr 27, 2023

#DevOps #Kubernetes #Performance #InterSystems IRIS

I deployed the IRIS container on my Mac M1 Docker Desktop Kubernetes cluster:

image: containers.intersystems.com/intersystems/iris-community-arm64:2023.1.0.229.0

I limited the container to 1.5 GiB of memory:

resources.limits.memory: "1536Mi"

In the merge.cpf file I constrained the memory-related IRIS settings:

[config]
globals=0,0,800,0,0,0
gmheap=200000
bbsiz=100000
routines=100

Now I load-test the container by repeatedly installing and uninstalling the %ZPM package:

Install ZPM (the zpm-installer.routine file and how it is executed):

set r=##class(%Net.HttpRequest).%New()
set r.Server="pm.community.intersystems.com"
set r.SSLConfiguration="ISC.FeatureTracker.SSL.Config"
set status = r.Get("/packages/zpm/latest/installer")
write "GET ZPM installer package status:", status
set status1 = $system.OBJ.LoadStream(r.HttpResponse.Data,"c")
write "Install ZPM status:", status1

> cat /tmp/zpm-installer.routine | iris session iris -U %SYS

Uninstall ZPM:

> echo 'do $system.OBJ.DeletePackage("%ZPM")' | iris session iris -U %SYS
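
To repeat the test the same way each time, the two commands above could be wrapped in a small helper. This is only a sketch: run_cycle, INSTALLER and IRIS_SESSION are hypothetical names introduced here, and the session command is overridable so the script can be dry-run outside the container.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands above; the variable and
# function names are illustrative and not part of IRIS or ZPM.
INSTALLER="${INSTALLER:-/tmp/zpm-installer.routine}"
# Command that receives ObjectScript on stdin; override it for a dry run.
IRIS_SESSION="${IRIS_SESSION:-iris session iris -U %SYS}"

# run_cycle N: install ZPM from the installer routine, then delete %ZPM again.
run_cycle() {
    printf '%s\n' "--- cycle $1: install ---"
    $IRIS_SESSION < "$INSTALLER"
    printf '%s\n' "--- cycle $1: uninstall ---"
    echo 'do $system.OBJ.DeletePackage("%ZPM")' | $IRIS_SESSION
}
```

Calling run_cycle 1, run_cycle 2 and so on inside the container repeats the install/uninstall sequence one cycle at a time, so the pod's memory can be checked between cycles.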

During such load testing, I monitor the IRIS pod's memory consumption using the Kubernetes Dashboard and IRIS logs, and this is what I see:

The freshly deployed pod consumes 700+ MB of memory. And in the IRIS log, I see: "...Allocated 1235MB shared memory... 800MB global buffers, 100MB routine buffers, 64MB journal buffers, 58MB buffer descriptors, 195MB heap, 5MB ECP, 9MB miscellaneous". So, I am sure my merge.cpf tweaks were merged into the active iris.cpf, and are in use.
After the first test (ZPM install/uninstall) the pod's memory consumption becomes 1.17 GB
After the second test it becomes 1.32 GB
After the third test it becomes 1.44 GB
During the fourth test the pod freezes...

So the questions are:

How can the pod consume more than 1.2 GB when I used every means I know of to constrain it via the merge.cpf file?
Does this indicate some kind of memory leak in an IRIS process?
Am I wrong to assume that it is safe to bulk-compile many classes/routines in IRIS? Should IRIS survive this scenario, or does it have limitations or bottlenecks here?
What should my remediation steps be in such outages? Should I use k8s liveness probes and simply kill and redeploy frozen pods? Is that a normal and recommended approach for IRIS?

Product version: IRIS 2023.1

$ZV: IRIS for UNIX (Ubuntu Server LTS for ARM64 Containers) 2023.1 (Build 229U) Fri Apr 14 2023 17:20:01 EDT

Discussion (6)

score 0 · Answer 1 · 2023-04-28T02:52:19-04:00

I think you forgot about per-process memory, which, I would say, is no longer limited at all by default. So your "leaks" may be happening in the processes. ZPM is quite a big package, and its installation uses multiple processes.

So dedicating most of the memory to buffers does not work for IRIS, because you also need room for the processes. If you go to production, you have to keep in mind how many active users you expect and estimate how much memory they will consume.
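
The point about leaving room for processes can be made concrete with the figures from the question. The following is a rough sketch, not a diagnosis: it assumes bbsiz is expressed in KB, treats the "MB" in the IRIS log as MiB, and uses illustrative variable names.

```shell
#!/bin/sh
# Figures copied from the question; variable names are illustrative only.
limit_mib=1536      # resources.limits.memory: "1536Mi"
shared_mib=1235     # "Allocated 1235MB shared memory" in the IRIS log
bbsiz_kb=100000     # per-process cap from merge.cpf (assumed to be in KB)

# Memory left for process-private use once shared memory is fully in use.
headroom_mib=$((limit_mib - shared_mib))
# Per-process cap in MiB (integer division, rounds down).
per_process_mib=$((bbsiz_kb / 1024))
# How many processes could reach their cap before hitting the pod limit.
max_full_processes=$((headroom_mib / per_process_mib))

echo "headroom=${headroom_mib}MiB per_process=${per_process_mib}MiB full_processes=${max_full_processes}"
```

By this estimate only about 300 MiB remains for all process-private memory, so roughly three processes at their bbsiz cap would reach the container limit. One possible reading of the gradual growth (an assumption, not something verified here) is that shared memory such as the global buffers is only counted against the pod as its pages are actually touched, which would also explain why a fresh pod shows 700+ MB rather than 1235 MB.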

score 0 · Answer 2 · 2023-05-01T13:52:24-04:00

Hi @Dmitry Maslennikov, thanks for your reply.

Actually, I have not forgotten about per-process memory:
bbsiz=100000 (100 MB) is set, as shown in my question.

Perhaps you mean that I underestimate the number of processes spawned to perform the ZPM install/uninstall.

Possibly, but I doubt it. That could be the case if, say, there were 10 of them and each immediately occupied its full 100 MB, so that together they took 1 GB. But then they would cause the failure on the very first ZPM install attempt. In reality they do not: the system survives at least three install/uninstall cycles, with memory consumption growing gradually.

Does this mean that the system fails to clean up (garbage collection, housekeeping, whatever we call it) after processes that have completed their work and are left dangling? Or should we conclude that each new ZPM install/uninstall cycle reuses the processes created for the first cycle but fails to reclaim their memory? And does that mean... again, a memory leak? 🙄

score 0 · Answer 3 · 2023-05-02T02:11:34-04:00

Well, OK, yeah, I did not notice it. Still, per-process usage is an important part. And even if the leak is real, it may be happening in ZPM itself. The test scenario does not look like much of a proof: installing and uninstalling ZPM multiple times does not look like a real-world scenario.

Have a look at what this query shows; the result is in KB:

echo 'select sum(memorypeak) memorypeak,sum(MemoryUsed) memoryused from %SYS.ProcessQuery' | iris sql iris
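
To see whether process memory actually grows from cycle to cycle, the query could be sampled before and after each install/uninstall run. A minimal sketch, assuming it runs inside the container; sample and SQL_CMD are hypothetical names, and only the query and the iris sql iris command come from this thread.

```shell
#!/bin/sh
# Query taken verbatim from the reply above; results are reported in KB.
QUERY='select sum(memorypeak) memorypeak,sum(MemoryUsed) memoryused from %SYS.ProcessQuery'

# Command that receives the query on stdin. It is overridable so the helper
# can be dry-run outside the container (for example SQL_CMD=cat).
SQL_CMD="${SQL_CMD:-iris sql iris}"

# sample LABEL: print a labelled header, then the output of the query.
sample() {
    printf '== %s ==\n' "$1"
    echo "$QUERY" | $SQL_CMD
}
```

For example, run sample "before cycle 1", perform one install/uninstall, then run sample "after cycle 1" and compare the two sums. If they stay flat while the pod's memory keeps climbing, the growth is probably not in process-private memory.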

score 0 · Answer 4 · 2023-05-05T16:39:42-04:00

Hi @Dmitry Maslennikov,
you are certainly right. It was not a real test scenario; it was just a failure that happened by itself. So I decided to ask the community whether this kind of issue, or rather this system behavior, is normal.

Thank you for the query. I will try it a little later and share my results with you.

score 0 · Answer 5 · 2023-05-02T02:20:20-04:00

Maybe the ZPM uninstall is not ideal. At the moment ZPM is just an open-source app that can be installed into IRIS.

And maybe it doesn't uninstall itself cleanly.

What is the business goal of the exercise? To test ZPM or to test IRIS on leakages?

score 0 · Answer 6 · 2023-05-05T16:34:28-04:00

Hi @Evgeny Shvarov, thank you for the response.

Let me answer this: "What is the business goal of the exercise? To test ZPM or to test IRIS on leakages?"

Well, neither :)
I just ran into it unintentionally while developing and deploying
my version of the ZPM download+install script:

So I was busy deploying and running it, then improving it and running it again, and before each run I uninstalled the previously installed ZPM... That way I soon broke it, and at first I thought something was wrong with my k8s cluster. It seemed interesting, so I decided to repeat the experiment: I recreated my IRIS container and repeated the scenario from the beginning. The result was the same each time. That is it.