Question Luis Gallardo · Mar 24

System state hung on IRIS running on Kubernetes

Hi!
We are deploying the IRIS image in a Kubernetes environment and the cluster state is "Hung". Looking at the alerts endpoint, we get two alerts:

[
  {
     "time":"2026-03-24T13:45:44.548Z",
     "severity":"2",
     "message":"System appears to have failed over from node a69a9f137593"
  },
  {
     "time":"2026-03-24T13:46:30.274Z",
     "severity":"2",
     "message":"Error: <PROTECT>KillAlive+1^%SYS.CDIRECT in SERVERS"
  }
]
Any idea where those are coming from and how to address them? I have noticed that ^%SYSMONMGR is stopped; if I start it, the state changes to OK and everything is fine.
Log section 1
03/24/26-13:45:44:446 (2287) 0 [Utility.Event] Log Monitor Started
03/24/26-13:45:44:548 (2286) 2 [Utility.Event] System appears to have 
03/24/26-13:45:44:562 (2288) 0 [Utility.Event] Clean Daemon Started
Log section 2
03/24/26-13:46:30:271 (2286) 0 [Utility.Event] Processing OS Package Requirements section
03/24/26-13:46:30:274 (2298) 2 [Utility.Event] Error: <PROTECT>
03/24/26-13:46:30:544 (2286) 0 [Generic.Event] Auditing to /infinity/durable/mgr/irisaudit/
Thanks!

Comments

DC AI Bot · Mar 24

A "Hung" system state in InterSystems IRIS on Kubernetes is usually related to the monitoring services themselves or to resource-configuration errors.

  1. Explanation of Error Messages:

    • The error <PROTECT>KillAlive+1^%SYS.CDIRECT in SERVERS is a protection (insufficient-privilege) error raised from the %SYS.CDIRECT routine; it usually indicates monitoring code was blocked from accessing a protected resource, which is consistent with the System Monitor not running properly.
  2. Stopped System Monitoring Manager (^%SYSMONMGR):

    • Activating ^%SYSMONMGR resolves the “Hung” state and returns the system to an operational mode. ^%SYSMONMGR monitors resources and takes actions to ensure stability.
  3. Possible Causes for Hung State:

    • Deadlocks in IRIS often result from exhausted shared memory or a full disk, leaving the system stuck. An undersized lock table (locksiz) or running out of disk space for journals are common contributing factors.
    • An oversized memory configuration (e.g., bbsiz, gmheap) can also cause unexpected memory-usage spikes that freeze the system.
  4. Mitigation and Recommendations:

    • Keep ^%SYSMONMGR active so resource monitoring stays in place, and use appropriate memory-allocation settings (gmheap, buffer sizes).
    • Regularly review and adjust shared-memory-related parameters to prevent such issues.
    • Use Kubernetes health checks (e.g., liveness probes) to restart unresponsive pods automatically.
    • For further troubleshooting, analyze messages.log for system events and consider contacting InterSystems support.
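Since starting the System Monitor is what resolves the "Hung" state here, it helps to know how to do it from inside the container. A minimal sketch, assuming the pod is named iris-0 and the instance is named IRIS (both hypothetical for this example); ^%SYSMONMGR is a menu-driven utility, so the options are chosen interactively:

```shell
# Open an IRIS terminal session in the pod (pod/instance names are assumptions)
kubectl exec -it iris-0 -- iris session IRIS -U %SYS

# Inside the session, start the System Monitor from its menu:
#   Do ^%SYSMONMGR
#   1) Start/Stop System Monitor
#   1) Start System Monitor
```

Re-entering the same menu afterwards should report the monitor as running.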

If further clarification or investigation is required, reaching out to the Developer Community for more advanced help is advised.
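The liveness-probe suggestion above can be sketched directly in the pod spec. A minimal fragment, assuming the default internal web server port 52773 and the REST monitoring endpoint /api/monitor/metrics (the same API family as the /api/monitor/alerts endpoint the poster queried); both values may differ in your deployment:

```yaml
# Hypothetical container spec fragment; port and path are assumptions
livenessProbe:
  httpGet:
    path: /api/monitor/metrics   # REST monitoring endpoint
    port: 52773                  # default internal web server port
  initialDelaySeconds: 60        # give IRIS time to start before probing
  periodSeconds: 30
  failureThreshold: 3            # restart the pod after repeated failures
```

Note that this restarts the pod when it becomes unresponsive; it does not address the underlying cause, so the messages.log analysis above still applies.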
