Ooops "Daemon WRTDMN died. Freezing system". But why???

Question

Arto Alatalo · May 5, 2020

(We are in contact with IS support about this problem, but I would like to ask the Community too; perhaps somebody has experienced it in the past.)

Hello Community,

We need your help with Cache 2017.2 freezing on a Linux machine.

Since we moved our primary production Cache from Windows to Linux at the beginning of this year, the system has frozen twice. Yesterday, for no apparent reason, Cache stopped responding, with the log shown below.

Questions:

Any idea what could cause WRTDMN to die and freeze the system?
What actions does it make sense to take so that we have enough information the next time the system freezes?

Thank you!

System:

Cache for UNIX (Red Hat Enterprise Linux for x86-64) 2017.2.2
CentOS Linux 7

Console log:

05/04/20-12:14:45:624 (182049) 3 Daemon WRTDMN (pid 182058) died. Freezing system
05/04/20-12:15:04:574 (182049) 3 Daemon SLAVWD (pid 182064) died. Freezing system
05/04/20-12:17:45:655 (182071) 2 System Process 'WRTDMN' terminated abnormally (pid 182058)
05/04/20-12:17:45:668 (182071) 0 cleandeadjob: skipping daemon job #2
05/04/20-12:17:45:668 (182071) 2 System Process 'SWRTDMN' terminated abnormally (pid 182064)
05/04/20-12:17:45:668 (182071) 0 cleandeadjob: skipping daemon job #9
05/04/20-12:19:46:403 (182049) 3 Daemon SLAVWD (pid 182061) died. Freezing system
05/04/20-12:19:46:408 (182049) 2 CP: Pausing users because the Write Daemon has not shown
signs of activity for 301 seconds. Users will resume if Write Daemon completes a
pass or writes to disk (wdpass=106040).
05/04/20-12:27:46:491 (182071) 2 System Process 'WRTDMN' terminated abnormally (pid 182058)
05/04/20-12:27:46:497 (182071) 0 cleandeadjob: skipping daemon job #2
05/04/20-12:27:46:497 (182071) 2 System Process 'SWRTDMN' terminated abnormally (pid 182061)
05/04/20-12:27:46:497 (182071) 0 cleandeadjob: skipping daemon job #8
05/04/20-12:27:46:497 (182071) 2 System Process 'SWRTDMN' terminated abnormally (pid 182064)
05/04/20-12:27:46:497 (182071) 0 cleandeadjob: skipping daemon job #9

Discussion (5)

Nick Jones · May 5, 2020

Hi,

We experienced this problem in the past when our system ran out of memory. The operating system automatically terminated the heaviest process to free up sufficient resources. The process it selected was the write daemon.

Check /var/log/messages for messages such as

kernel: Out of memory: Kill process 17959 (cache) score 325 or sacrifice child

kernel: Killed process 17959 (cache) total-vm:247757228kB, anon-rss:128kB, file-rss:32kB, shmem-rss:142117896kB

Cache(ENSEMBLE)[17868]: Daemon WRTDMN (pid 17959) died. Freezing system

If you find these, read up about the operating system's implementation of "oom-killer".
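
If it helps, here is a rough Python sketch of that check. It looks for lines like the "Killed process" message quoted above. The function name `find_oom_kills` and the regular expression are my own assumptions, modelled on the sample lines in this post; the kernel's wording differs between versions, so adjust the pattern to what your /var/log/messages actually contains.

```python
import re

# Pattern modelled on the sample "Killed process" line quoted above.
# Assumption: your kernel logs the same wording; other versions may differ.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) for every 'Killed process' line found."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Example use (reading /var/log/messages usually requires root):
#
#   with open("/var/log/messages", errors="replace") as log:
#       for pid, name in find_oom_kills(log):
#           print(f"oom-killer killed pid {pid} ({name})")
```

Compare the pids it reports with the pid in the Cache "Daemon WRTDMN (pid ...) died" message to confirm that the daemon was the victim.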

score 0 · Answer 1 · 2020-05-05T07:26:57-04:00

Thank you, Nick. I've checked the log and found exactly the same cause: oom-killer killed a Cache process. How did you solve the problem? Do you know why Cache uses more memory than the limit in the first place?

score 1 · Answer 2 · 2020-05-05T07:37:33-04:00

In our case it was not Cache that was misbehaving; we had a Tomcat application that consumed the memory.

We adjusted /proc/sys/vm/swappiness to 30 (and fixed the Tomcat application and Tomcat memory settings) and have not had a repeat of the problem.
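
For reference, a small sketch of how the current value can be checked from Python. `parse_swappiness` and `read_swappiness` are hypothetical helper names; the only thing taken from our setup is the path /proc/sys/vm/swappiness.

```python
def parse_swappiness(text):
    """Parse the contents of /proc/sys/vm/swappiness into an integer."""
    return int(text.strip())


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Read the current swappiness value from the running kernel."""
    with open(path) as handle:
        return parse_swappiness(handle.read())


# Example use on a Linux host:
#
#   print(read_swappiness())
```

Note that a value written directly to /proc/sys/vm/swappiness is lost on reboot; to keep it, add `vm.swappiness = 30` to /etc/sysctl.conf.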

score 2 · Answer 3 · 2020-05-05T12:18:51-04:00

Your real problem here is the memory usage on the system. It may or may not be Caché using up all the memory, and that's where your investigation should focus, but I wanted to give a technical explanation here for why the write daemon specifically is getting killed.

Most of the memory used by Caché is allocated at instance startup, and is a 'shared memory segment', which you can see with 'ipcs'. Other (Caché and non-Caché) processes allocate memory for individual processing, but the vast majority of memory used by Caché on a running system is this shared memory segment. The largest chunk of that shared memory segment is almost always global buffers (where database blocks are stored for access by Caché processes). Anytime a database block is updated, it is updated in global buffers, and the write daemon will need to access that block in memory and write it to disk. Therefore, the write daemon ends up touching a huge amount of memory on a system, although almost all of that memory is shared. The Linux out of memory killer doesn't prioritize processes using individual memory vs. accessing shared memory segments, so the write daemon is almost always its first target (as it has accessed the most memory), even though killing that process doesn't actually free up much memory for the system (since that shared memory segment doesn't get freed until all other Caché processes detach from it).
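
To illustrate with the numbers from Nick's sample above, here is a minimal Python sketch that splits the resident memory on a "Killed process" line into private and shared parts. The helper names `rss_breakdown` and `shared_fraction` are made up for this example, and it assumes the line carries the same anon-rss, file-rss and shmem-rss fields as that sample (not every kernel version prints all three).

```python
import re

# Matches the per-category resident sizes on a "Killed process" line.
RSS_RE = re.compile(r"(anon-rss|file-rss|shmem-rss):(\d+)kB")


def rss_breakdown(killed_line):
    """Return the rss fields (in kB) found on the line, keyed by field name."""
    return {name: int(kb) for name, kb in RSS_RE.findall(killed_line)}


def shared_fraction(killed_line):
    """Return shmem-rss as a fraction of all resident memory on the line."""
    parts = rss_breakdown(killed_line)
    total = sum(parts.values())
    if total == 0:
        raise ValueError("no rss fields found on this line")
    return parts.get("shmem-rss", 0) / total


line = ("kernel: Killed process 17959 (cache) total-vm:247757228kB, "
        "anon-rss:128kB, file-rss:32kB, shmem-rss:142117896kB")
print(rss_breakdown(line))
print(f"shared fraction: {shared_fraction(line):.2%}")
```

For that sample line, practically all of the resident memory charged to the killed process is shared memory, with only 160 kB of anon-rss and file-rss combined, which is why killing it frees next to nothing.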

score 0 · Answer 4 · 2020-05-05T14:57:22-04:00

Thank you for this interesting info. But first of all, I'd like to know why the system always worked fine on Windows (with the very same load) but fails on Linux.