Jun 6, 2023

Troubleshooting an Ensemble instance

One of our development Ensemble instances is misbehaving. We suspect we'll just need to reinstall it - which would be a hassle. Before we do, we wanted to check we weren't missing some kind of easy fix. The symptoms we are seeing:

  1. A Cache process is running at 100% CPU on one core of the server - it's the TASKMGR process
  2. That Cache process resumes at 100% on Ensemble restart, and indeed after server reboot
  3. There might be evidence of corruption in the task schedule: there's a "next scheduled date" of 1840-12-31 00:05...! (yes, we know that's $HOROLOG zero), and a Description message that looks like it could be a badly copied or misaligned pointer from a previous description. See screengrab, highlighted.
  4. Opening any DTLs in the management portal, in Studio, or via VS Code, across all namespaces, results in "ERROR in page definition: ERROR #6301: SAX XML Parser Error: XML or TEXT declaration must start at line 1, column 1 while processing Schema at line 2 offset 7"
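
For anyone unsure why 1840-12-31 points at a zeroed value: $HOROLOG is a "days,seconds" pair in which day 0 is 31 December 1840, so a stored value of "0,300" would display as 1840-12-31 00:05. A minimal sketch of that arithmetic in Python follows; the function names are made up for illustration, and "0,300" is our assumption about what is stored, not something we have read from the global.

```python
from datetime import datetime, timedelta

# Day 0 of $HOROLOG is 31 December 1840; the second field is seconds since midnight.
HOROLOG_EPOCH = datetime(1840, 12, 31)


def horolog_to_datetime(horolog: str) -> datetime:
    """Convert a 'days,seconds' $HOROLOG string to a datetime (hypothetical helper)."""
    days_text, _, seconds_text = horolog.partition(",")
    # A bare day count such as "0" is treated as midnight on that day.
    return HOROLOG_EPOCH + timedelta(days=int(days_text), seconds=int(seconds_text or 0))


def datetime_to_horolog(moment: datetime) -> str:
    """Convert a datetime back to a 'days,seconds' string (hypothetical helper)."""
    delta = moment - HOROLOG_EPOCH
    return f"{delta.days},{delta.seconds}"


if __name__ == "__main__":
    # Assumed stored value: zero days, 300 seconds.
    print(horolog_to_datetime("0,300"))
```

So a "next scheduled date" of 1840-12-31 is consistent with the day count being zero or empty rather than a real schedule.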

We don't know whether these symptoms are related or not. Nor can we identify anything in what we've done on that server recently that might correlate to problems starting.

Before we scrub and reinstall the Ensemble instance, anything you might try, places you might look?


Product version: Ensemble 2018.1
$ZV: Cache for Windows (x86-64) 2018.1 (Build 184U) Wed Sep 19 2018 09:09:22 EDT

I would check the following:

1. Query the %SYS.Task class with SQL, and also run an integrity check to see whether there are any errors in the globals that hold the task manager data.

2. If the "corrupted"/"copied" task (with $h=0) is the one that consumes 100% of CPU, I would try to "re-schedule" it to see whether the new "next date" is set to something else. If not, delete it (you don't need to re-create it; it looks like 1001 is a copy of 1000).

3. Monitor the process that is using 100% CPU (via SMP or JOBEXAM) to try to understand which commands it is "stuck" on.

Thanks Alexander. Terminal Task Manager:

  • can delete other tasks, but
  • trying to delete the "corrupted" one gets: "ERROR #5803: Failed to acquire exclusive lock on instance of '%SYS.Task'" (same as when using Management Portal)

Re figuring out where system tasks are stored via journalling: I understand the principle of what you are saying, but we reckon the effort of doing that would be at least as great as scrubbing and reinstalling. With a reinstall we lose some config (we've got it documented, but the developer who did it originally has left), but no important running code.