How to remove a job PID from a production

Problem:

A file-based business service uses a local path on a Linux machine that is actually a mounted CIFS share. The mount is "soft" and is set up not to cache data, among other things. There are times, however, when the remote system offering up the share (a Windows machine, I believe) gets bounced or otherwise hangs, and the business service in the Ensemble production just hangs with it.

Un-mounting the network share doesn't affect it, no process kill command affects it, and even a "kill -9" on the process from outside of Cache does nothing. So the "Update Production" button stays lit and the job never goes away, even after connectivity to the shared mount point is restored at the system level.

Question: the job is gone and won't come back, ever. Is there a way to "surgically remove" the PID from the list of jobs the production "thinks" are there? I don't care if the PID is zombied or otherwise not responding; I just want to make the production "forget" that the PID is part of the running production. That way I can start another instance of the business service until a bounce of the production or a reboot of the machine can be scheduled and performed.

Any advice is welcome! Thanks!

Answers

Hi Ryan,

You'll need to make sure the process is terminated at the OS level. Then you can use the following command to unregister the PID from Ensemble:

Do ##class(Ens.Job).UnRegister("<config name>",<PID>)

Restarting the production should also work, but you might need to force the production to shut down.
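
For the "terminated at the OS level" check, here is a rough sketch of how one might look at the PID from a Linux shell before unregistering it. The helper name pid_state is made up for illustration, and it assumes a Linux /proc filesystem. A state of D (uninterruptible sleep) means the process is blocked inside the kernel, where signals such as "kill -9" are not acted on until the blocking call returns. Whether that is what is happening with the CIFS mount here is only a guess.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ensemble or Cache): print "gone" if the
# PID no longer exists, otherwise print its one-letter Linux process state.
# Common states: R running, S sleeping, D uninterruptible sleep, Z zombie.
pid_state() {
    pid="$1"
    # No /proc entry means the kernel no longer knows about this PID.
    if [ ! -d "/proc/$pid" ]; then
        echo "gone"
        return 0
    fi
    # The status file has a line such as "State:  D (disk sleep)".
    state="$(awk '/^State:/ {print $2; exit}' "/proc/$pid/status" 2>/dev/null)"
    # If the process exited between the two checks, treat it as gone.
    echo "${state:-gone}"
}
```

Running pid_state with the PID of the stuck job prints either "gone" or the state letter. Only the "gone" case clearly meets the condition above for calling UnRegister; a Z (zombie) has exited but has not yet been reaped by its parent, and a D process is still alive as far as the OS is concerned.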

HTH,

Wilber

Comments

When you write "no process kill command affects it", are you referring to commands issued from within Ensemble Portal or perhaps command prompt?

Also, what is the $ZV string of your Ensemble?

Both: inside and outside (using "kill -9" in Linux). So just going in and removing whatever temp globals/tables tell Ensemble the job is there should be safe to do.

$ZV is "Cache for UNIX (Red Hat Enterprise Linux for x86-64) 2014.1.2 (Build 753U) Tue Jul 22 2014 11:25:14 EDT"

Thanks!

It sounds to me like something could be wrong with the parent process of the one you are trying to kill. If you haven't already, I strongly suggest opening a WRC issue for this. It would be worth trying to get a trace (ltrace or strace) of both this process and its parent to see what is going on at the OS level.
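
As a starting point for collecting those traces, the sketch below looks up the parent PID and prints the strace commands for both processes rather than running them. The function names and the /tmp output paths are placeholders chosen for illustration, and it assumes a Linux /proc filesystem. If the job is blocked inside the kernel, attaching strace may show very little, so treat this as something to try, not a guaranteed diagnosis.

```shell
#!/bin/sh
# Sketch only: find the parent of a stuck job so both can be traced.

# Print the parent PID of the given PID, read from /proc (Linux only).
parent_pid() {
    awk '/^PPid:/ {print $2; exit}' "/proc/$1/status"
}

# Print (do not run) strace commands for a PID and its parent.
# -f follows child processes, -tt adds timestamps, -p attaches to a PID,
# -o writes the trace to a file. The file names are arbitrary.
trace_commands() {
    pid="$1"
    ppid="$(parent_pid "$pid")"
    echo "strace -f -tt -p $pid -o /tmp/strace.$pid.out"
    echo "strace -f -tt -p $ppid -o /tmp/strace.$ppid.out"
}
```

Calling trace_commands with the PID of the hung job prints two command lines that can be reviewed and then run as root, each in its own terminal. The resulting files could be attached to the WRC issue.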