How to determine/record outages

I have been tasked with setting up outage notifications to my group (for example, sending an email or automatically generating a problem incident when HealthShare/Ensemble experiences an outage), and determining outage statistics (for example, the time frame when HealthShare/Ensemble was unavailable).  I admit I am having a hard time even defining what should be considered an “outage”.  Currently, we are using HealthShare 2017.2.1 (Cache for UNIX (Red Hat Enterprise Linux for x86-64) 2017.2.1 (Build 801_3_18095U) Mon May 7 2018 14:40:07 EDT [HealthShare Modules:Core:15.031.8653 + Linkage Engine:15.03.8653]), primarily for HL7 TCP and file transfers only.  We have a primary/failover set up with two servers, plus a third server set up as DR.  

Do you have any recommendations or suggestions for how to define an outage, how to determine if an outage has occurred, and how to determine how long the outage lasted?  Any insight or direction is greatly appreciated. 


Answers

Hi Ruth,  

Thanks for your questions.  I agree that an outage is difficult to define, but I'll mention a few things you should keep an eye on to give you a starting place.  If you're not sure about a particular concept below, please take a look at the HealthShare/Ensemble Documentation for more information. 

1) The Red Hat Server: A HealthShare/Ensemble instance runs on top of a server, so it's important to set up server-level monitoring to report any outages.  This might be as simple as a secondary server that records network pings and notes when your HealthShare server doesn't respond, or it might be something more complex.  Since you're running on Red Hat, work with your Linux admin to get more information on this.
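To make the "secondary server" idea concrete, here is a minimal Python sketch of how probe results could be turned into outage time frames. It is only an illustration under assumptions: the function names is_reachable() and outage_windows() are made up for this example, a TCP connection attempt stands in for a ping, and the host and port you probe would be whatever your own HL7 listener or server uses.

```python
import socket
from datetime import datetime, timedelta


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: a TCP connect is used here in place of an ICMP ping.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples):
    """Collapse (timestamp, is_up) samples, sorted by time, into outage windows.

    Each window runs from the first failed probe to the next successful one.
    An outage still in progress at the last sample has an end time of None.
    """
    windows = []
    down_since = None
    for ts, is_up in samples:
        if not is_up and down_since is None:
            down_since = ts                    # outage begins
        elif is_up and down_since is not None:
            windows.append((down_since, ts))   # outage ends
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))     # still down
    return windows
```

A monitoring script on the secondary server could call is_reachable() on a schedule, append (datetime.now(), result) to a list or file, and run outage_windows() over the history to report how long each outage lasted. Note that the measured duration is only as precise as the probe interval.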

2) Your HealthShare/Ensemble Instance(s): You can check whether your instance is running from the Linux shell by running "ccontrol list" (run "ccontrol help" for additional information).  Once again, your Linux admin should be able to write an OS script that continuously checks whether HealthShare/Ensemble is running and sends an email if something is wrong.  There are also a number of screens in the Management Portal, including the 'System Dashboard' and 'Ensemble System Monitor', which list a value called "System Up Time" that shows how long the instance has been running.  Note also that the "cconsole.log" file contains phrases like "Recovery started at Wed Aug 15 07:03:04 2018" which denote when the instance starts.
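As a rough illustration of using cconsole.log for outage statistics, the Python sketch below pulls the timestamps out of lines containing the "Recovery started at ..." phrase quoted above. The function name recovery_start_times() is made up for this example, and it assumes the timestamp always has the same shape as the sample ("Wed Aug 15 07:03:04 2018") with English day and month names; check that against your own log before relying on it.

```python
import re
from datetime import datetime

# Matches the phrase quoted above, e.g.
# "Recovery started at Wed Aug 15 07:03:04 2018"
RECOVERY_RE = re.compile(
    r"Recovery started at (\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)


def recovery_start_times(lines):
    """Return a datetime for every 'Recovery started at ...' line found."""
    times = []
    for line in lines:
        match = RECOVERY_RE.search(line)
        if match:
            # Normalize whitespace in case single-digit days are space-padded.
            stamp = " ".join(match.group(1).split())
            times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times
```

You would pass it an open file, for example recovery_start_times(open(path_to_cconsole_log)), where the path depends on your installation. Each returned time marks an instance start, so pairing it with the last moment the instance was known to be up (for example, from your server-level monitoring) gives an approximate outage window.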

3) Your Ensemble Production(s) within an Instance:  You will have one or more Ensemble Namespaces on your instance, each of which can support at most one running Ensemble Production.  A running Ensemble Production has a status of "Running", so if you define an outage as your Ensemble Production being down, you can look for a status other than "Running".  There are a number of classmethods in the 'Ens.Director' class which will report this information [like GetProductionStatus() and IsProductionRunning()].

4) Your Individual Business Host(s): An Ensemble production will have a number of Business Services, Business Processes, and Business Operations which might be in an Error or other bad state.  Depending on how you define "outage", you might want to check on the status of these items using the "Ensemble Production Monitor" to confirm they're running properly.  

I hope this helps.  

Best Regards,

Sebastian Musielak