Hi Ruth,
Thanks for your questions. I agree that an outage is difficult to define, but I'll mention a few things you should keep an eye on to give you a starting place. If you're not sure about a particular concept below, please take a look at the HealthShare/Ensemble Documentation for more information.
1) The Red Hat Server: A HealthShare/Ensemble instance runs on top of a server, so it's important to set up server-level monitoring to report any outages. This could be as simple as a secondary server that sends network pings and records when your HealthShare server doesn't respond, or it could be something more complex. Since you're running on Red Hat, work with your Linux admin to get more information on this.
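As a rough illustration of the "secondary server" idea, here is a minimal Python sketch that could run on a separate monitoring machine. It is only a starting point, not a recommended monitoring product. Note that it swaps the network ping for a plain TCP connection attempt, because that needs no special privileges. The host name and port in the usage comment are hypothetical placeholders, so substitute whatever your Linux admin confirms is reachable on your server.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (hypothetical host and port, adjust for your environment):
#   if not is_reachable("healthshare.example.org", 22):
#       print("HealthShare server did not respond")
```

A scheduler such as cron could call this every minute and write a line to a log whenever it returns False, which would give you a simple record of when the server was unreachable.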
2) Your HealthShare/Ensemble Instance(s): You can check whether your instance is running from the Linux shell by running "ccontrol list" (run "ccontrol help" for additional information). Once again, your Linux admin should be able to write an OS script that continuously checks whether HealthShare/Ensemble is running and sends an email if something is wrong. There are also a number of screens in the Management Portal, including the 'System Dashboard' and 'Ensemble System Monitor', that list a value called "System Up Time", which shows how long the instance has been running. Note also that the "cconsole.log" file contains phrases like "Recovery started at Wed Aug 15 07:03:04 2018", which denote when the instance starts.
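If you want to build a history of instance starts from "cconsole.log", something along these lines might work. This is a minimal Python sketch: it assumes the "Recovery started at ..." phrase appears exactly in the form quoted above and that the day and month names are in English. The rest of the log line layout is not assumed, and the function name is just illustrative.

```python
import re
from datetime import datetime

# Matches the phrase quoted above, e.g. "Recovery started at Wed Aug 15 07:03:04 2018".
STARTUP_PATTERN = re.compile(
    r"Recovery started at (\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)


def find_startup_times(log_text):
    """Return a list of datetimes, one per 'Recovery started at' phrase in log_text."""
    times = []
    for match in STARTUP_PATTERN.finditer(log_text):
        # Collapse repeated spaces (single-digit days may be space-padded).
        stamp = " ".join(match.group(1).split())
        times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times
```

You would read the contents of cconsole.log into a string and pass it to find_startup_times(). Each returned timestamp marks an instance start, so the gap before it is a candidate outage window to investigate.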
3) Your Ensemble Production(s) within an Instance: You will have one or more Ensemble Namespaces on your instance, each of which can support exactly one running Ensemble Production. A running Ensemble Production has a status of "Running", so if you define an outage as your Ensemble Production being down, you can look for a status other than "Running". There are a number of classmethods in the 'Ens.Director' class which will report this information [like GetProductionStatus() and IsProductionRunning()].
4) Your Individual Business Host(s): An Ensemble production will have a number of Business Services, Business Processes, and Business Operations which might be in an Error or other bad state. Depending on how you define "outage", you might want to check on the status of these items using the "Ensemble Production Monitor" to confirm they're running properly.
I hope this helps.
Best Regards,
Sebastian Musielak
Hello Hans,
To expand on Fabian's comments, the Write Daemon goes into Phase 8 just before the process issues a disk write command to the Linux operating system. The Write Daemon remains in phase 8 for the duration of the disk write and only moves on after the operating system returns a 'success' for the disk write.
In your case, it seems that your Backup mirror member remains in phase 8 for a few more seconds than expected when compared to the primary member. The most likely cause is that writes to disk are taking longer on your backup than on your primary.
You can confirm this by looking at the pButtons report. The report includes output from Linux OS tools such as 'sar -d', 'vmstat', and 'iostat', all of which should report IO statistics such as disk write time. If your disk write time is high, then the WD will be in phase 8 for a longer time.
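To make "disk write time" a bit more concrete, here is a minimal Python sketch of how an average write time can be derived from two samples of a Linux /proc/diskstats line for the same device. pButtons already collects equivalent numbers for you, so this is purely illustrative. The function name and the sample values are made up, and it assumes the standard /proc/diskstats layout where the eighth whitespace-separated column is writes completed and the eleventh is milliseconds spent writing.

```python
def average_write_ms(sample_before, sample_after):
    """Average milliseconds per completed write between two /proc/diskstats lines."""
    before = sample_before.split()
    after = sample_after.split()
    if before[2] != after[2]:
        raise ValueError("samples are for different devices")
    # Index 7: writes completed; index 10: milliseconds spent writing.
    writes = int(after[7]) - int(before[7])
    write_ms = int(after[10]) - int(before[10])
    if writes <= 0:
        return 0.0  # no writes completed in the interval
    return write_ms / writes


# Made-up samples: 100 writes completed and 2000 ms spent writing in the interval.
before = "8 0 sda 100 0 800 50 200 0 1600 400 0 300 450"
after = "8 0 sda 100 0 800 50 300 0 2400 2400 0 500 2450"
print(average_write_ms(before, after))  # 20.0
```

Comparing this kind of figure for the same time window on the backup and the primary should show whether the backup's disks are the slower of the two.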
I hope this helps.
Hello Miguel,
My name is Sebastian Musielak and I'm one of the Product Owners of HealthShare. I've also spent some time in HealthShare Support helping customers just like you go through the upgrade process.
First off, I'd like to say that I'm happy to hear that you are upgrading to the latest release of HealthShare Information Exchange, HS 2018.1. That's great news. That being said, I have a few questions and comments about your upgrade:
1) Have you been in touch with your Account Team at InterSystems to get their thoughts on the upgrade process? Your Account Manager and Sales Engineer are a great resource in getting started in this process.
2) What is your timeline for this upgrade? In my personal experience, HealthShare customers typically spend a few months testing their customizations on the new version before performing the actual upgrade in their LIVE environment. Please make sure to adjust your "Go Live" date accordingly. Don't forget that you also have the option of moving to the next release of HealthShare (2019.1) when it comes out.
3) In my experience with customers, the biggest hurdle they face in upgrading a HealthShare environment has been dealing with customizations. If you haven't already, I strongly recommend you take a thorough inventory of all of the customizations that you have made in your version and document why each customization is there (whether you're fixing a bug in your version or adding functionality that wasn't there in your version). Unfortunately, the HealthShare version you're running (HS Core:14.01 running on Ensemble 2015.2.1) is quite a bit older than the latest release version (HS 2018.1), so for each customization that you have made, you need to confirm whether it should still be included in HS 2018.1.
4) Regarding the notion of performing an "In-Place" upgrade vs. a "Migration" to a parallel set of servers, I can tell you that the InterSystems HealthShare Quality Development team has not tested the migration of code and data from HS Core 14.01 to HS 2018.1, so we cannot recommend that approach for upgrading your LIVE environment. What we do recommend, though, is creating a TEST environment which mimics your LIVE environment as closely as possible. With your parallel TEST environment running on the same version as your LIVE environment, you can perform upgrade tests which will allow you to see how well your customizations work in HS 2018.1. That will give you a good estimate of how long an in-place upgrade might take.
5) Finally, to address some of your Knowledge Gaps:
I hope this gives you a good place to start.
Good luck on the upgrade!
Best Regards,
Sebastian Musielak
Product Specialist – HealthShare
InterSystems Corporation