Hi Scott,

Have you looked at using the Ensemble Enterprise Monitor?  It provides a centralized "single pane of glass" dashboard-style display across multiple productions.  Details on using it can be found here in the Ensemble documentation.

Regards,
Mark B-

Hi Ashish,

We are actively working with Nutanix on a potential example reference architecture, but nothing is imminent at this time.  The challenge with HCI solutions, Nutanix being one of them, is that there is more to the solution than just the nodes themselves.  The network topology and switches play a very important role.

Additionally, performance with HCI solutions is good...until it isn't.  What I mean by that is performance can be good with HCI/SDDC solutions, however maintaining the expected performance during node failures and/or maintenance periods is the key.  Not all SSDs are created equal, so consideration of storage access performance in all situations - normal operations, failure conditions, and node rebuild/rebalancing - is important.  Data locality also plays a large role with HCI, and in some HCI solutions so does the working dataset size (i.e., a larger dataset with random access patterns to that data can have an adverse and unexpected impact on storage latency).

Here's a link to an article I authored regarding our current experiences and general recommendations with HCI and SDDC-based solutions.

https://community.intersystems.com/post/software-defined-data-centers-sddc-and-hyper-converged-infrastructure-hci-–-important

So, in general, be careful when considering any HCI/SDDC solution not to fall for the HCI marketing hype or promises of being "low cost".  Be sure to consider failure/rebuild scenarios when sizing your HCI cluster.  The often-quoted "4-node cluster" frequently just isn't ideal, and more nodes may be necessary to sustain performance during failure/maintenance situations within a cluster.  We have come across many of these situations, so test, test, test.  :)

Kind regards,

Mark B

Hi Jason,

We are working on a similar utility for writes now, to support either a write-only or a mixed read/write workload.  I hope to have it posted to the community in the next few weeks.

Kind regards,

Mark B-

Thanks Thomas.  Great article!  

One recommendation I would like to add for VM-based snapshot backups: do NOT include the VM's memory state as part of the snapshot.  This greatly reduces the time a VM is "stunned" or paused, which could otherwise bump up close to or even exceed the QoS value.  Excluding the memory state from the VM snapshot is safe for the database because recovery never relies on information in memory (assuming the appropriate ExternalFreeze and ExternalThaw APIs are used), since all writes from the database are frozen during the snapshot (journal writes are still occurring).
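For reference, here is a minimal ObjectScript sketch of the freeze/thaw pattern a snapshot script would drive around the hypervisor snapshot.  This is only an illustration, not production code - it assumes it runs with sufficient privileges (for example, in the %SYS namespace), and the snapshot step itself is just a placeholder comment:

   // Minimal sketch: pause database writes before the VM snapshot is taken.
   Set sc = ##class(Backup.General).ExternalFreeze()
   If $SYSTEM.Status.IsError(sc) { Write "Freeze failed - aborting snapshot",! Quit }
   // ... trigger the VM snapshot from the hypervisor here (memory state excluded) ...
   // Resume database writes once the snapshot has been created.
   Set sc = ##class(Backup.General).ExternalThaw()
   If $SYSTEM.Status.IsError(sc) { Write "Thaw failed - check the console log",! }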

Hi Paul,

The call-out method is highly customized and depends on the API features of a particular load balancer.  Basically, the code is added to the ^ZMIRROR routine to call whatever API/CLI is available from the load balancer (or the EC2 CLI calls).
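As a rough illustration only - the shell script name below is purely hypothetical and stands in for whatever API/CLI call your load balancer or the EC2 CLI provides - the NotifyBecomePrimary entry point in ^ZMIRROR is where such a call-out would typically go:

   NotifyBecomePrimary() PUBLIC {
       // Called by mirroring when this member becomes the primary.
       // Replace the command below with the appropriate load balancer
       // API/CLI (or EC2 CLI) call for your environment; this script
       // path is purely hypothetical.
       Set rc = $ZF(-1,"/usr/local/bin/update-lb-primary.sh")
       Quit
   }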

For the appliance polling method (the one I recommend because it is very simple and clean), here is a section from my AWS reference architecture article found here.  The link also provides some good diagrams showing the usage.

AWS Elastic Load Balancer Polling Method

The CSP Gateway's mirror_status.cxw page, available in 2017.1, can be used as the polling method in the ELB health monitor for each mirror member added to the ELB server pool.  Only the primary mirror member will respond 'SUCCESS', thus directing network traffic to only the active primary mirror member.

This method does not require any logic to be added to ^ZMIRROR.  Please note that most load-balancing network appliances have a limit on how frequently the status check can run.  Typically, the shortest polling interval is 5 seconds, which is usually acceptable to support most uptime service level agreements.

An HTTP request for the following resource will test the Mirror Member status of the LOCAL Caché configuration.

 /csp/bin/mirror_status.cxw

For all other cases, the path to these Mirror status requests should resolve to the appropriate Caché server and namespace using the same hierarchical mechanism as that used for requesting real CSP pages.

Example:  To test the Mirror Status of the configuration serving applications in the /csp/user/ path:

 /csp/user/mirror_status.cxw

Note: A CSP license is not consumed by invoking a Mirror Status check.

Depending on whether or not the target instance is the active Primary Member, the Gateway will return one of the following CSP responses:

** Success (Is the Primary Member)

===============================

   HTTP/1.1 200 OK
   Content-Type: text/plain
   Connection: close
   Content-Length: 7

   SUCCESS

** Failure (Is not the Primary Member)

===============================

   HTTP/1.1 503 Service Unavailable
   Content-Type: text/plain
   Connection: close
   Content-Length: 6

   FAILED

** Failure (The Caché server does not support the mirror_status.cxw request)

===============================

   HTTP/1.1 500 Internal Server Error
   Content-Type: text/plain
   Connection: close
   Content-Length: 6

   FAILED
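As a quick way to check what the ELB health monitor will see from a given member, here is a minimal ObjectScript sketch that requests the status page and prints the HTTP status code (the host name and web server port below are assumptions for illustration only):

   // Probe the mirror status page the same way the ELB health monitor does.
   Set req = ##class(%Net.HttpRequest).%New()
   Set req.Server = "mirror-member-host"     // hypothetical member host name
   Set req.Port = 57772                      // assumed web server port
   Do req.Get("/csp/bin/mirror_status.cxw")
   Write req.HttpResponse.StatusCode,!       // 200 on the primary, 503 otherwise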

We are receiving more and more requests for VSS integration, so there may be some movement on it; however, there are no guarantees or commitments at this time.

In regards to the alternative of a crash-consistent backup: yes, it would be safe as long as the databases, WIJ, and journals are all included and captured in a consistent point-in-time snapshot.  The databases in the backup archive may be "corrupt", and only after starting Caché, so the WIJ and journals can be applied, will they be physically accurate.  Just as you said - it is a crash-consistent backup, and the WIJ recovery is key to a successful recovery.

I will post back if I hear of changes coming with VSS integration.

Hi Dean - thanks for the comment.  There are no changes required from a Caché standpoint; however, Microsoft would need to add similar functionality to Windows to allow Azure Backup to call a script within the target Windows VM, similar to how it is done with Linux.  The scripting from Caché would be exactly the same on Windows, except for using .BAT syntax rather than Linux shell scripting, once Microsoft provides that capability.  Microsoft may already have this capability?  I'll have to look to see if they have extended it to Windows as well.

Regards,
Mark B-

I will revise the post to make it clearer that THP is enabled by default in the 2.6.38 kernel, but may be available in prior kernels, and to reference your respective Linux distribution's documentation for confirming and changing the setting.  Thanks for your comments.