Hi Alexey,

I can help with your question.  The reason it works this way is that you can't (or at least shouldn't) have a database file (CACHE.DAT or IRIS.DAT) opened in contending modes (both unbuffered and buffered), to avoid file corruption or stale data.  Now, the actual writing of the online backup CBK file can be a buffered write because it is independent of the DB, as you mentioned, but the reads of the database blocks by the online backup utility will be unbuffered direct I/O reads.  This is where the slow-down may occur: from reading the database blocks, not the actual writing of the CBK backup file.
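To make the distinction concrete, here is a minimal sketch (my own illustration, not InterSystems code) of the two I/O modes, assuming Linux and Python; the file paths and 8 KB block size are placeholders:

    # Illustration only (not InterSystems code): contrast an unbuffered, direct
    # read of database blocks with a buffered write of the backup output file.
    # Assumes Linux; file paths and the 8 KB block size are placeholders.
    import mmap
    import os

    BLOCK = 8192  # placeholder database block size

    # Unbuffered direct read -- bypasses the OS page cache, like the online
    # backup utility's reads of CACHE.DAT / IRIS.DAT blocks.
    fd = os.open("/data/db/IRIS.DAT", os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)          # page-aligned buffer, required by O_DIRECT
    bytes_read = os.readv(fd, [buf])
    os.close(fd)

    # Buffered write -- goes through the page cache, like writing the CBK file.
    with open("/backups/full.cbk", "ab") as cbk:
        cbk.write(buf[:bytes_read])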

Regards,
Mark B-

Hi Scott,

Have you looked at using the Ensemble Enterprise Monitor?  It provides a centralized "pane of glass" dashboard-type display across multiple productions.  Details on using it can be found here in the Ensemble documentation.

Regards,
Mark B-

Hi Ashish,

We are actively working with Nutanix on a potential example reference architecture, but nothing is imminent at this time.  The challenge with HCI solutions, Nutanix being one of them, is that there is more to the solution than just the nodes themselves.  The network topology and switches play a very important role.

Additionally, performance with HCI solutions is good...until it isn't.  What I mean by that is performance can be good with HCI/SDDC solutions; however, maintaining the expected performance during node failures and/or maintenance periods is the key.  Not all SSDs are created equal, so consideration of storage access performance in all situations, such as normal operations, failure conditions, and node rebuild/rebalancing, is important.  Data locality also plays a large role with HCI, and in some HCI solutions so does the working dataset size (i.e., a larger dataset with random access patterns to that data can have an adverse and unexpected impact on storage latency).

Here's a link to an article I authored regarding our current experiences and general recommendations with HCI and SDDC-based solutions.

https://community.intersystems.com/post/software-defined-data-centers-sddc-and-hyper-converged-infrastructure-hci-–-important

So, in general, be careful when considering any HCI/SDDC solution not to fall into the HCI marketing hype or promises of being "low cost".  Be sure to consider failure/rebuild scenarios when sizing your HCI cluster.  Many times the often-quoted "4-node cluster" just isn't ideal, and more nodes may be necessary to support performance during failure/maintenance situations within a cluster.  We have come across many of these situations, so test, test, test.  :)

Kind regards,

Mark B

Hello Ashish,

Great question.  Yes, NetBackup is widely used by many of our customers, and using the ExternalFreeze/Thaw APIs is the best approach.  Also, with your environment being on VMware ESXi 6, we can support using VMDK snapshots as part of the backup process, assuming you have the feature in NetBackup 8.1 to support VMware guest snapshots.  I found the following link covering NetBackup 8.1 and its support in a VMware environment:  https://www.veritas.com/content/support/en_US/doc/NB_70_80_VE

You will want to have pre/post scripts added to the NetBackup backup job so that the database is frozen prior to taking the snapshot and then thawed right after the snapshot.  NetBackup will then take a clean backup of the VMDKs, providing an application-consistent backup.  Here is another link to an article on using ExternalFreeze/Thaw in a VMware environment:  https://community.intersystems.com/post/intersystems-data-platforms-and-performance-–-vm-backups-and-caché-freezethaw-scripts
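For illustration, here is a minimal sketch of what such a pre/post script could look like, assuming a Caché instance named CACHE and the csession utility on the PATH (both are assumptions for the example); see the linked article for complete, production-ready freeze/thaw scripts:

    #!/usr/bin/env python3
    # Minimal sketch of NetBackup pre/post scripts -- assumes a Cache instance
    # named "CACHE" and csession on the PATH; not a complete production script.
    import subprocess
    import sys

    INSTANCE = "CACHE"  # assumption: your instance name

    def run(cmd: str) -> int:
        """Invoke an ObjectScript expression in the %SYS namespace."""
        return subprocess.call(["csession", INSTANCE, "-U", "%SYS", cmd])

    if __name__ == "__main__":
        action = sys.argv[1] if len(sys.argv) > 1 else ""
        if action == "freeze":
            # Pause database writes before the VMDK snapshot is taken.
            sys.exit(run("##Class(Backup.General).ExternalFreeze()"))
        elif action == "thaw":
            # Resume database writes right after the snapshot completes.
            sys.exit(run("##Class(Backup.General).ExternalThaw()"))
        else:
            sys.exit("usage: freeze_thaw.py freeze|thaw")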

I hope this helps.  Please let me know if you have any questions.

Regards,

Mark B-

Hi Jason,

We are working on a similar utility for writes now, to support either a write-only or a mixed read/write workload.  I hope to have it posted to the community in the next few weeks.

Kind regards,

Mark B-

Hi Jason,

Thank you for your post.  We provide a storage performance utility called RANREAD.  It actually uses HealthShare/Ensemble (also Caché and InterSystems IRIS) to generate the workload, rather than relying on an external tool trying to simulate what HealthShare/Ensemble might do.  You can find the details in this community article here.

Kind regards,

Mark B-

Thanks Thomas.  Great article!  

One recommendation I would like to add: with VM-based snapshot backups, we recommend NOT including the VM's memory state as part of the snapshot.  This greatly reduces the time a VM is "stunned" or paused, which could otherwise come close to or exceed the QoS value.  Not including the memory state in the VM snapshot is OK for the database, because recovery never relies on information in memory (assuming the appropriate ExternalFreeze and ExternalThaw APIs are used), since all writes from the database are frozen during the snapshot (journal writes are still occurring).

Hi Paul,

The call-out method is highly customized and depends on the API features of a particular load balancer.  Basically, code is added to the ^ZMIRROR routine to call whatever API/CLI is available from the load balancer (or the EC2 CLI calls).
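As a purely hypothetical illustration of the call-out idea, here is a small helper that ^ZMIRROR could shell out to after a member becomes primary, using boto3 to re-point an AWS Elastic IP (the IDs are placeholders, and this is not InterSystems-supplied code; adapt it to whatever API/CLI your load balancer provides):

    # Hypothetical helper that ^ZMIRROR could shell out to after this mirror
    # member becomes primary: re-point an AWS Elastic IP at this instance.
    # The allocation and instance IDs are placeholders.
    import boto3

    ALLOCATION_ID = "eipalloc-0123456789abcdef0"  # assumption: your EIP allocation
    INSTANCE_ID = "i-0123456789abcdef0"           # assumption: this mirror member

    def promote_to_primary() -> None:
        """Associate the Elastic IP with this instance so client traffic
        follows the new primary mirror member."""
        ec2 = boto3.client("ec2")
        ec2.associate_address(
            AllocationId=ALLOCATION_ID,
            InstanceId=INSTANCE_ID,
            AllowReassociation=True,  # take the EIP over from the old primary
        )

    if __name__ == "__main__":
        promote_to_primary()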

As for the appliance polling method (the one I recommend because it is very simple and clean), here is a section from my AWS reference architecture article, found here.  The link also provides some good diagrams showing the usage.

AWS Elastic Load Balancer Polling Method

A polling method using the CSP Gateway's mirror_status.cxw page, available in 2017.1, can be used as the polling method in the ELB health monitor for each mirror member added to the ELB server pool.  Only the primary mirror member will respond 'SUCCESS', thus directing network traffic only to the active primary mirror member.

This method does not require any logic to be added to ^ZMIRROR.  Please note that most load-balancing network appliances have a limit on the frequency of running the status check.  Typically, the highest frequency is no less than 5 seconds, which is usually acceptable to support most uptime service level agreements.

An HTTP request for the following resource will test the mirror member status of the LOCAL Caché configuration.

 /csp/bin/mirror_status.cxw

For all other cases, the path to these mirror status requests should resolve to the appropriate Caché server and namespace using the same hierarchical mechanism as that used for requesting real CSP pages.

Example:  To test the Mirror Status of the configuration serving applications in the /csp/user/ path:

 /csp/user/mirror_status.cxw

Note: A CSP license is not consumed by invoking a Mirror Status check.

Depending on whether or not the target instance is the active Primary Member, the Gateway will return one of the following CSP responses:

** Success (Is the Primary Member)
===============================
   HTTP/1.1 200 OK
   Content-Type: text/plain
   Connection: close
   Content-Length: 7

   SUCCESS

** Failure (Is not the Primary Member)
===============================
   HTTP/1.1 503 Service Unavailable
   Content-Type: text/plain
   Connection: close
   Content-Length: 6

   FAILED

** Failure (The Cache Server does not support the Mirror_Status.cxw request)
===============================
   HTTP/1.1 500 Internal Server Error
   Content-Type: text/plain
   Connection: close
   Content-Length: 6

   FAILED
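
For reference, here is a small sketch of the same check the ELB health monitor performs, written in Python; the port 57772 and host names are assumptions for the example, so substitute your own:

    # Sketch of the mirror_status.cxw health check an ELB monitor performs.
    # Port 57772 and the host names are placeholders -- adjust for your setup.
    import urllib.request

    def is_primary(host, port=57772, path="/csp/bin/mirror_status.cxw"):
        """Return True only if this mirror member answers 200/SUCCESS."""
        url = f"http://{host}:{port}{path}"
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200 and resp.read().strip() == b"SUCCESS"
        except OSError:  # 503/500 responses, timeouts, unreachable host
            return False

    if __name__ == "__main__":
        for member in ("mirror-a.example.com", "mirror-b.example.com"):
            print(member, "-> primary" if is_primary(member) else "-> not primary")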