Managing with heavy databases

Question

Question

Dmitry Maslennikov · Nov 12, 2018

#Backup #Databases #System Administration #Caché

Let's say you have about 100TB of data in multiple CACHE.DAT. The biggest one is about 30TB but mostly more than 1TB. You have limited time for maintenance during a day, and it is only a few hours at night. You have to check Integrity as much often as possible. And of course backup it.

How would you do it?

Discussion (2)2

Log in or sign up to continue

Stefano Cairoli · Dec 5, 2018

With the assumption that running a full integrity check on the primary, Live instance is not really an option here, my approach would be something along these lines:

- backup of your primary instance using your preferred third party tool

- automate the restore of the backup into a separate environment

- run the integrity check in this second server

I see a couple of advantages in such a scenario:
you can verify that your backups are indeed valid and can be restored, quite a good felling in case of actual needs at some point in time.

As an integrity check is quite I/O intensive, you would move all those extra IOs away from your Live storage and only affect a separate set of disks.

Our documentation includes a section on how to script integrity checks:

"Checking Database Integrity Using the ^Integrity Utility"
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY...

2 0

Alexey Maslov · Dec 5, 2018

A customer of ours is running a production system of two mirrored DB servers and 13 APP ones, running 24x7 with possible downtime <= 2 hours once a month.

Summary production DB size is ~ 4TB. Full backup takes ~ 10 hours, its restore ~ 16 hours. Therefore incremental backups are scheduled nightly, and full backup - weekly; to avoid performance drawback, Mirror Backup server is assigned to run all this activity.

As to integrity check, it turned to be useless since the DB size had overgrown 2TB: weekend had become not enough to run it through, so it was disabled. It doesn't seem to be a great problem because:

Enterprise grade SANs are very reliable, the probability of errors is very low.
Let's imagine that integrity error occurred. Which database blocks would be most probably involved? Those that are active changed by users' processes, rather than those that reside in "stale" part of the database. So, the errors would most likely appear on application level sooner than they might be caught with Integrity Check. What, do you think, would the administrator do in case of such kind of errors? Stop production for several days, running Integrity Check and fixing the errors? No, they would just switch from damaged Primary to (healthy) Backup Server. And what if Backup Server is damaged as well? As they reside on separated disk groups, this scenario means real disaster with the best solution: to switch to Async Backup which resides at another data center.

1 0