Backup procedures off of an Asynchronous Mirror Member

Question

Steve Pisani · Feb 28, 2018

Hi,

A site has a Mirror set, and an Asynchronous DR. The disk drives for each of the instances are local disks and the databases are large (>2 TB)

The current backup strategy started out as Cache Backup, and that is still the case today, but the size of the backup and the time it takes to run are obviously prohibitive.

The External Backup procedure requires a Freeze/Thaw of the Cache environment, with a copy of the environment executing between these 2 steps. However, being on local disks, there seems to be no way to copy off the OS files and Cache databases quickly enough. This wouldn't be a problem if the data was under the control of some snapshot/mirroring technology which could perform a quick copy.
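As a rough illustration of the Freeze / copy / Thaw sequence described above, here is a minimal Python sketch. The three callables (`freeze`, `copy_files`, `thaw`) are hypothetical placeholders, not part of any Cache API; on a real system they would wrap whatever mechanism the site uses to call ExternalFreeze(), take the copy or snapshot, and call ExternalThaw(). The only point it tries to show is that the thaw should run even when the copy step fails.

```python
from typing import Callable


def external_backup(freeze: Callable[[], bool],
                    copy_files: Callable[[], None],
                    thaw: Callable[[], None]) -> bool:
    """Run freeze -> copy -> thaw, always thawing once a freeze succeeded.

    Returns True if the copy completed, False if the freeze was refused.
    """
    if not freeze():      # instance could not be frozen, so there is nothing to undo
        return False
    try:
        copy_files()      # snapshot or file copy runs while database writes are suspended
    finally:
        thaw()            # resume database writes even if the copy raised an error
    return True


# Example with stand-in steps that only record the order of calls.
steps = []
external_backup(lambda: steps.append("freeze") or True,
                lambda: steps.append("copy"),
                lambda: steps.append("thaw"))
print(steps)
```

The longer `copy_files` takes, the longer the instance stays frozen, which is why local disks without a snapshot facility make this approach hard at this database size.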

A Cache backup could be run off of the Async Server (which is satisfactorily caught up). Whilst it would take the same amount of time to complete as it currently does off of one of the Sync mirror nodes, at least it would not consume production system resources.

Are there any considerations we need to be aware of in order to take a successful backup off of the Async member? I am aware that the backup should also contain journal files, cache.cpf and other OS files that make up the solution.

thanks,

Steve

Discussion (8)

Ray Fucillo · Mar 5, 2018

Short answer: yes, you can certainly do this if you want to and the result is valid. The main downside, in my opinion, is that the backup is then dependent on more technology, so there are more things that could go wrong. More on that later.

If you're going to do this though, you really don't want to end up with Online Backup as your backup solution. The problem with online backup is not consumption of resources, but time to restore. I thought you were going to say you wanted the DR system so that you could shut it down for a couple of hours while you take a cold external backup. That would be a pretty good reason to do this.

Since mirrored databases record their journal location inside the database, they intrinsically know from what journal file they need to "catch up" (the mirror checkpoint info). Like all the usual backup solutions, the result is not transactionally consistent in and of itself, but requires journal restore following backup restore to get to a transactionally consistent state. Mirroring makes this easier via the aforementioned checkpoint and the automatic rollback as part of becoming primary. Of course it's the mirror journal files, not the DR's own journal files, that will be used for this, but they live in the same directory, so if you just back that up in the same backup, you'll have the right stuff if it ever came to restoring this.
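To make the "catch up from the checkpoint" idea concrete, here is a small Python sketch that picks, from a journal directory listing, the mirror journal files at or after a given checkpoint file. It is only an illustration: the file naming it assumes (a prefix followed by `YYYYMMDD.NNN`), the `MIRROR-DEMO-` prefix and the function names are all assumptions made for the example, and the real restore is done by the mirror and journal restore tooling, not by a script like this.

```python
import re
from typing import List, Tuple

# Assumed naming: optional prefix, then YYYYMMDD.NNN (date plus sequence number).
_JOURNAL_RE = re.compile(r"(\d{8})\.(\d+)$")


def journal_key(name: str) -> Tuple[int, int]:
    """Return (date, sequence) so that files sort in the order they were written."""
    match = _JOURNAL_RE.search(name)
    if match is None:
        raise ValueError(f"not a journal file name: {name}")
    return int(match.group(1)), int(match.group(2))


def journals_to_restore(available: List[str], checkpoint: str, prefix: str) -> List[str]:
    """Mirror journal files (matching prefix) from the checkpoint file onward."""
    start = journal_key(checkpoint)
    needed = [n for n in available
              if n.startswith(prefix) and journal_key(n) >= start]
    return sorted(needed, key=journal_key)


# Hypothetical directory listing: mirror journals mixed with the DR's own journals.
listing = [
    "MIRROR-DEMO-20180301.001",
    "MIRROR-DEMO-20180228.003",
    "20180301.001",
    "MIRROR-DEMO-20180301.002",
    "MIRROR-DEMO-20180227.001",
]
print(journals_to_restore(listing, "MIRROR-DEMO-20180228.003", "MIRROR-DEMO-"))
```

The prefix filter mirrors the point above that the two kinds of journal file share a directory but only the mirror journal files are used for the catch-up.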

Now more about those downsides. Backing up a replica means that you are subject to any problems with the replication. For example, if a database on the DR had a problem and we had to stop dejournaling to it, that could mean your backup isn't good. You'd worry a bit that you didn't notice because nobody is running on the DR system. Or if you add a database to the primary but forget to add the same to the DR, your backup wouldn't have it. These aren't meant to say this is a bad idea, but it is a consideration. You want to think a bit about what you're trying to protect against. You're talking about having a DR, so if you're restoring a backup it means that something went wrong with both the primary and the DR. So is the backup of the DR good in that situation? If both are in the same physical location and you're backing up in case that location is destroyed, then you're protected. Or if you're backing up to handle the case of errant/malicious deletion of data, then you're protected.
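A pre-backup sanity check for the two replication problems mentioned above (a database that exists on the primary but not on the DR, and a DR database where dejournaling has stopped) could look something like this Python sketch. The inputs are hypothetical: how a site actually collects the database lists and the dejournaling state from each instance is not covered here, and the function and key names are made up for the example.

```python
from typing import Dict, Iterable, List


def replica_backup_gaps(primary_dbs: Iterable[str],
                        dr_dbs: Iterable[str],
                        dr_dejournaling: Dict[str, bool]) -> Dict[str, List[str]]:
    """Report databases that a backup of the DR would miss or capture stale.

    dr_dejournaling maps a DR database name to True while dejournaling is
    running for it; a missing entry is treated as stopped.
    """
    dr_set = set(dr_dbs)
    missing = sorted(set(primary_dbs) - dr_set)
    stalled = sorted(db for db in dr_set if not dr_dejournaling.get(db, False))
    return {"missing_on_dr": missing, "dejournaling_stopped": stalled}


# Hypothetical database names, for illustration only.
report = replica_backup_gaps(
    primary_dbs=["APP", "AUDIT", "HL7"],
    dr_dbs=["APP", "HL7"],
    dr_dejournaling={"APP": True, "HL7": False},
)
print(report)
```

Running a check like this before each backup addresses the worry that nobody is watching the DR closely enough to notice such a gap.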

I don't know what your situation is with the main server, but I'd be curious how the system architect expects backups to take place and how long a backup of the disks is expected to take. With large global buffers, ExternalFreeze() can be workable in some application environments even if the freeze will last many minutes. If your operating environment is such that good backups are an absolute must, you might be better off investing in getting external backup working over there.

score 1 · Answer 1 · 2018-03-01T13:35:42-05:00

Whether you use external backup, backup from a mirror (including an asynchronous one) or Caché backup, you always have to identify
the point in time when your DB is logically consistent. By that I mean no open transactions and no open dependencies.

If you know that point in time you could separate your async mirror or shadow and run any backup from there.
Or just shut down your async server instance and run snapshots.
But there might also be a time gap between master and async server.
Once completed your async server can join again and catch up whatever time that may take.
The critical point is to know when the async server has reached consistency.
But that depends on the application.

score 0 · Answer 2 · 2018-03-01T20:52:33-05:00

So I guess that is the question...

Is it possible to identify a point-in-time on the Async Mirror that it is logically consistent ?

I could shut down the Async member before backing it up, restart it, and then have it re-join and catch up with the mirror set. I'm seeking confirmation that this process will leave me with cache.dat's that are in a consistent state and that, if restored, could be used like any other backup and accept play-forward journal files.

Would this also apply if I performed an online Cache backup on the async member without shutting it down, or is the mirror de-journal activity on the async member ignorant of the final passes in the Cache Backup process ?

It is impossible to determine a point-in-time logical consistency for the Async Member which is receiving mirror data from 5 busy Ensemble productions. I'm hoping shutting it down, or taking it 'off-line' in some controlled manner, would leave it in a state that could be backed up for later use.

Steve

score 1 · Answer 3 · 2018-03-04T21:05:53-05:00

Hi!

IMHO, I don't think this is application dependent at all. When we do the Freeze on one of the failover members, we don't care about what is running on the instance. Please notice that after you call Freeze, you snapshot everything: not only the filesystems where the database files are, but also the filesystems where journal files, WIJ files and application files are. So, when you restore the backup, you are restoring a point in time at which the database file (CACHE.DAT) may or may not be consistent, together with the journal and WIJ files that will make it consistent.

Also it is important to notice that Freeze/Thaw will switch the journal file for you and this is where transaction consistency will be kept. I mean, Caché/Ensemble/IRIS will split the journal at a point where the entire transaction and probably what is on the WIJ file is there and consistent.

After restoring this backup, you must execute a Journal Restore to restore all journal files generated after the Freeze/Thaw to bring your instance up to date.

Unfortunately I can't answer about doing the backup on the Async node. At first, I believe there is no problem with it. You just need to be careful not to forget that the Async node exists, and to apply to it the patches and configurations you have applied to the failover members, so you have a complete backup of your system (application code and configurations included). But I don't know what happens when you execute the Freeze/Thaw procedure on the Async. Supposedly, freezing writing to the databases and new journal file creation would be performed on all cluster members, but the documentation is not clear about what "all cluster members" means. It is not clear if "all cluster members" includes Async Mirror Members.

My opinion is that Backup on an Async Member is not supported and may be problematic. For it to work, it would still have to freeze both failover members to have consistent journal files on all nodes. So there would be no gain in doing it on the Async node. But that is only my opinion. Let's see if someone else can confirm my suspicion.

Kind regards,

AS

score 0 · Answer 4 · 2018-03-05T20:37:02-05:00

Thanks Ray.

I'm still hoping this site will improve the underlying network storage such that snapshots can be taken as backups, but at least now we are aware of the options/pitfalls with attempting to use an Async member for backups.

The site should be in a better position to make an informed decision.

Thanks -

Steve

score 0 · Answer 5 · 2018-03-28T18:27:42-04:00

Thanks Anzelem.

It's good to hear that backups off of the DR are working for your situation.

I'm concerned that the online backups proved problematic for you (and by online backups, I'm assuming you refer to the ExternalFreeze()/ExternalThaw() process). Be sure that, after ExternalFreeze() is called, you copy not only the folders where the databases exist, but also the folders where the journal files, WIJ files and your application files are.

A restore procedure would require all of these to be restored (including the WIJ and journal files) to ensure that, when Cache starts up after a restore, the databases are in an integral and consistent state as of the time of the backup.
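As a simple way to guard against leaving one of these pieces out, a backup script could verify its own manifest before declaring success. The Python sketch below is only an illustration: the component names, the manifest layout and the paths in the example are hypothetical and would need to match the actual directory layout of the instance.

```python
from typing import Dict, List

# Everything that has to be copied between ExternalFreeze() and ExternalThaw().
REQUIRED_COMPONENTS = ("databases", "journals", "wij", "application_files")


def missing_components(manifest: Dict[str, List[str]]) -> List[str]:
    """Return the required components that have no paths listed in the manifest."""
    return [name for name in REQUIRED_COMPONENTS if not manifest.get(name)]


# Hypothetical manifest: the WIJ directory was forgotten and no application files are listed.
manifest = {
    "databases": ["/data/db"],
    "journals": ["/data/journal"],
    "wij": [],
}
print(missing_components(manifest))
```

An empty result means every category has at least one path; it does not prove that the listed paths are themselves complete.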

I know of lots of people successfully using the ExternalFreeze/Thaw on production systems.

thanks -

Steve

score 0 · Answer 6 · 2018-03-06T16:35:40-05:00

Hi again!

I was checking the documentation of ExternalFreeze() here and there is an option for not switching the journal file. The parameter is defaulted to 1 (to switch the journal file) but you can change it to 0. Maybe that would allow you to do the Freeze/Thaw on an Async mirror member without consequences. Maybe ExternalFreeze() will do it without switching the journal file independently of what you pass to this parameter just because it's being called on an Async mirror member. The documentation is not clear though...

Maybe someone with more knowledge about the internals of this could clarify? I believe each CACHE.DAT file knows what was the last journal entry applied to it and, during a restore procedure, it could simply start in the middle of a journal file and proceed to the newer journal files created during/after the backup.

I would like to understand why we switch the journal file if, during a Freeze, all new transactions that can't be written to the database (because of the freeze) will be on the current journal file. A new journal file is created after the ExternalThaw() but all those transactions executed during the Freeze will be there on the previous journal file. It seems to me that switching the journal file serves no purpose since we always have to pick the previous journal file anyway during a restore.

Kind regards,

AS

score 0 · Answer 7 · 2018-03-28T11:21:27-04:00

Hi Steve;

I have encountered the same problem where these online backups are problematic on a production system, from the time the backup is written to disk to the time it is copied out.

To start with - these are cold offsite backups taken only at a point in time - (NOT A DR SOLUTION). So backing up on the DR Async made more sense to us, and it relieved the production system a lot. I have actually restored these backups on other aut/training/test environments and they are just as good.

So for off-site backup purposes, I have encountered no harm in doing that.

Regards;
Anzelem.