This behavior looks correct to me (but it's tricky).  The reason is that the string "2.2" is a number in canonical form, so it collates with the numeric subscripts.  "1.0" is non-canonical, so it's stored as a string subscript.  The sorts-after operation is all about resolving subscript ordering.  You can convince yourself of this behavior by actually setting these as subscripts in a global or local variable and then ZWRITE'ing it.

The same reasoning is why "2.2" = 2.2 evaluates true but "1.0" = 1.0 is false.

Note, of course, that numeric conversion will happen as part of any arithmetic operation, so "1.0" still functions as 1 in such operations.
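Here's a minimal sketch of the ZWRITE experiment you could run in a terminal (the local variable name x is just an example):

    KILL x
    SET x(2.2) = "", x("2.2") = ""   ; canonical: both refer to the same numeric subscript
    SET x("1.0") = ""                ; non-canonical: stored as a string subscript
    ZWRITE x                         ; numeric subscript 2.2 sorts before the string "1.0"
    WRITE "2.2" = 2.2, !             ; 1: the number 2.2 converts to the canonical string "2.2"
    WRITE "1.0" = 1.0, !             ; 0: the number 1.0 converts to "1", which is not the string "1.0"
    WRITE "1.0" + 1, !               ; 2: arithmetic still converts "1.0" to the number 1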

If you have a true moment-in-time snapshot image of all the pieces of Caché (databases, WIJ, Journals, installation/manager directory, etc), then restoring that image is, to Caché, just as though the machine had crashed at that moment in time.  When the instance of Caché within that restored image starts up, all Caché's usual automatic recovery mechanisms that give you full protection against system crashes equivalently give you full protection in this restore scenario.

Whether a given snapshot can be considered crash-consistent depends on the underlying snapshotting technology, but in general that's what "snapshot" means.  The main consideration is that all of the filesystems involved in Caché are part of the same moment in time (sometimes referred to as a "consistency group").  It's no good to combine an image of the CACHE.DAT files from one moment in time with an image of the WIJ or journals from another.

Most production sites wouldn't plan their backups this way because it means that the only operation you can do on the backup image is restore the whole thing and start Caché.  You can't take one CACHE.DAT from there and get it to a consistent state.  But, in the case of snapshots of a VM guest, this does come up a fair bit, since it's simple to take an image of a guest and start it on other hardware.  

Let me know if you have questions.

You will start the restore at the file that was switched to (your .003 file), and that file contains metadata that allows us to find the oldest open transaction to roll back.  The rollback as part of journal restore will scan backwards in the journal stream to find it if needed.  If you need to know what that oldest file will be, you can get it via the RequiredFile output parameter of ExternalFreeze() or by calling %SYS.Journal.File:RequiredForRecovery() before calling ExternalFreeze().  Again though, you don't need to start the journal restore from that older file, just have it (and the journal.log to find it) available at restore time.  So, if you're backing up and restoring all journals that are on the system, this basically takes care of itself.
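For example, something along these lines; this is only a sketch, and I'm assuming RequiredForRecovery() returns a %Status and passes the file name back by reference, so check the class reference for the exact argument lists of RequiredForRecovery() and ExternalFreeze() on your version:

    ; Find the oldest journal file needed for recovery before freezing.
    SET sc = ##class(%SYS.Journal.File).RequiredForRecovery(.oldestFile)
    IF $SYSTEM.Status.IsOK(sc) { WRITE "Oldest journal required: ", oldestFile, ! }

    ; Or let ExternalFreeze() report it through its RequiredFile output argument
    ; (argument positions deliberately not shown -- see the class reference).
    SET sc = ##class(Backup.General).ExternalFreeze()
    ; ... take the external snapshot / backup here ...
    SET sc = ##class(Backup.General).ExternalThaw()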

Upon return from ExternalFreeze(), the CACHE.DAT files will contain all of the updates that occurred prior to when it was invoked.  Some of those updates may, in fact, be journaled in the file that was switched to (the .003 file in your example), though that doesn't really matter for your question.

BUT, you still need to do journal restore, in general, because the backup image may contain partially committed transactions and journal restore is what rolls them back, even if the image of journals that you have at restore time contains no newer records than the CACHE.DAT files do.  This is covered in the Restore section of documentation, which I recommend having a look at:  http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=...

There is an exception to this, and that is if you have a crash-consistent snapshot of the entire system, including all CACHE.DAT files, the manager directory, journals, and the WIJ.  In that case, all the crash-consistency guarantees that the WIJ and journals confer mean that when you start that restored image, the usual startup recovery actions will take care of any required roll forward and roll back from journals automatically.  In that scenario with crash-consistent snapshots, ExternalFreeze() wasn't even needed to begin with, because a crash-consistent snapshot is by definition good enough.  However, ExternalFreeze() is typically used for planned external backups because it allows you to restore a subset of databases rather than requiring a restore of the entire system.

A few comments:

1. Similar to what Alexey said, any time you're using a mix of data that is journaled and data that is non-journaled but also not temporary (i.e., it will survive a restart), you have to remain keenly aware of recovery semantics.  After a crash and restart, the journaled data will be at a later point in time than the non-journaled data.  Only in fairly special cases is data meant to persist across restarts yet not need to be as up to date as the rest for the integrity of the application.  This needs to be considered in the development cycle.

2. If using non-journaled databases, be aware of their recovery semantics; they can be a bit non-intuitive.  Transactions involving those databases are journaled to satisfy rollback at runtime, but that journal information is not used during journal recovery or rollback at startup, so transactions there are not atomic or durable (even in synchronous commit mode) across restarts.  What this does get you is that all data in all the journaled databases is recovered to the same moment in time after a crash, regardless of whether it was in a transaction or not.

3. Mirrored databases ignore the process ^%NOJRN flag discussed in e. (though it is honored for non-mirrored databases on mirror members).   
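For reference, the process-level flag mentioned in 3 is set through the ^%NOJRN utility entry points.  A rough sketch (and, again, mirrored databases ignore this):

    DO DISABLE^%NOJRN        ; stop journaling SETs/KILLs from this process
    ; ... bulk load into a non-mirrored, journaled database ...
    DO ENABLE^%NOJRN         ; restore normal journaling for this process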

It's important to start by saying that mirroring already handles this automatically for the most common cases, and it is more the exceptional case that would require the original failover members to be rebuilt after no-partner promotion.  As long as the original members really did go down in the disaster and the DR member is relatively up to date (a few seconds or even a few tens of seconds of data loss), then it is usually the case that the original members can reconcile automatically when they reconnect (as DR asyncs) to the new primary.  That's because the state of the CACHE.DAT files on disk did not advance past the journal position from which the DR member took over.  This is not a guarantee, but it is the case in most disasters that this is intended to cover.

The feature Bob mentioned, which automatically surveys other reachable members, helps make sure that the DR member becoming primary has all the data that is possibly available to it at the time (while not preventing it from becoming primary if it cannot reach them).

The main case where this automatic reconciliation cannot happen is if the failover member(s) got isolated but did not crash, or at least did not crash right away.  In that case, if you choose to promote the DR member and accept this larger amount of data loss in the process, then indeed you expect the on-disk CACHE.DAT state to have advanced into a part of the journal that the DR member never had (and probably cannot get).

Regarding the enhancement you mention, there are no plans at the moment, though it's certainly a reasonable idea. 

The biggest thing you want to do is use three-argument $order to collapse two global references into one:  $ORDER(^[Nspace]LAB(PIDX),1,Data)

Regarding the question about setting BBData or other small variants like that, the answer may well be data-dependent, and it depends on what happens later in the loop that you haven't shown us.  But generally speaking, if you're going to calculate the $p more than once, you probably do want to store it in a (private) variable.

You can certainly combine multiple conditions with AND and OR operators (&& and ||) if that's what you're asking.  Also, constructs like $case and $select can help (in case you haven't encountered them before).
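To make that concrete, here's a rough sketch of the loop shape I have in mind; the names (Nspace, PIDX, Data, BBData), the delimiter, and the piece position are placeholders rather than your actual code:

    SET Nspace = "LABDATA"                               ; placeholder namespace for the extended reference
    SET PIDX = ""
    FOR {
        SET PIDX = $ORDER(^[Nspace]LAB(PIDX), 1, Data)   ; one reference: next subscript AND its node value
        QUIT:PIDX=""
        SET BBData = $PIECE(Data, "^", 2)                ; compute the piece once and reuse it
        IF (BBData = "A") || (BBData = "B") {
            ; ... whatever happens later in the loop ...
        }
    }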

Isn't the algorithm you describe going to lead to data discrepancies?  In particular, you have something like a 1 in 2^32 chance of missing an update because the new value hashes to the same CRC value as the old one.  Maybe this was already obvious to you and it's okay for some reason, but I thought I should say something just in case...

Of course you could use a cryptographic hash function, like $system.Encryption.SHAHash(), but that takes substantial computation time, so you might not be any better off than you would be by actually opening the object and comparing the values directly.  It sounds like either way you're already resigned to traversing every object in the source database.  (If the source database is growing, then this entire approach won't work indefinitely, of course.)
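To illustrate the trade-off (mode 7 is what I recall being CRC-32 for $ZCRC; double-check against the $ZCRC documentation):

    SET data = "some serialized object state"
    WRITE $ZCRC(data, 7), !          ; 32-bit CRC: fast, but collisions are possible
    WRITE $SYSTEM.Encryption.Base64Encode($SYSTEM.Encryption.SHAHash(256, data)), !   ; SHA-256: collision-resistant, but slower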

Alex, I agree with you that I wouldn't recommend using this function for any of the use cases you mention. 

Laurel mentions one use case below, where you wish to preserve the state of a DR or backup before performing an application upgrade or data conversion so that it can be viable as a failback if something goes wrong.

Another case (which we mention in documentation) is if you are performing some maintenance activity on the primary host, particularly a virtual host, whereby you expect that it might interrupt connections to the backup and arbiter, and you'd rather not have disconnects or failovers occur as a result.  This use case raises some questions, like why not just fail over to the backup before that maintenance, but we'll leave that aside.

There's also the principle that it's good to have a way to shut things off temporarily without needing to dismantle the configuration or shut down the instance entirely.  That can be handy in troubleshooting.  

In the Mirror Monitor you would see the state of the member as Stopped, which is defined exactly as stopped by an admin; a member that is not connected for other reasons shows a different state, like Waiting or Crashed, and that is unchanged.  With this change, we would add a cconsole.log message when we skip starting mirroring at instance startup because it was stopped by an admin.

At a fundamental level the worry that you attribute to ObjectScript is not really particular to ObjectScript or any other language, but rather an issue of parallel vs serial processing.  The fundamental issue you're raising here is that when programming at the level of individual database accesses ($order or random gets or whatever) one process is in a loop doing a single database operation, performing some (perhaps minimal) computation, and then doing another database operation.  Some of those database operations may require a disk read, but, especially in the $order case, many will not because the block is already cached.  When it does need to do a disk read, the process is not doing computation because, well, this is all serial; the next computation depends on what will be read.  Imagine the CPU portion of the loop could be magically minimized to zero; even then this process could only keep a single disk busy at a time.  However, the disk array you're talking about achieves 50,000 IOPS not from a single disk, but from multiple disks under some theoretical workload that would utilize them all simultaneously. 

Integrity check and the write daemons are able to drive more IOPS because they use multiple processes and/or asynchronous I/O to put multiple I/Os in flight simultaneously.

Where language, programming skill, and ObjectScript come into play is in how readily a program that wishes to put multiple I/Os in flight can do so.  ObjectScript enables this, primarily, by giving you controls to start multiple jobs (with the JOB command) and good mechanisms to allow those multiple jobs to cooperate.  For a single process, ObjectScript provides $prefetchon to tell the Caché kernel to do disk prefetching asynchronously on behalf of that process, but that only helps in sequential-access-type workloads.
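For instance, a bare-bones sketch of fanning work out with JOB (ScanRange^MYSCAN is a hypothetical routine that reads its own slice of the data):

    FOR i = 1:1:4 {
        JOB ScanRange^MYSCAN(i)      ; each child process issues its own disk reads, so several I/Os are in flight at once
    }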

Programming constructs that work at a higher level of abstraction (higher than a single global access) may do some parallelization for you.  Caché has some of these in many different contexts; %PARALLEL in SQL and the work queue manager come to mind.  (In SQL Server, you are already programming at this higher level of abstraction, and indeed it's not surprising that there's parallelization that can happen without the programmer needing to be aware of it.  Under the covers, though, this is undoubtedly implemented with the sorts of programming constructs I've described: multiple threads of execution and/or asynchronous I/O.)
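As a rough sketch of the work queue manager version (MyApp.Scanner and ScanRange are hypothetical names; check the %SYSTEM.WorkMgr class reference for the exact Initialize() arguments on your version):

    SET queue = $SYSTEM.WorkMgr.Initialize(, .sc)
    IF $SYSTEM.Status.IsOK(sc) {
        FOR i = 1:1:8 {
            SET sc = queue.Queue("##class(MyApp.Scanner).ScanRange", i)   ; each unit runs in its own worker process
        }
        SET sc = queue.WaitForComplete()                                  ; block until all queued units finish
    }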

Of course, how readily a task can be adapted to being parallelized is highly specific to what the task is doing, and that is application-specific.  Therefore there are tasks for which this disk array has far more capability than an application may use at a given user load.  However, even an application in which no single task would ever utilize anywhere near this much disk capability may, when scaled up to tens of thousands of users, indeed want a disk array like this and make use of it quite naturally.  Naturally, not by virtue of the program being written to parallelize individual tasks, but by having thousands of individual tasks running in parallel.

There is no utility to do this.  You're right that creating such a mechanism is just a matter of manipulating the right bits and bytes, but it does mean that you'd lose the guarantee that these are identical copies, so we haven't created one.  The only context in which anything like this is available is the special case of converting shadow systems to mirror systems, where we do have a migration utility that doesn't require completely resynchronizing the databases.

This is pretty clearly a mistake in the definition of the Search custom query.  We will look into the history a bit more and correct it.  Since the (custom query) Execute method defines the expected arguments, invocation through a resultset works.  Beyond the understandable confusion you had, Mike, it makes sense that this could cause other things not to work, like the issue Dmitry illustrates.

You might want to take a look at the List query in %SYS.Journal.Record.  That's a much nicer interface for searching a journal in my opinion.  Also, I suspect you'll find it performs better for most use cases. 
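For example, something like the following; the Execute() argument and the column names here are assumptions on my part, so check the class reference for the List query's actual parameters:

    SET rs = ##class(%ResultSet).%New("%SYS.Journal.Record:List")
    SET sc = rs.Execute("/cachesys/mgr/journal/20240101.001")   ; journal file path is a placeholder
    WHILE rs.Next() {
        ; inspect columns such as Address, TimeStamp, GlobalReference (names assumed)
    }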

Hopefully someone will chime in with real-life numbers, but I thought it would be helpful to take you through the principles at play to guide your thinking...

1. With any mirror configuration that is going over a WAN (for failover or just DR), you're going to need to ensure sufficient bandwidth to transfer journals over the network at the peak rate of journal creation.  This is application- and load-specific of course, so it is derived from measuring a reference system running that application.  It's important to base this on the peak journal creation rate, not the average rate, leaving plenty of room for spikes, additional growth, etc.

2016.1 introduces network compression for journal transfer, and that can substantially reduce bandwidth (70% or more for typical journal contents).  Although it can add computation latency on top of the network latency you'd consider in #2 below, if you're already going to use SSL encryption, compression may actually save some latency compared to SSL encryption alone.  See the documentation on journal data compression.

2. With failover members in different data centers, latency can be a factor for certain application events.  Specifically, it's a factor when an application uses synchronous commit mode transactions or the journal Sync() API to ensure that a particular update is durably committed (a rough sketch of that call follows after this list).  That requires a synchronous round trip to the backup, which of course incurs any network latency.  This is discussed under Network latency considerations.

3. You'll need a strategy for IP redirection when failover occurs. For an intro to the subject, read Mirroring Configurations For Dual Data Centers and Geographically Separated Disaster Recovery.  Then see Mark Bolinsky's excellent article here on the community https://community.intersystems.com/post/database-mirroring-without-virtu....

4. You'll need a location for the arbiter that is in neither of the two data centers, as discussed in Locating the Arbiter to Optimize Mirror Availability.
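As promised in #2, here's a rough sketch of forcing a durable journal sync after a critical update.  I'm assuming the %SYS.Journal.System:Sync() classmethod here, and the global and value are placeholders:

    SET ^CriticalData(1) = "payload"                  ; placeholder update that must be durable before we proceed
    SET sc = ##class(%SYS.Journal.System).Sync()      ; flush journal buffers; this is where the synchronous round trip to the backup comes in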