I said that only because managing database size with SLM can be painful operationally: having to predict where the growth is going to be and coordinate a configuration change in advance of the new mapping range getting used by the application.  I did not mean to imply that anything bad happens when you do this.  In fact, if the growth of a global isn't bounded by some natural data lifespan, or some application-level archival process, then SLM is unavoidable with a sufficient rate of growth.  By planning in advance for the growth, though, and starting the largest expected globals mapped to their own databases, you might stave that off for a long time. 

Note: there's a little runtime cost to resolving SLM that doesn't exist for (whole) global mapping, but it's generally a noise-level cost unless you've generated a very complex set of mappings (more complex than you'd likely create as a manual configuration step).

From my perspective, the main reason to run integrity check is so that if you ever did have database degradation, you know that you have a backup you can recover from.  I've seen too many disasters where the corruption discovered predates any available backup.  For use cases that would never recover from backup, mirrored copies, or the like for disaster recovery, you might reasonably argue that integrity check isn't worth the effort/cost.

(As a detail, just accessing a corrupted global won't hang the system, but the system will hang if corruption causes a SET or KILL to fail in the middle of a multi-block update.)

Anyway, to your good thoughts about possible enhancements:

  • It turns out that one of my recent enhancements, as yet unreleased, did open up the possibility of a "pointer block only" check (as a side effect of a different goal).  However, I don't think it's very valuable, because pointer blocks make up a very small fraction of all blocks.  For typical patterns of subscripts, there are in the neighborhood of 300-500 data blocks pointed to by a pointer block in 8KB databases, so you're talking about ~0.2% of all the blocks.  I don't think you'd draw any meaningful conclusion from a clean check that didn't include data blocks.  Don't be confused by the fact that most integrity check errors start with "Error while processing pointer block %d"; that's just the way integrity check works.  The vast majority of those are from a bottom pointer block and were found only because integrity check read every data block under it to find the inconsistency.
  • We do actually have some protection against errors in the most recently written blocks (following a crash) via the Write Image Journal block comparison.  It's a totally different mechanism, but it is designed with the thought that when systems lose power, there's some history of drives dropping or corrupting the most recent writes despite having promised (in response to our very careful use of fsync() and similar mechanisms) that those writes had already succeeded.
  • About piggy-backing on incremental change tracking: it's an interesting idea, but again I worry that many of the failure modes that lead to corruption wouldn't necessarily get uncovered, and so it doesn't give the guarantee you need from integrity check in order to know that a backup image could be relied upon in a disaster.

I think integrity check isn't the primary driver of that architectural decision, but it might be part of the consideration.  Any single database is constrained to a max size of 2^32 blocks, so 32TB for the standard 8KB block size.  There are practical reasons not to go anywhere near that high: backup/restore and other operational tasks on a single database may be more onerous; AIX/JFS2 has a 16TB file limit anyway; integrity check has less ability to be parallelized if the huge database is also primarily a single global; and if you're running older versions there are a couple of bugs involving databases that have more than 2^31 blocks (all fixed in the latest maintenance kits).

Given these and other considerations, I believe most sites shoot for max database sizes somewhere between 2 and 10 TB.  So for 100TB we're talking about a few dozen databases.  You'd hope that much data, especially if it's largely in active use, is spread over a significant number of different globals (e.g. many tables and their indices).  Ideally you use global mappings in anticipation of such huge growth to organize the globals into databases, and as much as possible avoid the need to use subscript level mapping (SLM) to manage growth of a single global across multiple databases.  If growth is unbounded, though (i.e. this isn't the sort of data that can eventually be moved to some separate archive structure), then subscript level mapping across these dozen or more databases becomes inevitable.

As for running integrity check on that much data, it will take some substantial time, and you need to find the balance of how frequently you want to run it, how much storage bandwidth it's reasonable for it to consume, and whether you can run it on an offline copy.  Since the other factors I mentioned already put you into having a multitude of separate databases (with any giant globals spread over some number of them via SLM), integrity check will be able to be well parallelized.

Short answer: yes, you can certainly do this if you want to and the result is valid.  The main downside, in my opinion, is that the backup is then dependent on more technology, so there are more things that could go wrong.  More on that later.

If you're going to do this though, you really don't want to end up with Online Backup as your backup solution.  The problem with online backup is not consumption of resources, but time to restore.  I thought you were going to say you wanted the DR system so that you could shut it down for a couple hours while you take a cold external backup.  That would be a pretty good reason to do this.

Since mirrored databases record their journal location inside the database, they intrinsically know from what journal file they need to "catch up" (the mirror checkpoint info).  Like all the usual backup solutions, the result is not transactionally consistent in and of itself, but requires journal restore following backup restore to get to a transactionally consistent state.  Mirroring makes this easier via the aforementioned checkpoint and the automatic rollback that happens as part of becoming primary.  Of course it's the mirror journal files, not the DR's own journal files, that will be used for this, but they live in the same directory, so if you just include that directory in the same backup, you'll have the right stuff if it ever comes to restoring this.

Now more about those downsides.  Backing up a replica means that you are subject to any problems with the replication.  For example, if a database on the DR had a problem and we had to stop dejournaling to it, that could mean your backup isn't good, and you'd worry a bit that you wouldn't notice because nobody is running on the DR system.  Or if you add a database to the primary but forget to add the same one to the DR, your backup won't have it.  These aren't meant to say this is a bad idea, but it is a consideration.  You want to think a bit about what you're trying to protect against.  You're talking about having a DR, so if you're restoring backup it means that something went wrong with both the primary and the DR.  Is the backup of the DR good in that situation?  If both are in the same physical location and you're backing up in case that location is destroyed, then you're protected.  Or if you're backing up to handle the case of errant/malicious deletion of data, then you're protected.

I don't know what your situation is with the main server, but I'd be curious how the system architect expects backups to take place and how long a backup of the disks is expected to take.  With large global buffers, ExternalFreeze() can be workable in some application environments even if the freeze will last many minutes.  If your operating environment is such that good backups are an absolute must, you might be better off investing in getting external backup working over there.
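
For what it's worth, here's a minimal sketch of that external-backup window using Backup.General (the snapshot step itself is storage-vendor specific and the error handling is intentionally simplified):

 ; freeze physical database writes; updates accumulate in global buffers while frozen
 set sc=##class(Backup.General).ExternalFreeze()
 if sc {
     ; take the storage-level snapshot here, outside of Caché (vendor-specific)
     set sc=##class(Backup.General).ExternalThaw()
 } else {
     write "ExternalFreeze failed: ",$system.Status.GetErrorText(sc),!
 }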

Ah, I think we found the confusion!  Canonical number and internal type are different concepts.  A canonical number can have internal string type.  An internal numeric type (int, float, double) will always be canonical.  What do you want your assert to say if your method did this...

 set $p(canonicaldata,",",2)=+$p(data,",",2)
 set test.Amount=$p(canonicaldata,",",2)

Now test.Amount is canonical, but also a string so

>w test.Amount=0.1,!,test.Amount=.1,!,test.Amount=".1"
1
1
1

What should your assert method say about that?  OK or NOT OK?  If OK, then you want it to test that actual=+expected.  If not OK, then you want one of the tricks that breaks this abstraction.

This is just definitional.  By "fail" I meant generate an assertion failure, and it will do so for any canonical number if it happens to be stored internally as a string.  You've recently been saying this is what you want, so I accept that.  This is going full circle again, but on the off chance that this is helpful to you or someone else, I'll take one last shot at explaining why I think that definition is not desirable.  Consider that I write the following method:

ClassMethod foo() As %Float {
  set x=1.1 ; x is a number in canonical form
  set $piece(a,",",1)=x
  ; ... other stuff ...
  quit $piece(a,",",1)
}

This method is perfectly correct in returning a floating point number.  It will also be in canonical form, so that it will test as = against any other canonical copy of 1.1 that you have.  But your assertion code will say the return value of my method doesn't equal 1.1 because it happens to internally have string type.  You would tell me that I should change my code to return +$piece(a,",",1) instead, but that is strictly not necessary.  The difference is only visible if you break the typeless abstraction layer and find a trick (like you've done) to peek into the internals.

You can certainly define your requirement to be stricter than this, as you have, and say that you want to require that the number would act as a number in one of the special functions that can tell the difference ($LB, $ZH, $ZB(), dynamic arrays).  That's a fine definition, but it is special.  So it comes down to where you check this assertion.  Most COS programmers I know would not use the unary + in my method; rather they would use the unary + upon passing that value to one of the aforementioned special functions.

The definition I thought you were originally going for (when you liked the sorts-after suggestion) would be to accept any number that will evaluate as = to a copy of itself that had been passed through arithmetic operators, and for that the answer is to test value=+value.  (Side note: v=+v is better for this than sorts after $c(0) because it is invariant and meets my definition for things like "1111222233334444555566667777".)

I promise this is the last thing I'll say on this topic :) But..

1. This has different results than John Murray's sorts-after suggestion that you originally liked so much.  And now that I understand what you're doing, I too like that suggestion much better (just make sure the local collation is what you want) since it at least plays by the COS rules.  The difference is that the method above will fail numbers in canonical form just because they happen to have string type under the covers.  John's suggestion will properly pass all canonical numbers regardless of how they came to be.

2. For anyone who might come along later and encounter this answer, we should warn them that this is for Sean's highly specialized purposes, relies on internal implementation details that may change, and in general is specifically intended to break an abstraction layer that COS otherwise provides.

Hi Sean,

OK. I don't know of any direct way to access a variable's type.  Last little bit of food for thought...

Even if there were such a function, though, I'd consider it an internal detail that wouldn't necessarily be reliable.  Take as a trivial example 'set x="1234",x=x+0'.  Today, under the covers, x starts out as a string and then changes to an integer when it gets assigned the result of the addition operation.  You could imagine a future where a compile- or run-time optimization notices that it can just leave x unchanged with its string type "1234".  This is entirely an implementation detail, and the optimization wouldn't violate any rules of the language.  Note that in the case of 'set x="0.5",x=x+0', we would be obligated to leave x with the value ".5", not "0.5", due to the canonicalization rules, but even then we're not obligated to internally make it a floating point type rather than a string type.

Would we ever really do this?  I don't know.  Unfortunately because there are things like $LB and $ZHEX that expose bits of these internal details in some fashion, you'd worry about compatibility implications.  But fundamentally, the internal type is just a detail for the Caché virtual machine to manage internally in doing whatever it needs to do to present the typeless COS language to the application.

Sean, I think your post reveals a couple misunderstandings that relate to this problem.  Let me comment on a couple, though at this point, I'm not sure how helpful I'm being to you...

If "1.5"=1.5 is true, then arguably "0.5"=0.5 should also be true, but it is not. This means that developers should be wary of automatic equality coercion on floating point numbers.

It's very important to understand what's going on here because it's central to your question.  "1.5"=1.5 because 1.5 is a number in canonical form.  "0.5" does not equal 0.5 because 0.5 is a numeric literal, and so that literal gets canonicalized (to .5) before being evaluated in the equals.  This is exactly expected and well-defined and not really arguable.  Literals are one thing, but programs are most likely going to get both sides of the equality from some calculation, string extraction, or user input.  If one side of the equality was either a numeric literal or came through some numeric operation, then it is canonicalized, whereas the other side may or may not be, thus possibly failing the equality check unless you explicitly use the unary +.
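
To make that concrete, here's what those comparisons look like at the terminal, with the last one showing the unary + fixing the mismatch:

>w "1.5"=1.5,!,"0.5"=0.5,!,+"0.5"=0.5
1
0
1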

To make things a little more interesting, a persistent object will automatically coerce a %Float property to a true number value when saved. That's fine, but what if the developer is unaware that he / she is assigning a stringy float value and later performs a dirty check between another stringy float value and the now saved true float number. The code could potentially be tripped up into processing some dirty object logic when nothing has changed.

I understand exactly what you're saying here, but I want to make sure that this behavior doesn't seem mysterious.  All that's going on here is that saving an object invokes %Normalize for all the object properties before saving.  You can do the same any time you want if you have a need to do so.  Remember, though, that COS is a typeless language, so developers should absolutely NOT expect to need to manage the type of their data.  Consider that I store an integer as the second comma-delimited piece of a string.  Now I have a %Integer method where I return that piece.  All is well, and I do not need to use the unary +.  However, your sample assert method would generate a false positive failure, because the number I returned in this way internally has string type.  That's not correct, though, and you should not be writing code that tries to expose the internal type of local variables.  The fact that certain special operations must expose the internal type (like the internal $listbuild structure, $zhex, and this dynamic array typing stuff) is a detail specific to those particular functions and shouldn't be considered a backdoor to imposing types on COS, which is typeless.  (BTW, I'm not 100% convinced that it's correct for "1" to become a string in these dynamic arrays, but I'm not going to get into that!)
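
As a sketch of what I mean (the method name and record layout here are made up purely for illustration):

ClassMethod GetQuantity(rec As %String) As %Integer
{
  ; the second comma-delimited piece holds an integer; no unary + is
  ; needed -- the returned value is a perfectly good %Integer even
  ; though it happens to carry string type internally
  quit $piece(rec,",",2)
}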

If I can interpret your goals more generally, it sounds like you're trying to impose a coding convention that at certain places in your application, you want certain values to have already been normalized through the appropriate normalization for their datatype class, so that evaluation with the = operator can be used for logical equality.  You're using %Float as a specific example of that, which is interesting in that it gets into how the language canonicalizes numbers.  But one could easily imagine wanting the same thing for any arbitrary datatype, for which only the %Normalize method will do.  If that's what you're really after, then you could easily write an AssertNormalizedValue(value,datatype) which generates an assertion failure if value'=$classmethod(datatype,"%Normalize",value)... or something like that.
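
A minimal sketch of that, written here to return a pass/fail %Boolean since I don't know how your framework reports a failed assertion:

ClassMethod AssertNormalizedValue(value, datatype As %String) As %Boolean
{
  ; fails (returns 0) unless value is already in the form the datatype's
  ; own %Normalize would produce (e.g. datatype="%Library.Float")
  quit value=$classmethod(datatype,"%Normalize",value)
}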

See my other comment above, but I don't think relying on what the dynamic array implementation picks as a type to convert to is a great idea.  I'd like to see you find a solution in the core of the (typeless) language.  If you really are just trying to implement AssertNumericEquals(actual,expected), then that's simply 'if +actual'=+expected { FAILED }'.  This will pass any value 'actual' that would evaluate in an arithmetic operation as the value 'expected' would.  Similarly, if you are trying to implement AssertEqualsCanonicalNumber(actual,expected), then it's 'if actual'=+expected { FAILED }'.  That one will pass 'actual' only if it exactly is the canonicalized expected value (and thus could be compared to that number with the = operator).  If you want AssertIsCanonical(actual), that's 'if actual'=+actual { FAILED }'.  That one, of course, will pass any number in its canonical form.
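
Spelled out as code, again returning pass/fail rather than presuming how FAILED gets reported in your framework:

ClassMethod AssertNumericEquals(actual, expected) As %Boolean
{
  ; passes any actual that evaluates arithmetically to the same value as expected
  quit +actual=+expected
}

ClassMethod AssertEqualsCanonicalNumber(actual, expected) As %Boolean
{
  ; passes actual only if it is exactly the canonical form of expected
  quit actual=+expected
}

ClassMethod AssertIsCanonical(actual) As %Boolean
{
  ; passes any number that is already in its canonical form
  quit actual=+actual
}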

I'm not sure that I have a precise definition of what you are trying to achieve.  If you can define it, I might be able to help more.  However, there is some confusion in your example that I think needs clarification.

What you're dealing with here are the rules about canonical numbers.  (x=+x) will indeed evaluate whether a number is in canonical form, because the equals operator tests for exactly equal strings and the unary + converts a value to a number in canonical form.  The reason your first example above returns true is just that you set x equal to a numeric literal, so it got converted to canonical form before it even got set into the variable.  (If you look at the value in x, it would not have a leading zero.)
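
For instance, if that literal were 0.5:

>set x=0.5
>w x,!,x=+x
.5
1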

If you haven't read it before, this portion of the doc, "String-to-Number Conversion" (along with the linked references), is a pretty good treatment of this subject. http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=...

If you log the error with ^%ETN (as in DO BACK^%ETN, or LOG^%ETN, or exceptionobject.Log()), the SETs to the ^ERRORS global are done with the transaction "suspended", so that they do not roll back.  In the future, we will be exposing this functionality for use in applications.  These get recorded in the journal so that they are recovered upon a system crash or restored in a journal restore, but they are omitted from rollback.  As others have said, ^%NOJRN is not the answer because it is ignored for mirrored databases.
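
As a minimal sketch of the exception-object flavor of this (the global and variable names are just placeholders):

 TSTART
 try {
     set ^Account(id)=newBalance
     ; ... something throws an exception here ...
     TCOMMIT
 } catch ex {
     ; the SETs into ^ERRORS are done with the transaction suspended,
     ; so the logged error survives the rollback
     do ex.Log()
     TROLLBACK
 }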

I hesitate to comment on this because you know the answer, but it seems that if you're trying to determine whether a value is a number in canonical form, it's hard to beat testing that (x=+x).

I don't think we should be so excited about the suggestion of sorts after $c(0), because that introduces dependencies on the current local collation strategy.  Whatever answer you choose, I think you should require it to be invariant.

Again, "1.0" is not a canonical number; "2.2" is.  Both are valid numbers, but only one is in canonical form.  So exactly what you quoted here is the reason for this behavior.

Since both are valid numbers, you don't have to use + for any function that evaluates them as numbers or as booleans.  You do have to use + any time you desire conversion to canonical form (as for equality, array sorting, etc.).
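
A quick terminal illustration of the difference: "1.0" evaluates just fine as a number, but only its canonical form tests equal to 1.

>w "1.0"+1,!,"1.0"=1,!,+"1.0"=1
2
0
1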