Cheap object hashing?

Problem: a Caché instance periodically connects to an Oracle database via JDBC and downloads data. Sometimes the data on the remote server changes.

Currently the loader code follows this logic:

  1. Load the next row
  2. Check whether the row already exists in Caché; depending on the result, instantiate a new object or open the existing one
  3. Set the object properties from the row data
  4. Save the object

But I don't want to reload all the data every time, so I got this idea:

  1. Load the next row
  2. Concatenate all row data into a string with some separator and calculate a CRC32 hash with $zcrc
  3. Check whether the row exists. If not, instantiate a new object; otherwise get the stored hash and compare it with the calculated one. If the hashes match, go to the next row; if not, open the existing object
  4. Set the object properties from the row data
  5. Save the object
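
The steps above might be sketched roughly like this (a sketch only: MyApp.Record, its SourceId/RowHash properties, and the assumption that SourceId is the IdKey are all made up for illustration):

```objectscript
/// Sketch of the hash-compare step for one row. Assumes MyApp.Record is a
/// persistent class with an IdKey index on SourceId and a RowHash property.
ClassMethod LoadRow(id As %String, rowData As %String) As %Status
{
    set hash = $zcrc(rowData, 7)
    if ##class(MyApp.Record).%ExistsId(id) {
        set obj = ##class(MyApp.Record).%OpenId(id)
        // hash unchanged: skip this row without rewriting the object
        if obj.RowHash = hash return $$$OK
    } else {
        set obj = ##class(MyApp.Record).%New()
        set obj.SourceId = id  // IdKey property (hypothetical)
    }
    // ... set the remaining object properties from the row data ...
    set obj.RowHash = hash
    return obj.%Save()
}
```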

But I'm not really sure if CRC is the best solution, maybe there are some better approaches?

Maybe I can hash the row object itself (it's an EnsLib.SQL.Snapshot, but I would also be interested in a generic strategy to get a hash for an object of any class)?

I don't have write access to the Oracle database, and there is no "last change date" information to rely on.


Answers

The internal structure of EnsLib.SQL.Snapshot allows easy direct access to the current row as a $LISTBUILD list via EnsLib.SQL.Snapshot.GetRowList(). This could be used to calculate a hash, but you would need to manage this in the business host's code. Because the amount of data to transport is not small in all cases, I would suggest evaluating the following approach:

Obviously it is not easy to judge whether a row from an external DB has changed or not. The easiest approach, if possible at all in such scenarios, would be to ask for an additional field in the external database table which gets filled on the external database side with a last-update timestamp or a version id (like Caché has for objects). Then the SQL statement could select only the new/changed records. This would minimize the amount of data to transport compared to the original proposal.
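
If such a column existed (which, per the question, it unfortunately does not), the incremental pull could look roughly like this inside a business operation using EnsLib.SQL.OutboundAdapter; the table and column names are hypothetical:

```objectscript
// Hypothetical incremental query: only fetch rows changed since the last sync.
// REMOTE_TABLE and LAST_UPDATED are assumed names, not from the actual schema.
set sql = "SELECT ID, NAME, VALUE FROM REMOTE_TABLE WHERE LAST_UPDATED > ?"
set sc = ..Adapter.ExecuteQueryParm(.rs, sql, lastSyncTime)
// ... iterate rs, save the new/changed rows, then advance lastSyncTime ...
```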

> EnsLib.SQL.Snapshot.GetRowList().

Thank you.

> additional field in the external database

Would have been nice, but not happening unfortunately.

Isn't the algorithm you describe going to lead to data discrepancies? In particular, you have roughly a 1 in 2^32 chance of missing an update because it hashes to the same CRC value. Maybe this was already obvious to you and it's okay for some reason, but I thought I should say something just in case...

Of course you could use a cryptographic hash function, like $system.Encryption.SHAHash(), but that takes substantial computation time, so you might not be any better off than you would be by actually opening the object and comparing the values directly.  It sounds like either way you're already resigned to traversing every object in the source database. (If the source database is growing then this entire approach won't work indefinitely of course)


Yes, I'm aware of that. There aren't that many records (hundreds of thousands tops). Still, I decided to do a comparison:

ClassMethod Time(count = 100000000)
{
    Set str = "111111111111111111111111111111111111111111111111111111111"
    Set time1 = $P($h, ",", 2)
    For i=1:1:count {
        Set a = $zcrc(str, 7)
    }
    Set time2 = $P($h, ",", 2)
    For i=1:1:count {
        Set a = $System.Encryption.MD5Hash(str)
    }
    Set time3 = $P($h, ",", 2)
    For i=1:1:count {
        Set a = $System.Encryption.SHA1Hash(str)
    }
    Set time4 = $P($h, ",", 2)
    For i=1:1:count {
        Set a = $System.Encryption.SHAHash(256, str)
    }
    Set time5 = $P($h, ",", 2)
    Write !,"CRC: ",time2-time1,!,"MD5: ",time3-time2,!,"SHA1: ",time4-time3,!,"SHA2: ",time5-time4
}

It outputs the following results:

CRC: 14
MD5: 72
SHA1: 119
SHA2: 140

Follow up on that.  Can I store MD5/SHA in a database as is or do I need to convert it to base64 first?

Yes, for sure. A global node can have raw binary data as its value (and also as a subscript in most global collations, though the length of subscripts is substantially restricted). Also, $listbuild can have binary data as list elements. If you're storing it as a property of a persistent class, you can use %Binary as the type.
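
For the %Binary case, a minimal sketch (the class and property names here are made up):

```objectscript
Class MyApp.Hashed Extends %Persistent
{

/// Raw digest bytes stored as-is (MD5 = 16 bytes, SHA-256 = 32 bytes);
/// no base64 encoding is needed
Property RowHash As %Binary(MAXLEN = 32);

}
```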

Comments

Here's a CRC hashing method (it accepts any number of arguments):

ClassMethod hash(in...)
{
    set delimiter = $c(255)
    set str = delimiter
    for i=1:1:in {
        set str = str _ in(i) _ delimiter
    }
    return $zcrc(str, 7)
}

I think your code is vulnerable to a <MAXSTRING> error in the case of records with a large number of long values.

How about this instead, which might be a bit faster too?

set crc = 0
for i=1:1:in {
    set crc = $zcrc($char(i#256)_in(i), 7, crc)
}
return crc

Prefixing the input string of each $zcrc call with a character derived from the argument position is intended to prevent an unchanged CRC in the event that a substring has been moved from the beginning of one argument to the end of the previous argument (or from the end of one argument to the beginning of the next).

The #256 is probably overkill because (a) you might be on a Unicode instance of Caché and (b) it might not even be possible to pass more than 255 arguments to the method (I haven't investigated).

A maximum of 254 arguments can be passed to a method:

ClassMethod Gen(count = 255)
{
    set out = "w ..hash("
    for i=1:1:count-1 {
        set out = out _ "1,"
    }
    set out = out _ "1)"
    xecute out
}

ClassMethod hash(in...)
{
    set crc = 0
    for i=1:1:in {
        set crc = $zcrc($char(i#256)_in(i), 7, crc)
    }
    return crc
}