The 'unlimited' UNIQUE index

As you may know, the size of a global reference is limited to 511 encoded characters. See the docs:
http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=GGBL_structure#GGBL_structure_maxsubscrlen
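
A quick way to see the limit in action (a throwaway terminal sketch; the exact number of characters that still fits depends on collation and subscript encoding):

USER>set long=$translate($justify("",600)," ","x")   ; a 600-character string
USER>set ^demo(long)=1                               ; fails with a <SUBSCRIPT> error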

 

All UNIQUEness in Caché is affected by this limit, and so is every index in Caché.
If you want an index over a long text string (>250 characters), then using a hash code for indexing might be a good solution.
You have a fair chance to create an ALMOST unique index, but there is no guarantee that it really is unique, and
you have to verify your results. It is always a little bit of guessing.
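
A minimal sketch of that "almost unique" approach (just a fragment with example property names, assuming $SYSTEM.Encryption is available): index a fixed-size digest instead of the text itself and check collisions separately.

Property LongText As %String(MAXLEN = "");
/// 32-byte digest of LongText: short enough to index, but only ALMOST unique
Property TextDigest As %Binary [ SqlComputeCode = {set {*}=$SYSTEM.Encryption.SHAHash(256,{LongText})}, SqlComputed, SqlComputeOnChange = LongText ];
Index DigestIdx On TextDigest;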

Some time back (~2008) I was confronted with the requirement to have an EXACT + UNIQUE index over a text property of
2000...5000 characters, preferably unlimited. Well, unlimited is fine for mathematical or physical theories, not for computing.
So for us "large enough" had to fit. The solution I want to share is NOT unlimited, but rather large.

The idea:
Find some UNIQUE hash code for your text and use it for indexing.
To get a guarantee of uniqueness I cut any large text into index-able pieces, number the puzzles, and store and index them.
Instead of the whole text, only a (significantly smaller) list of the text puzzles is stored and indexed.

In practice a text of about 2000 characters was transformed into $lb(457,8991,22,134,5434,23,11111,5624,9001)
The reduction is obvious and the "hash" is reversible: just concatenate the text pieces based on the index list.
A friendly side effect:
- the long text is stored away from the main global.
- duplicated text is stored just once. This depends on the way you cut your puzzles.

For my example here I took a fixed length. In reality it was a character sequence in XML.
For testing I'd recommend cutting smaller puzzles; that's easier to type and to verify.
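
For example, with a puzzle length of 5 the cutting loop used below behaves like this (throwaway terminal sketch):

USER>set text="ABCDEFGHIJKL",PARTLEN=5
USER>for piece=0:1 { set part=$EXTRACT(text,piece*PARTLEN+1,piece+1*PARTLEN) quit:part=""  write piece,": ",part,! }
0: ABCDE
1: FGHIJ
2: KL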

The solution to assemble/disassemble the text has 2 pairs of methods producing equal results.

#1) toHashOBJ + fmHashOBJ, which both work purely with objects and SQL

#2) %toHash + %fmHash, which touch the globals directly and are remarkably faster.
This is the original version from the benchmark.

toHash takes any String and returns a $LB() of puzzle indices
fmHash takes a $LB() of indices and returns the original text

Needless to say, any manipulation of the indices and/or the text store can cause all kinds of confusion.

 

/// create a unique hash code for very long strings
Class User.uhash Extends %Persistent [ Final ]
{
/// use a practical part length for real implementations, but keep it below 300
/// Parameter PARTLEN = 500;
Parameter PARTLEN = 20;
Property Text As %String(COLLATION = "EXACT", MAXLEN = "");
Index UNIQUE On Text [ Unique ];
/// full object + SQL variant
ClassMethod toHashOBJ(Text As %String(MAXLEN="")) As %String(MAXLEN="")
{
  set hash=""
  for piece=0:1 {
    // cut the next puzzle; an empty part means we are done
    set part=$EXTRACT(Text,piece*..#PARTLEN+1,piece+1*..#PARTLEN)
    quit:part=""
    set hs=""
    &SQL(SELECT id into :hs from uhash where Text=:part)
    if SQLCODE<0 break   ; Houston, we have a problem
    if hs="" {
      // unknown puzzle: store it and take its new ID
      set obj=..%New()
         ,obj.Text=part
         ,tSC=obj.%Save()
      if 'tSC break    ; can't be
      set hs=obj.%Id()
    }
    set hash=hash_$LISTBUILD(+hs)
  }
  quit hash
}
ClassMethod fmHashOBJ(hash As %String(MAXLEN="")) As %String(MAXLEN="")
{
  set res=""
  for piece=1:1:$LISTLENGTH(hash) {
    set hs=+$lg(hash,piece)
    set obj=..%OpenId(hs)
    if obj set res=res_obj.Text
    else  write "invalid hash: "_hs,! break
  }
  quit res
}
/// fast global access variant
ClassMethod %toHash(Text As %String(MAXLEN="")) As %List [ SqlName = %toHash, SqlProc ]
{
  set hash=""
  for piece=0:1 {
    set part=$EXTRACT(Text,piece*..#PARTLEN+1,piece+1*..#PARTLEN)
    quit:part=""
    set hs=$ORDER(^User.uhashI("UNIQUE",part,""))
    if hs="" {
      // unknown puzzle: store the data and maintain the index global by hand
      set hs=$INCREMENT(^User.uhashD)
         ,^User.uhashD(hs)=$LISTBUILD(part)
         ,^User.uhashI("UNIQUE",part,hs)=""
    }
    set hash=hash_$LISTBUILD(+hs)
   }
  quit hash
}
/// fast global access variant
ClassMethod %fmHash(hash As %String(MAXLEN="")) As %String(MAXLEN="") [ SqlName = %fmHash, SqlProc ]
{
  set res=""
  for piece=1:1:$LISTLENGTH(hash) {
    set hs=+$lg(hash,piece)
    if (hs>0)&&$DATA(^User.uhashD(hs),value) set res=res_$li(value)
    else  write "invalid hash: "_hs,! break
  }
  quit res
}
}
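
A quick round trip in the terminal (the IDs in the list depend on what is already stored, so the $lb() shown here is purely illustrative):

USER>set h=##class(User.uhash).%toHash("The quick brown fox jumps over the lazy dog")
USER>zwrite h
h=$lb(1,2,3)
USER>write ##class(User.uhash).%fmHash(h)
The quick brown fox jumps over the lazy dog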

 

 

To try it out, this tiny class may serve as an example of how to use it.

 

Class User.tstHash Extends %Persistent [ Final ]
{
Property longText As %String(MAXLEN = "") [ Calculated,
SqlComputeCode = {set {*}=##class(uhash).%fmHash({hash})}, SqlComputed ];

Property hash As %List;

Index hashIdx On hash [ Unique ];

Method longTextSet(txt As %String(MAXLEN="")) As %Status [ ServerOnly = 1 ]
{
  set i%hash=##class(uhash).%toHash(txt)
  quit $$$OK
}
}


You can also query it directly with SQL: SELECT * FROM tstHash (run it in ODBC mode to make the %List readable).
For an INSERT you can just store the calculated hash like this:

INSERT INTO tstHash(hash) VALUES(%toHash('InterSystems Developer Community is just great for information'))
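
The exact lookup over the "unlimited" text then goes through the unique hash index, for example (a sketch, assuming the default SQLUser schema so the stored procedure can be called unqualified, as in the INSERT above):

SELECT ID FROM tstHash WHERE hash = %toHash('InterSystems Developer Community is just great for information')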

HTH, Robert

 

 


Comments

As noted there, you find duplicates
and you don't get a UNIQUE index.

The point is that a UNIQUE index can be used similarly to an IDKEY,
e.g. to open an object, ...
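
For the tstHash example above that could look like this (a sketch; hashIdxOpen() is the accessor that gets generated for the unique index hashIdx):

set hash=##class(User.uhash).%toHash(longText)
set obj=##class(User.tstHash).hashIdxOpen(hash)  ; opens the object storing exactly this text, if any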

 

If you need a unique index, it is not a problem, e.g.:

ClassMethod TextBuildValueArray(
  v,
  ByRef arr) As %Status
{
  #define sizeOfChunk 400

  s sc=$$$OK

  i v="" {
    s arr("null")=v
  } else {
    s hasDupl=$$$YES
    f i=1:1:$l(v)\$$$sizeOfChunk+($l(v)#$$$sizeOfChunk>0) {
      s arr(i)=$e(v,1,$$$sizeOfChunk),$e(v,1,$$$sizeOfChunk)=""
      s:hasDupl&&'$d(^Test.DuplicatesI("iText"," "_i,##class(%Collation).SqlUpper(arr(i)))) hasDupl=$$$NO
    }
    i hasDupl {
      k arr
      s arr("error")=$tr($justify("",$$$MaxLocalLength)," ","-"),
        sc=$$$ERROR($$$KeyNotUnique,v)
    }
  }
  q sc
}

Unfortunately, the return status of the BuildValueArray() method is ignored, so the error has to be forced.
I have not yet found a way to return a proper error instead of <MAXSTRING>.

Vitaliy,
I have tested your proposal against my requirements.
The basic idea of using just the %ID of Test.Duplicates as the required UNIQUE ID looked promising.

So I did some adjustments:
- EXACT match respecting upper / lower case (this affected both ELEMENTS and KEYS!)
- Using it embedded in %OnAddToSaveSet() requires a modification or a separate
  variant of TextBuildValueArray() (one to build, one to check).
This was just some extra coding effort.

The design still requires some extension for this case:
- The text is cut into segments.
- If by accident a new, shorter text matches a subset of a previous one, it is rejected,
  so the text length has to go into the index.
Probably not much effort to add (e.g. in front of the text), as sketched below.
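
One possible way to realize that (a sketch only, not Vitaliy's code): make the length part of the indexed value by prepending it to the text at the very top of TextBuildValueArray(), before the duplicate check and the chunk loop.

  // hypothetical extension: a shorter text can then never collide with the
  // leading segments of a longer one, because the length prefix differs
  s:v'="" v=$l(v)_"|"_v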

Finally, some no-gos for me:
- The text is stored twice: once at full size in the data, then in the index as text chunks.
  Applying the index to a calculated property failed at my first attempts.
- Duplicated text chunks are stored as often as they occur,
  so if 2 text elements differ only in the last few characters, both go to storage at full size.
- This is all nicely object based, but that makes it remarkably slow. The code looks small,
  but the number of instructions to execute is impressive.

To explain why I mention global size, storage effort and instructions:
the original performance project had a size of 1.7 billion objects.
Doubled storage for distinct text and multiple identical segments was not acceptable.
So watching every byte and every instruction, more or less, was imperative to win.

 

 

Thanks for the clarification.
For your task, my method is really not the best solution. Originally I was solving Eugene's task, which correlates poorly with yours.
To reduce the size of the string in the index it would be possible to pre-compress it with $system.Util.Compress(), but that is not the best variant either.

Hashing is, in my opinion, the best option here.

Now I would do it like this:

Class Test.BanDuplicates Extends %Persistent
{

Index iUniq On Hash [ Unique ];

Property Text As %VarString;

Property Hash As %Binary(MAXLEN = 64, MINLEN = 64) [ Calculated, Final, Internal, Private, SqlComputeCode = {set {*}=..getHash({Text})}, SqlComputed, SqlComputeOnChange = Text ];

ClassMethod getHash(Text) As %Binary [ Final, Private, ServerOnly = 1 ]
{
  s b=$l(Text) ; or 1024 (see wiki about SHAd(message) = SHA(SHA(0b || message)) (the length of 0b, zero block, is equal to the block size of the hash function))

  s stream=##class(%Stream.TmpBinary).%New()
  d stream.Write($tr($j("",b)," ",$c(0))) ; this step can be skipped
  d stream.Write(Text)
  s hash=$system.Encryption.SHAHash(512,$system.Encryption.SHAHashStream(512,stream))

  s:..iUniqExists(hash)&&(..iUniqOpen(hash).Text'=Text) hash="Congratulations, you've found a collision for SHA-512 !!!"

  q hash
}

ClassMethod Test()
{
  ;d ##class(Test.BanDuplicates).Test()
  
  d ..%KillExtent()

  s text=##class(%PopulateUtils).StringMin(200000,200000)

  &sql(insert into Test.BanDuplicates(Text)values(:text))
  w:SQLCODE "1) ",SQLCODE," : ",%msg,!

  &sql(insert into Test.BanDuplicates(Text)values(:text))
  w:SQLCODE "2) ",SQLCODE," : ",%msg,!
}

}

For older versions you will probably need to use OpenSSL via Callout, or a different hashing algorithm.
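
For example, if $system.Encryption.MD5Hash() is available on such a version, it could stand in as a (much weaker) digest. This is a sketch only; the MINLEN/MAXLEN of the Hash property would have to be changed to 16, and the collision check against the stored Text becomes even more important:

  s hash=$system.Encryption.MD5Hash(Text)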