The 'unlimited' UNIQUE index
As you will know the size of a Global reference is limited to 511 encoded characters. see docs:
All UNIQUEness in Caché is affected by this limit. So is also any Index in Caché.
If you want to have an index over a long text string (>250 char) then using a hash code for indexing might be a good solution.
You will have a fair chance to create an ALMOST UNIQUE index. But there is no guaranty that it really is unique and
you have to verify your results. You it's a little bit of guessing.
Some time back (~2008) I was confronted with the requirement to have an EXACT + UNIQUE Index over a text property of
2000...5000 character preferable unlimited. Well unlimited is good for mathematic or physical theories. Nor for computing.
So for us large enough should fit. The solution I want to share is NOT unlimited, but rather large.
Find some UNIQUE hash code for your text and used it for indexing.
To get a guaranty for uniqueness I cut any large text into index-able pieces number the puzzles and store and index it.
Instead of the whole text only a (significant smaller) list of the text puzzles is stored and indexed.
In practice a text of about 2000 characters was transformed into $lb(457,8991,22,134,5434,23,11111,5624,9001)
The reduction is obvious and the "hash" is reversible. Just concatenate the text based on the index list.
A friendly side effect:
- the long text is stored away from the main global.
- eventual duplicated text is just stored once. This is depending on the way you cut your puzzles.
For my example here I took a fixed length. In reality it was a character sequence in XML.
For testing I'd recommend cutting smaller puzzles, that's easier to type and to verify.
The solution to assemble/disassemble the text has 2 pairs of methods producing equal results.
#1) toHashOBJ + fmHashOBJ that both work just with object and SQL
#2) %toHash + %fmHash touch the globals directly and are remarkable faster.
This is the original version from the benchmark
toHash takes any String and returns a $LB() of puzzle indices
fmHash takes a $LB() of indices and returns the original text
Needless to say that manipulation of indices and/or text store can cause any kind of confusion.
To try it out this tiny class may serve as an example how to use it.
You can query it also directly with SQL [
SELECT * from tstHash
For insert you just can store the calculated hash like this:
INSERT tstHash(hash) VALUES(%toHash('Intersystens Developer Community is just great for information'))