Article · Jul 8, 2023 · 2m read

Character-Slice Index

A recent question from @Vivian Lee reminded me of a rather ancient example.
It dates back to the time when the first version of DeepSee was released.
We got the Bitmap index.
And we got the BitSlice index: mapping a numeric value by its binary parts.
So my idea was: why not index strings by their characters?
The result of this idea was first presented in June 2008;
iKnow wasn't publicly available at that time.

The principal idea was to split strings into their characters.
The data type %Text had some capability in this direction,
but it is designed to split a text string into words,
and that split is language dependent and requires a dictionary.

So I had to build my own dictionary. Or rather, I borrowed one:
in Japanese, every single character can be a word.
This was my starting point, and with a few adjustments it served my needs.

The example, now also available on IRIS and in Docker,
consists of a Dictionary class and a Data class for demo data.

The result is impressive even with just 3 documents of 158 lines.
I compared the normal [ (contains) operator to the character slices:

Search for a single 2-char chunk: global accesses down from 159 to 34
Search for 2 independent chunks: global accesses down from 159 to 15

The Dictionary defines which chunks are indexed.
This example uses chunks of 1 to 4 characters.
The larger the chunks, the larger the index, and the better the hit rate.
It may be worth experimenting with your own special cases.
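To make the mechanism concrete, here is a minimal sketch of a character-slice index in Python (the original example is an ObjectScript class on IRIS; the names `CharSliceIndex`, `chunks`, and `search` here are hypothetical, not taken from it). Every 1- to 4-character chunk of a stored string is mapped to the IDs of the documents containing it, so a search intersects small posting sets instead of scanning every document with the contains operator:

```python
from collections import defaultdict

MIN_CHUNK, MAX_CHUNK = 1, 4  # chunk sizes the dictionary indexes


def chunks(text, lo=MIN_CHUNK, hi=MAX_CHUNK):
    """Yield every substring of text whose length lies in lo..hi."""
    for size in range(lo, hi + 1):
        for start in range(len(text) - size + 1):
            yield text[start:start + size]


class CharSliceIndex:
    """Toy character-slice index: chunk -> set of document IDs."""

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, doc_id, text):
        for chunk in chunks(text):
            self.index[chunk].add(doc_id)

    def search(self, *terms):
        """IDs of documents containing every term (each term at most
        MAX_CHUNK chars); intersecting postings replaces a full scan."""
        sets = [self.index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()


idx = CharSliceIndex()
idx.add(1, "character slice")
idx.add(2, "bitslice index")

print(idx.search("ce"))        # both documents contain the chunk "ce"
print(idx.search("ar", "ce"))  # only document 1 also contains "ar"
```

The trade-off mentioned above shows up directly: raising `MAX_CHUNK` multiplies the number of postings per document (a larger index) but lets longer search terms hit a single posting set instead of being intersected from shorter chunks.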

GitHub
