Free Text Search: The Way To Search Your Text Fields That SQL Developers Are Hiding From You!*

Have some free text fields in your application that you wish you could search efficiently?  Tried using some methods before but found out that they just cannot match the performance needs of your customers?  Do I have one weird trick that will solve all your problems?  Don’t you already know!?  All I do is bring great solutions to your performance pitfalls!

As usual, if you want the TL;DR (too long; didn’t read) version, skip to the end.  Just know you are hurting my feelings.

If you open up your version of Sample.Company in the SAMPLES namespace of a recent (2015.1 or later) Caché/Ensemble/HealthShare version you will see a Mission field that is pseudo-randomly generated text.  Suppose we want to search this text field.  For the purpose of this exercise, I generated about 256,246 companies – feel free to populate some on your own and follow along.  Well you might run the following query:

SELECT * FROM Sample.Company WHERE Mission LIKE ‘% agile %’

That is quite the reasonable query, but how does it perform?  Well, with no index it clearly has to read each entry, so we get 256,277 global references for our 7854 rows returned.  This is not very good!  Let’s add an index to Sample.Company and see if we can do better.  We add the following:

Index MissionIndex on Mission;

And we build the index and run the same query.  What do we see? 279,088 global references.  

Ah, I can hear you over there: “But Kyle, that’s MORE global references!  Isn’t that bad?  I thought indexes were supposed to help!!!”

Well slow down there, 3 exclamation points is just too many. Besides, when confronted with some inconsistent behavior, it’s best to take a moment and think.  What is the cost of reading an index vs. reading the full map?  Well an index is going to be smaller, so reading the full MissionIndex is going to be a cheaper operation than doing a full table scan.  Then we only need to read pieces of the full table for display.  So while it’s more Global References, it’s less work (assuming everything is on disk).  There is, of course, a lot more to this behavior but I would have to write extensively on the Caché Block structure and that’s just WAY outside the scope of what I’m trying to accomplish here.  

OK, so our first crack and cheapening our query might have worked, but certainly less than we’d like.  We want lightning fast!  We want less global references, and we want it easy!  What can we do?  The answer is an iFind Index.

Alright, I know you don’t know anything about iKnow and this iFind thing sound suspiciously like iKnow.  You want an SQL solution and you really don’t want to learn the ins and outs of a whole new technology to get the performance you want.  Don’t worry, although iFind leverages the iKnow engine, you need precisely ZERO knowledge of iKnow to be able to use this.  Check it out, you define the index like this:

Index MissioniFind on (Mission) as %iFind.Index.Basic;

What does %iFind.Index.Basic mean?   Doesn’t matter!  Isn’t that beautiful!?  Just put it in and build the index the way you always build indexes (%BuildIndices()).  Now, with this index there’s a bit of a change in our query.  We need to tell the query to use this new fancy index like so:

SELECT * FROM Sample.Company WHERE %ID %FIND search_index(MissioniFind,’agile’)

The only thing you are entering, is the name of the index and the word you are looking for.  OK so that’s a lot to do, add an index, build it up, and use some funkier SQL than you might be used to.  Is it worth it?  Well this query gets the same 7854 rows back, but does so in 7928 global references.

7928 GLOBAL REFERENCES!  That’s barely more Global References than rows!  That’s some good stuff.  Now I can tell you all have some questions, so I’m going to not only predict the questions but answer them!

Can we combine this with other indexes?  Great question!  The answer is YES!  Index combinations work really well with this technology.  

Are there any restrictions?  Sadly, yes.  You need to have a bitmap enabled ID.  That is, the ID for your table needs to be a positive integer.  This means no compound IDs, no string IDs, etc.

Does it really work that well?  BETTER!  If you have a free text field and haven’t at least tried an iFind index, you are doing yourself a disservice!

So what about iKnow?  The iFind index is powered by the iKnow engine, however you don’t need to know anything about iKnow to be able to use iFind in your existing applications.  Just define, build, and use!

Where can I learn more?  Documentation of course!  Here: http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=GIKNOW_ifind

And Here:

http://docs.intersystems.com/latest/csp/documatic/%25CSP.Documatic.cls?PAGE=CLASS&LIBRARY=%25SYS&CLASSNAME=%25iFind.Index.Basic

Does iFind do other cool stuff? Of course! It’s powered by iKnow so it can do a lot!  You can do Fuzzy Search, Stemming/Compounding, Ranking, and you can look for iKnow entities.  If you’re interested, take a look at the %iFind.Index.Semantic and %iFind.Index.Analytic index classes.  They use more iKnow bells and whistles than the Basic one we used above. Or, check out this DC post on a demo interface that works on top of any iFind index: https://community.intersystems.com/post/iknow-demo-apps-part-5-ifind-search-portal .

Most of this is beyond the scope of this post, but if you want to see more details on any of these subjects, just ask!  You all know I’m here to inform and enlighten!

---------------------------------------------------------------------------------------------

TL;DR: If you want to do search on a free text field add an iFind index by defining it like so:

Index <IndexName> on <FreeTextField> as %iFind.Index.Basic

Build the index using %BuildIndices (as normal)

Rewrite your query like so:

…WHERE ID %FIND search_index(<IndexName>,<Search Value>) AND …

And reap the benefits of lightning fast free text search!  

*No one is hiding this from you!  Come on now, you know by now I just do it for the clicks!

  • + 19
  • 1
  • 1246
  • 11

Comments

Interesting post!

Just curious, what is the performance impact when inserting data to associated table with this iFind index in place?

Excellent question.

Of course, when you are going to be using iFind it does mean a heavier burden for your INSERT/UPDATE/DELETE functions.  It also does take up more space on disk than a traditional index.   The cost is going to depend on what kind of iFind index you are using and the size of the data being indexed.  For my example an insert went from about 5 global references to about 200.  But that is, of course, still fast due to the way Caché manages writes to disk.  On INSERT went from 0.0002 seconds to 0.0032 seconds.  So significantly slower, but still plenty fast.  If you test a use case and come up with something different, post it here!!!

Hi Kyle

When / why would I chose an iFind index over a full text index

INDEX MissionKW ON Mission(KEYS) [ TYPE=BITMAP ];

that I can search via SQL as such

SELECT * FROM Sample.Company WHERE Mission %CONTAINS ('agile')

 

That SQL code seems a lot more intuitive then the one one required by iFind where I have to know the index name.

This deserves a full answer that I will give on Monday - just want to put this here as a placeholder.  But, spoiler, iFind is better.

I sometimes like to call myself Kyle on Thursdays that feel like Mondays ;o) 

 

Like any operator whose name starts with a %, it's an InterSystems-specific one, so both %CONTAINS and %FIND were added to the standard for specific purposes. %CONTAINS has been part of Caché for a long while and indeed offers a simple means to directly refer to your %Text field. So, you don't need to know the name of your index, but you do need to make sure your text field is of type %Text, so there's still a minimum of table metadata you need to be aware of.

In the case of %FIND, we're actually leveraging the %SQL.AbstractFind infrastructure, which is a much more recent addition that allows leveraging code-based filters in SQL. In this case, the code-based filter implementation is provided by the %iFind infrastructure. Because this AbstractFind interface is, what's in a name, abstract, it needs a bit of direction in the form of the index name to wire it to the right implementation, which in our case comes as the index name. As the AbstractFind interface is expected to return a bitmap (for fast filtering), it also needs to be applied to an integer field, typically the IDKey of your class. So, while it offers a lot of programmatic flexibility to implement optimized filtering code (cf this article on spatial index), it does require this possibly odd-looking construct on the SQL side. But that's easily hidden behind a user interface, of course.

I thought IKnow required a special license. Can I really do all of this with my regular old Caché license?

Ah licensing - I should have covered that.  This does require an iKnow enabled license, but one of those should be easy to obtain, at least for development purposes.  Just contact InterSystems or your already assigned sales representative.  

Does this work with character streams also?