Search InterSystems documentation using iKnow and iFind technologies

The InterSystems DBMS has a built-in technology for working with unstructured data called iKnow and a full-text search technology called iFind. We decided to take a dive into both and make something useful. The result is DocSearch, a web application for searching InterSystems documentation using iKnow and iFind.

How Caché Documentation works

Caché documentation is based on the DocBook technology. It has a web interface, which includes a search that uses neither iFind nor iKnow. The articles themselves are stored in Caché classes, which allows us to run queries against this data and, of course, to create our own search tool.

What are iKnow and iFind

InterSystems iKnow is a technology for analyzing unstructured data. It provides access to this data by indexing the sentences and entities in it. To start the analysis, you first need to create a domain (a storage for unstructured data) and load text into it.

The iFind technology is a module of the Caché DBMS for performing full-text search in Caché classes. iFind uses many iKnow classes for intelligent text search. To use iFind in your queries, you need to introduce a special iFind index in your Caché class.

There are three types of iFind indexes, each offering all the functions of the previous type plus some additional ones:

  • Basic index (%iFind.Index.Basic): supports the search for words and word combinations.
  • Semantic index (%iFind.Index.Semantic): supports the search for iKnow entities.
  • Analytic index (%iFind.Index.Analytic): supports all iKnow functions of the semantic index, as well as information about paths and word proximity.

 

Since the documentation classes are stored in a separate namespace, the installer also maps the required packages and globals to make them available in ours.

Installer code for mapping

 
XData Install [ XMLNamespace = INSTALLER ]
{
<Manifest>
<!-- Specify the name of the namespace -->
<IfNotDef Var="Namespace">
<Var Name="Namespace" Value="DOCSEARCH"/>
<Log Text="Set namespace to ${Namespace}" Level="0"/>
</IfNotDef>

<!-- Check whether the namespace already exists -->
<If Condition='(##class(Config.Namespaces).Exists("${Namespace}")=1)'>
<Log Text="Namespace ${Namespace} already exists" Level="0"/>
</If>

<!-- Create the namespace if it does not exist -->
<If Condition='(##class(Config.Namespaces).Exists("${Namespace}")=0)'>
<Log Text="Creating namespace ${Namespace}" Level="0"/>
<Namespace Name="${Namespace}" Create="yes" Code="${Namespace}" Ensemble="" Data="${Namespace}">
<Log Text="Creating database ${Namespace}" Level="0"/>

<!-- Create the database, then map the DOCBOOK classes and globals to the new namespace -->
<Configuration>
<Database Name="${Namespace}" Dir="${MGRDIR}/${Namespace}" Create="yes" MountRequired="false"
Resource="%DB_${Namespace}" PublicPermissions="RW" MountAtStartup="false"/>
<Log Text="Mapping DOCBOOK to ${Namespace}" Level="0"/>
<GlobalMapping Global="Cache*" From="DOCBOOK" Collation="5"/>
<GlobalMapping Global="D*" From="DOCBOOK" Collation="5"/>
<GlobalMapping Global="XML*" From="DOCBOOK" Collation="5"/>
<ClassMapping Package="DocBook" From="DOCBOOK"/>
<ClassMapping Package="DocBook.UI" From="DOCBOOK"/>
<ClassMapping Package="csp" From="DOCBOOK"/>
</Configuration>

<Log Text="End creating database ${Namespace}" Level="0"/>
</Namespace>
<Log Text="End creating namespace ${Namespace}" Level="0"/>
</If>

</Manifest>
}

The domain required for iKnow is built on the table containing the documentation. Since we use a table as the data source, we'll use the SQL lister (%iKnow.Source.SQL.Lister). The content field contains the documentation text, so let's specify it as the data field. The rest of the fields will be passed as metadata.

Installer code for creating a domain

 
ClassMethod Domain(ByRef pVars, pLogLevel As %String, tInstaller As %Installer.Installer) As %Status
{
	#Include %IKInclude
	#Include %IKPublic
	// Switch to DOCSEARCH; "new $Namespace" restores the original namespace on exit
	new $Namespace
	set $Namespace = "DOCSEARCH"
	// Create the domain unless it already exists
	set dname = "DocSearch"
	if ##class(%iKnow.Domain).Exists(dname) {
		write "The ",dname," domain already exists",!
		quit $$$OK
	}
	write "The ",dname," domain does not exist",!
	set domoref = ##class(%iKnow.Domain).%New(dname)
	set stat = domoref.%Save()
	if $$$ISERR(stat) {
		do $System.Status.DisplayError(stat)
		quit stat
	}
	set domId = domoref.Id
	// The lister selects the sources that correspond to the records in the query results
	set flister = ##class(%iKnow.Source.SQL.Lister).%New(domId)
	set myloader = ##class(%iKnow.Source.Loader).%New(domId)
	// Build the query
	set myquery = "SELECT id, docKey, title, bookKey, bookTitle, content, textKey FROM SQLUser.DocBook"
	set idfld = "id"
	set grpfld = "id"
	// Specify the fields for data and metadata
	set dataflds = $LB("content")
	set metaflds = $LB("docKey", "title", "bookKey", "bookTitle", "textKey")
	// Put all the data into the lister
	set stat = flister.AddListToBatch(myquery, idfld, grpfld, dataflds, metaflds)
	if $$$ISERR(stat) {
		write "The lister failed: ",!
		do $System.Status.DisplayError(stat)
		quit stat
	}
	// Start the analysis process
	set stat = myloader.ProcessBatch()
	if $$$ISERR(stat) {
		do $System.Status.DisplayError(stat)
		quit stat
	}
	set numSrcD = ##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
	write "Done",!
	write "Domain contains ",numSrcD," source(s)",!
	quit $$$OK
}

To search in documentation, we use the %iFind.Index.Analytic index:

Index contentInd On (content) As %iFind.Index.Analytic(LANGUAGE = "en", 
LOWER = 1, RANKERCLASS = "%iFind.Rank.Analytic");

Here contentInd is the name of the index and content is the name of the field that we are creating the index for. The parameters have the following meaning:

  • LANGUAGE = "en" sets the language of the text.
  • LOWER = 1 turns off case sensitivity.
  • RANKERCLASS = "%iFind.Rank.Analytic" enables the TF-IDF result ranking algorithm.

After adding and building such an index, it can be used in SQL queries, for example. The general syntax for using iFind in SQL:

SELECT * FROM TABLE WHERE %ID %FIND
 search_index(indexname,'search_items',search_option)

After the %iFind.Index.Analytic index is created with these parameters, several SQL procedures are generated, named according to the pattern [table_name]_[index_name][procedure_name].

In our project, we use two of them:

  • DocBook_contentIndRank: returns the result of the TF-IDF ranking algorithm for a request. The procedure has the following syntax:
    SELECT DocBook_contentIndRank(%ID, 'SearchString', 'SearchOption') Rank FROM DocBook
     WHERE %ID %FIND search_index(contentInd, 'SearchString', 'SearchOption')
  • DocBook_contentIndHighlight: returns the search results, where the searched words are wrapped in the specified tag:
    SELECT DocBook_contentIndHighlight(%ID, 'SearchString', 'SearchOption', 'Tags') Text FROM DocBook
     WHERE %ID %FIND search_index(contentInd, 'SearchString', 'SearchOption')

I will go into more detail later in the article.

Here is what we have in the end:

  1. Autocomplete in the search field

    As you start entering text into the search field, the system suggests possible queries to help you find the necessary information more quickly. These suggestions are generated from the word (or the beginning of the word) that you type. The system shows the ten best-matching words or phrases. This feature uses iKnow, specifically the %iKnow.Queries.EntityAPI.GetSimilar method.

    image
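
To make the idea concrete, here is a minimal Python sketch of prefix-based suggestions. It is only an illustration and not how iKnow implements it: in DocSearch the matching and ranking are done by iKnow, while the `suggest` function, the sample entity list, and the frequency-based ordering below are assumptions made for this example.

```python
from collections import Counter


def suggest(entities, typed, limit=10):
    """Return up to `limit` entities that contain a word starting with `typed`.

    More frequent entities come first (hypothetical ranking, for illustration).
    """
    prefix = typed.strip().lower()
    if not prefix:
        return []
    # Count how often each entity occurs in the (hypothetical) corpus
    counts = Counter(entity.lower() for entity in entities)
    matches = [
        (entity, freq)
        for entity, freq in counts.items()
        if any(word.startswith(prefix) for word in entity.split())
    ]
    # Sort by descending frequency, then alphabetically for a stable order
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [entity for entity, _ in matches[:limit]]


# Hypothetical entities extracted from documentation text
entities = ["ifind index", "iknow domain", "ifind index", "index", "rest api", "domain"]
print(suggest(entities, "ind"))
```

The `limit=10` default mirrors the ten suggestions shown in the search field.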

     

  2. Fuzzy string search

    iFind supports fuzzy search for finding words that almost match the search query. This is achieved by measuring the Levenshtein distance between two words. The Levenshtein distance is the minimal number of one-character changes (insertions, deletions or substitutions) necessary to turn one word into another. It can be used to handle typos, small variations in spelling, and different grammatical forms (plural and singular, for example).

    In iFind SQL queries, the search_option parameter controls fuzzy search. search_option = 3 denotes a Levenshtein distance of 2. To set a Levenshtein distance of n, set the search_option parameter to '3:n'. Documentation search uses a Levenshtein distance of 1. To demonstrate how it works, let's type “ifind” in the search field:

    image

     

    Let's try a fuzzy search by intentionally making a typo. As we can see, the search corrected the typo and found the necessary articles.

    image
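
For readers unfamiliar with the metric, the following Python sketch computes the Levenshtein distance with the standard dynamic-programming approach and uses it for a fuzzy comparison. It is independent of iFind's internal implementation, which is not shown in this article; the function names and the sample typo “ifnd” are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of one-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query: str, word: str, max_distance: int = 1) -> bool:
    """Case-insensitive match that tolerates up to `max_distance` edits."""
    return levenshtein(query.lower(), word.lower()) <= max_distance


print(levenshtein("ifnd", "ifind"))  # one missing letter
print(fuzzy_match("ifnd", "iFind"))
```

The default `max_distance=1` corresponds to the distance of 1 that the documentation search uses.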

     

  3. Complex searches

    Because iFind supports complex queries with brackets and the AND, OR and NOT operators, we were able to implement complex search functionality. Here's what you can specify in your query: a word, a word combination, one of several words, and words to exclude. The fields can be filled in one by one or all at once.

    For example, let's find articles that contain the word “iknow”, the combination “rest api”, and either “domain” or “UI”.

    image

     

    We can see that there are two such articles:

    image

     

    Please note that the second one mentions Swagger UI, so we can modify the query to exclude articles that contain the word Swagger.

    image

     

    As a result, we find only one article:

    image
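
As a rough sketch of how the four form fields might be combined into a single search string, consider the Python helper below. The function name, its parameters, and the exact operator syntax it emits (double-quoted phrases, `AND NOT` for exclusions) are assumptions for illustration; the real DocSearch code and the precise iFind search_items syntax may differ, so check the iFind documentation before relying on it.

```python
def build_search_string(word="", phrase="", any_of=(), exclude=()):
    """Combine the form fields into one boolean search string (hypothetical syntax)."""
    parts = []
    if word.strip():
        parts.append(word.strip())
    if phrase.strip():
        # Assumption: a word combination is passed as a double-quoted phrase
        parts.append('"' + phrase.strip() + '"')
    alternatives = [term.strip() for term in any_of if term.strip()]
    if len(alternatives) == 1:
        parts.append(alternatives[0])
    elif alternatives:
        # Brackets keep the OR group together when combined with AND
        parts.append("(" + " OR ".join(alternatives) + ")")
    query = " AND ".join(parts)
    for term in (t.strip() for t in exclude if t.strip()):
        query = f"{query} AND NOT {term}" if query else f"NOT {term}"
    return query


# The example from the text: "iknow", "rest api", either "domain" or "UI", without "Swagger"
print(build_search_string("iknow", "rest api", ["domain", "UI"], ["Swagger"]))
```

Empty fields are simply skipped, which matches the behavior described above where fields can be filled in one by one or all at once.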

     

  4. Search results highlighting

    As stated above, the use of an iFind index creates the DocBook_contentIndHighlight procedure. Let's use the following query:

    SELECT DocBook_contentIndHighlight(%ID, 'search_items', '0', '<span class="Illumination">', 0) Text FROM DocBook

    It returns the text with each match wrapped in the tag

    <span class="Illumination">

    This helps you to visually mark search results on the front-end.

    image
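
To show the effect of such highlighting outside the database, here is a small Python sketch that wraps matched words in a tag. It only imitates the visible result: the `highlight` function, its regular-expression matching, and the closing `</span>` it appends are assumptions for this example, not the behavior of the generated SQL procedure.

```python
import re


def highlight(text, words, open_tag='<span class="Illumination">', close_tag="</span>"):
    """Wrap every whole-word, case-insensitive occurrence of `words` in a tag."""
    terms = [re.escape(word) for word in words if word]
    if not terms:
        return text
    # Lookarounds instead of \b so that terms such as "$case" also match
    pattern = re.compile(r"(?<!\w)(" + "|".join(terms) + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


print(highlight("iFind builds an index", ["ifind"]))
```

The original capitalization of each match is preserved, so only the markup changes.
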
  5. Search results ranking

    iFind is capable of ranking results using the TF-IDF algorithm. TF-IDF is often used in text analysis and information retrieval tasks, for example as a measure of how relevant a document is to a search query.

    In the result of the SQL query, the Rank field contains the weight of the word. The weight is proportional to the number of times the word is used in an article and inversely proportional to the frequency of the word's occurrence in other articles.

    SELECT DocBook_contentIndRank(%ID, 'SearchString', 'SearchOption') Rank FROM DocBook
    WHERE %ID %FIND search_index(contentInd, 'SearchString', 'SearchOption')
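
The weighting can be illustrated with a textbook TF-IDF calculation in Python. This is one common variant of the formula, chosen for illustration; the exact formula used by %iFind.Rank.Analytic is not given in this article and may differ. The function name and the tiny corpus are hypothetical.

```python
import math


def tf_idf(term, document, corpus):
    """Textbook TF-IDF: term frequency in the document times log inverse document frequency."""
    term = term.lower()
    words = document.lower().split()
    if not words:
        return 0.0
    # Term frequency: share of the document's words that are the term
    tf = words.count(term) / len(words)
    # Number of documents in the corpus that contain the term at least once
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf


# Hypothetical three-article corpus
corpus = ["ifind index ifind", "iknow domain", "index of documents"]
print(tf_idf("ifind", corpus[0], corpus))  # frequent here, rare elsewhere
print(tf_idf("index", corpus[0], corpus))  # less frequent here, also used elsewhere
```

In this corpus “ifind” scores higher than “index” for the first article, because it occurs more often in that article and in no other one.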
  6. Integration with the official documentation search

    After installation, a “Search using iFind” button will be added to the official documentation search.

    image

     

    If the “Search words” field is filled, you will be taken to the search results page after clicking the “Search using iFind” button.

    If the field is empty, you will be taken to the new search page.

Installation

  1. Download the Installer.xml file from the latest release available on the corresponding page.
  2. Import the downloaded Installer.xml file into the %SYS namespace and compile it.
  3. Enter the following command in the terminal in the %SYS namespace:
    do ##class(Docsearch.Installer).setup(.pVars)

After that, the search will be available at the following address: localhost:[port]/csp/docsearch/index.html

Demo

An online demo of the search is available here.

Conclusion

This project demonstrates interesting and useful capabilities of the iFind and iKnow technologies that make data search more relevant. Any comments or suggestions will be highly appreciated. The entire source code, with the installer and the deployment guide, is available on GitHub.

Comments

Hi Konstantin,

thanks for sharing your work, a nice application of iFind technology! If I can add a few ideas to make this more lightweight:

  • Rather than creating a domain programmatically, the recommended approach for a few versions now has been to use Domain Definitions. They allow you to declare a domain in an XML format (not much unlike the %Installer approach) and avoid a number of inconveniences in managing your domain in a reproducible way.
  • From reading the article, I believe you're just using the iKnow domain for that one EntityAPI:GetSimilar() call to generate search suggestions. iFind has a similar feature, also exposed through SQL, through %iFind.FindEntities() and %iFind.FindWords(), depending on what kind of results you're looking for. See also this iFind demo. With that in place, you may even be able to skip those domains altogether :-)

thanks,
benjamin

Thanks for posting this Konstantin. For a long time I have been wondering why InterSystems hadn't done this already.

I've had something simple running on my laptop already a long time ago, but the internal discussion on how to package it proved a little more complicated. Among other things, an iFind index requires an iKnow-enabled license (and more space!), which meant you couldn't simply include it in every kit.

Also, for the ranking of docbook results, applying proper weights based on the type of content (title / paragraph / sample / ...) was at least as important as the text search capabilities themselves. That latter piece has been well-addressed in 2017.1, so docbook search is in pretty good shape now. Blending in an easily-deployable iFind option as Konstantin published can only add to this!

Thanks,
benjamin

Hi, Konstantin!

I tried to search for the word $Case. It finds it, but it shows a strange option in the dropdown list of the search field. See the screenshot:

What does it mean?

Hi, Evgeny!
I used iKnow entities as the words in the dropdown list of the search field. iKnow treats "$case( $extract( units, 1" as an entity, which is why it looks strange.
But I would like to use %iFind.FindEntities() (the idea from Benjamin DeBoe's first comment) for the words in the dropdown list soon. I think that will fix this.

iKnow was written to analyze English rather than ObjectScript, so you may see a few odd results coming out of code blocks. I believe you can add a where clause excluding those records from the block table to avoid them.

Now I use %iFind.FindEntities to get the words for the dropdown list of the search field. Installation has become faster than before, because I no longer use the domain building process.

Hi, Konstantin!

The problem with strange suggestions is fixed, but it doesn't suggest anything for $CASE now )

Did you introduce $CASE in a blacklist? )

I think suggestions on all COS commands and functions is a good option for the search field (if possible of course).

 

Hi, Evgeny!
Yes, I agree with you about COS commands in a dropdown list of a search field.
I had some problems with COS commands and functions. But now I fixed it:

Hi Konstantin,

Can we install this project on Cache 2016.2 or does it need 2017 ?

I tried to install offline (because my server cannot get through to GITHUB(443)) and the installation failed with several errors.

Maybe I need more specific instructions for offline install ?

Uri

Hi, Constantin!

When I search documentation with your online tool what is the version of documentation it works with?

Would you please add the version of the product in the results or somewhere?

Thanks in advance!

Thanks, Konstantin!

And here is the link to the demo.

Do you want to add an option to share the search? E.g. introduce some share results button in UI which would provide an URL with added search option in URL? It would be very handy if you want to share search results with a colleague.

Good day,

I would very much like to install this example on my local instance. However, I cannot find installer.xml on the "corresponding page". Which is the "corresponding page", please? I downloaded the solution from GitHub, but there is no installer.xml there either. I would appreciate it if you could point me to the "corresponding page" where installer.xml is.

Thank you in advance.