Question Benjamin Eriksson · Mar 14, 2016

[Research] iKnow and algorithms.

#InterSystems Natural Language Processing (NLP, iKnow) #Analytics #Vector Search

Hello!

My group and I are currently doing a research project on natural language processing, and iKnow plays a big role in this project. I am aware that the algorithms iKnow uses aren't public, and I respect that.

My question is: are there any public documents or research that explain, at least in part, the algorithms iKnow uses and the motivations for using them?

Here is a concrete example: we are using GetSimilar() for many of our results and it works very well. As the documentation states, we can choose to look at entities, CRCs, or CCs, and choose between the algorithms SIMSRCSIMPLE and SIMSRCDOMENTS.

Is there any additional information about SIMSRCSIMPLE and SIMSRCDOMENTS other than the GetSimilar documentation?
How are partial matches handled? E.g. the CRCs "He is happy" and "She is happy". Will they get some points or none?

I apologize if this is the wrong place to ask, but I've gotten so much great feedback here before that I thought it was worth a try.

Thanks!

Discussion (2)

Eduard Lebedyuk · Mar 14, 2016

You can open this (or any) method in Studio and see its definition. There are rare exceptions: in the iKnow package, only the %iKnow.TextTransformation.HeaderRepositorySetArray and %iKnow.TextTransformation.KeyRepositorySetArray classes are not available. It's the best way to get an idea of how a method works, and the code usually even has comments.

Scraped from GetSimilar():

Select the most probably relevant terms in the source (top N)
Select all sources containing at least one of these N target elements
Sort these candidates by the number of target elements they share with the reference documents (approximate score)
Of these sources, calculate the actual similarity score for the top M sources with the best approximate score
Now store the page window in the final result PPG
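
To make the flow of those steps easier to follow, here is a minimal Python sketch of the same two-stage idea: pick the top N terms of the reference source, collect candidates that share at least one of them, rank candidates by the number of shared terms, and compute a more precise score only for the top M. This is not the actual GetSimilar() code. The function name `get_similar`, the parameters `top_n` and `top_m`, the use of plain frequency for "relevance", and the overlap-based final score are all assumptions made for illustration. Paging of the results is left out.

```python
from collections import Counter


def get_similar(reference_id, sources, top_n=3, top_m=2):
    """Illustrative two-stage similarity search (hypothetical helper).

    sources maps a source id to the list of entity strings found in it.
    Returns a list of (source id, score) pairs, best score first.
    """
    ref_counts = Counter(sources[reference_id])

    # Step 1: top N terms of the reference source.
    # Assumption: relevance is plain frequency; ties are broken alphabetically.
    ranked = sorted(ref_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_terms = [term for term, _ in ranked[:top_n]]
    top_set = set(top_terms)

    # Steps 2 and 3: keep sources sharing at least one top term and use the
    # number of shared terms as a cheap approximate score.
    approx = {}
    for sid, entities in sources.items():
        if sid == reference_id:
            continue
        shared = top_set & set(entities)
        if shared:
            approx[sid] = len(shared)

    # Step 4: compute the more expensive score only for the best M candidates.
    shortlist = sorted(approx, key=lambda sid: (-approx[sid], sid))[:top_m]
    ref_total = sum(ref_counts[t] for t in top_terms)
    scored = []
    for sid in shortlist:
        counts = Counter(sources[sid])
        # Assumption: score is the frequency overlap on the top terms,
        # normalised by the reference source's total for those terms.
        overlap = sum(min(ref_counts[t], counts[t]) for t in top_terms)
        scored.append((sid, overlap / ref_total))

    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


example_sources = {
    1: ["happy customer", "refund", "refund", "delivery"],
    2: ["refund", "delivery"],
    3: ["weather"],
    4: ["refund", "refund", "happy customer"],
}
similar_to_1 = get_similar(1, example_sources)
```

The point of the approximate score is to avoid computing the full score for every candidate, which matters once the domain holds many sources.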

Benjamin De Boe · Jan 5, 2017

Hi Benjamin,

The (patented) magic of iKnow lies in how it identifies concepts in sentences. This happens in a library shipped as a binary, which we refer to as the iKnow engine and which is used by both the iKnow APIs and iFind indices. Most of what happens with that engine's output is not nearly as much rocket science and, as Eduard indicated, its COS source code can usually be consulted for clues on how it works, if you're adventurous.

The two options of the GetSimilar() query both work by looking at the top concepts of the reference source and looking for other sources that contain them as well. They differ in how they weight in-source relevance: frequency for SIMSRCSIMPLE and dominance for SIMSRCDOMENTS. So there is not much rocket science here, and only full matches are supported at this point.

That said, iKnow offers you the building blocks to build much more advanced things, quite possibly inspired by your academic research, leveraging the concept level that is unique to iKnow in identifying what a text is really about. For example, you can build vectors containing entity frequency or dominance and look for cosine similarity in this vector space, or you can leverage topic modelling. Many of these approaches require quite a bit of computation, and the actual result quality may depend on the nature of the texts you're dealing with, which is why we chose to stick to very simple things in the kit for now.
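
As a rough illustration of the vector idea, the Python sketch below treats each source as a sparse vector of entity frequencies and compares two sources by cosine similarity. It does not call any iKnow API. The helper names `entity_vector` and `cosine_similarity` are hypothetical, and the entity lists are assumed to have been extracted beforehand. Dominance values could be used as weights instead of counts.

```python
import math
from collections import Counter


def entity_vector(entities):
    """Build a sparse frequency vector (entity -> count) for one source."""
    return Counter(entities)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    # Only entities present in both vectors contribute to the dot product.
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty source is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = entity_vector(["refund", "refund", "delivery"])
doc_b = entity_vector(["refund", "delivery", "delivery"])
doc_c = entity_vector(["weather"])

sim_ab = cosine_similarity(doc_a, doc_b)  # shares both entities
sim_ac = cosine_similarity(doc_a, doc_c)  # shares nothing
```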

However, you can find two (slightly) more advanced options in demos we have published online:

The iKnow Investigator demo is part of your SAMPLES namespace and offers, alongside the SourceAPI:GetSimilar() option, an implementation that builds a dictionary from your reference source, matches it against your target sources and looks for the ones with the highest aggregated match score, which accounts for partial matches as well.
In the iFind Search Portal demo, the bottom of the record viewer popup displays a list of similar records as well. This one is populated based on a bit of iFind juggling implemented in the Demo.SearchPortal.Utils class. It leverages full entity matches only, but it is easy to extend to weigh in words as well.

In both cases, there's a myriad of options to refine these algorithms, but all at a certain compute cost, given the high dimensionality introduced by the iKnow entity (and actually even word) level. If you have further ideas or, better yet, sample code to achieve better similar document lists, we'd be thrilled to read about it here on the community ;o)
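
To give a feel for how an aggregated match score can reward partial matches, such as "He is happy" versus "She is happy", here is a small Python sketch. It is not the Investigator demo's implementation, whose scoring is not described in this thread. The functions `match_score` and `aggregated_score` and the word-overlap formula are assumptions chosen purely for illustration.

```python
def match_score(term, entity):
    """Fraction of the dictionary term's words that also occur in the entity."""
    term_words = set(term.lower().split())
    entity_words = set(entity.lower().split())
    if not term_words:
        return 0.0
    return len(term_words & entity_words) / len(term_words)


def aggregated_score(reference_entities, target_entities):
    """Sum, over each distinct reference term, of its best match in the target."""
    total = 0.0
    for term in set(reference_entities):
        # A full match contributes 1.0, a partial match something in between.
        total += max((match_score(term, e) for e in target_entities), default=0.0)
    return total


partial = match_score("he is happy", "she is happy")
score = aggregated_score(
    ["happy customer", "refund"],    # entities of the reference source
    ["customer", "refund policy"],   # entities of a target source
)
```

With this scoring, "He is happy" and "She is happy" share two of three words and get a partial score instead of zero. Comparing every term against every target entity is also where the compute cost mentioned above comes from.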

Thanks,
benjamin