Hi Benjamin,
The (patented) magic of iKnow is the way it identifies concepts in sentences. That happens in a library shipped as a binary, which we refer to as the iKnow engine and which is used by both the iKnow APIs and iFind indices. Most of what happens with that engine's output is not nearly as much rocket science and, as Eduard indicated, its COS source code can usually be consulted for clues on how it works, if you're adventurous.
The two options of the GetSimilar() query both work by looking at the top concepts of the reference source and then looking for other sources that contain them as well, using frequency and dominance, respectively, to weigh in-source relevance. So there is not much rocket science involved, and only full matches are supported at this point.
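To make the idea concrete, here is a minimal Python sketch of "top concepts of the reference, full matches in other sources". It is not the GetSimilar() implementation: the function names, the plain lists of concept strings and the overlap count used for ranking are all assumptions for illustration, and only the frequency flavour is shown.

```python
from collections import Counter


def top_concepts(concepts, k=3):
    """Return the k most frequent concepts of one source as a set."""
    return {concept for concept, _ in Counter(concepts).most_common(k)}


def similar_sources(reference, candidates, k=3):
    """Rank candidate sources by how many of the reference's top-k
    concepts they contain. Only exact (full) string matches count."""
    ref_top = top_concepts(reference, k)
    scores = {}
    for name, concepts in candidates.items():
        overlap = ref_top & set(concepts)
        if overlap:
            scores[name] = len(overlap)
    # Highest overlap first, ties broken by source name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A dominance-based variant would only change how top_concepts() picks its set, for example by reading a precomputed dominance score per concept instead of counting occurrences.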
That said, iKnow offers you the building blocks for much more advanced things, quite possibly inspired by your academic research, leveraging the concept level that is unique to iKnow in identifying what a text is really about. For example, you can build vectors containing entity frequency or dominance and look for cosine similarity in this vector space, or you can leverage topic modelling. Many of these approaches require quite a bit of computation, though, and the actual result quality may depend on the nature of the texts you're dealing with, which is why we chose to stick to very simple things in the kit for now.
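As a rough illustration of the vector idea, the Python sketch below computes cosine similarity between two sparse entity-frequency vectors. It assumes you have already exported the entities of each source (for example through the iKnow APIs) into plain dictionaries mapping entity to frequency; the function name and the data layout are hypothetical, not part of iKnow.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as
    {entity: weight} mappings (weights may be frequency or dominance)."""
    # Only entities present in both vectors contribute to the dot product.
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty source is similar to nothing
    return dot / (norm_a * norm_b)


# Example: build frequency vectors from lists of entity strings.
doc1 = Counter(["engine failure", "engine failure", "pilot"])
doc2 = Counter(["pilot", "runway"])
score = cosine_similarity(doc1, doc2)
```

Keeping the vectors sparse matters here, because the entity level easily yields a very large number of dimensions across a whole domain.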
However, you can find two (slightly) more advanced options in demos we have published online:
- The iKnow Investigator demo is part of your SAMPLES namespace and offers, in addition to the SourceAPI:GetSimilar() option, an implementation that builds a dictionary from your reference source, matches it against your target sources and looks for the ones with the highest aggregated match score, which accounts for partial matches as well.
- In the iFind Search Portal demo, the bottom of the record viewer popup displays a list of similar records as well. This list is populated through a bit of iFind juggling implemented in the Demo.SearchPortal.Utils class. It leverages full entity matches only, but is easy to extend to weigh in words as well.
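The dictionary-based approach from the first demo can be sketched in Python along these lines. This is not the demo's code: the word-overlap (Jaccard) score for partial matches, the function names and the plain lists of entity strings are all assumptions chosen to show how partial matches can contribute to an aggregated score.

```python
def entity_match_score(term, entity):
    """Score one dictionary term against one entity: 1.0 for a full
    match, a fraction for a partial match, 0.0 for no shared words.
    Uses Jaccard overlap on lowercased words (an assumed scoring rule)."""
    term_words = set(term.lower().split())
    entity_words = set(entity.lower().split())
    if not term_words or not entity_words:
        return 0.0
    return len(term_words & entity_words) / len(term_words | entity_words)


def aggregated_score(dictionary_terms, target_entities):
    """Sum, over all entities of a target source, the best match score
    against any term of the dictionary built from the reference source."""
    total = 0.0
    for entity in target_entities:
        total += max(
            (entity_match_score(term, entity) for term in dictionary_terms),
            default=0.0,
        )
    return total
```

Ranking the target sources by aggregated_score() then yields the similar-source list, with sources that only partially match the reference's entities still getting some credit.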
In both cases, there's a myriad of options to refine these algorithms, but all at a certain compute cost, given the high dimensionality introduced by the iKnow entity (and actually even word) level. If you have further ideas or, better yet, sample code that produces better lists of similar documents, we'd be thrilled to read about it here on the community ;o)
Thanks,
benjamin