Yes, that two-word feature, "executing COS", would probably be quite a step up. It was more a loose idea than something I've researched thoroughly, but maybe the authors of the Caché Web Terminal have some clues on how the connectivity should work (JDBC alone won't pull it off).

Nice article Andreas!

Have you perhaps also looked into creating a more advanced interpreter, rather than just leveraging the JDBC one? I know that's probably a significantly more elaborate thing to do, but a notebook-style interface for well-documented scripting would nicely complement the application development focus of Atelier / Studio.

Thanks,
benjamin

The $$$SIMSRCDOMENTS option is much more restrictive and may not yield any results if your domain is small and sources are too far apart. I do see results when trying it on the Aviation demo dataset. Note that you can loosen it by setting the "strict" parameter to 0, as described in the class reference.
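For example, a minimal sketch from within a class or MAC routine (the domain and source IDs are hypothetical, and passing strict=0 through the algorithmParams list is my reading of the class reference, so double-check the method signature in your version):

    #include %IKPublic
    // 10 most similar sources, using the dominant-entities algorithm
    // with strict mode switched off (strict = 0)
    set domainId = 1, srcId = 123
    set sc = ##class(%iKnow.Queries.SourceAPI).GetSimilar(.result, domainId, srcId, 1, 10, "", "", $$$SIMSRCDOMENTS, $listbuild(0))
    zwrite result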

That third alternative you quoted has been deprecated and does not add anything to the regular $$$SIMSRCSIMPLE option. You dug too deep into the code ;o)


Regards,
benjamin

You are entirely correct.

The separate MatchScore column is there to accommodate methods where the score is more refined than the pure count-based one of $$$SIMSRCSIMPLE. With $$$SIMSRCDOMENTS, dominance is accounted for in this metric, and you'll see it differ from percentageMatched.

Hi Benjamin,

The (patented) magic of iKnow is the way it identifies concepts in sentences; this happens in a library shipped as a binary, which we refer to as the iKnow engine, and which is used by both the iKnow APIs and iFind indices. Most of what happens with that engine's output is not nearly as much rocket science and, as Eduard indicated, its COS source code can usually be consulted for clues on how it works, if you're adventurous.

The two options of the GetSimilar() query both work by taking the top concepts of the reference source and looking for other sources that contain them as well, using frequency and dominance, respectively, to weigh in-source relevance. So not much rocket science, and only support for full matches at this point.

That said, iKnow offers you the building blocks to build much more advanced things, quite possibly inspired by your academic research, leveraging the concept level that is unique to iKnow in identifying what a text is really about. For example, you can build vectors containing entity frequency or dominance and look for cosine similarity in this vector space (see the sketch below), or you can leverage topic modelling. Many of these approaches will require quite a bit of computation, though, and actual result quality may depend a bit on the nature of the texts you're dealing with, which is why we chose to stick to very simple things in the kit for now.
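To make the cosine option concrete, here is a minimal sketch. How you populate the sparse vectors is up to you (for example from the %iKnow.Objects.EntityInSourceDetails SQL projection); the subscripts are entity IDs and the values are frequencies:

    /// Cosine similarity between two sparse entity-frequency vectors,
    /// passed by reference as vecA(entId)=freq and vecB(entId)=freq
    ClassMethod Cosine(ByRef vecA, ByRef vecB) As %Double
    {
        set (dot, normA, normB) = 0
        set entId = ""
        for {
            set entId = $order(vecA(entId), 1, freqA)  quit:entId=""
            set normA = normA + (freqA * freqA)
            // only entities present in both vectors contribute to the dot product
            set:$data(vecB(entId),freqB)#2 dot = dot + (freqA * freqB)
        }
        set entId = ""
        for {
            set entId = $order(vecB(entId), 1, freqB)  quit:entId=""
            set normB = normB + (freqB * freqB)
        }
        quit:(normA=0)||(normB=0) 0
        quit dot / ($zsqr(normA) * $zsqr(normB))
    }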

However, you can find two (slightly) more advanced options in demos we have published online:

  • The iKnow Investigator demo is part of your SAMPLES namespace and offers, next to the SourceAPI:GetSimilar() option, an implementation that builds a dictionary from your reference source, matches it against your target sources, and looks for the ones with the highest aggregated match score, which accounts for partial matches as well.
  • In the iFind Search Portal demo, the bottom of the record viewer popup displays a list of similar records as well. This one is populated based on a bit of iFind juggling implemented in the Demo.SearchPortal.Utils class, leveraging full entity matches only, but it is easy to extend to weigh in words as well.

In both cases, there's a myriad of options to refine these algorithms, but all come at a certain compute cost, given the high dimensionality introduced by the iKnow entity (and actually even word) level. If you have further ideas or, better yet, sample code to achieve better similar-document lists, we'd be thrilled to read about it here on the community ;o)

Thanks,
benjamin

Hi Eduard,

The thing that comes closest here would be the %iKnow.Queries.SourceAPI:GetSimilar() query, which, for a given seed document, looks for the most similar ones in a domain, optionally constrained by a filter object. The results of that query include a figure like the one you're looking for, expressing how many entities were new in the seed document vs the corpus it's comparing against. Although that particular calculation isn't available as an atomic function, a simple way to get to what you want would be to use the %iKnow.Filters.SourceIdFilter and just compare against an individual document.
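A quick sketch of that filter-based route (the domain and source IDs are hypothetical, and I'm assuming the SourceIdFilter constructor accepts a single source ID, per the class reference):

    // compare the seed document against one specific other document
    set domainId = 1, seedId = 123, otherId = 456
    set filter = ##class(%iKnow.Filters.SourceIdFilter).%New(domainId, otherId)
    set sc = ##class(%iKnow.Queries.SourceAPI).GetSimilar(.result, domainId, seedId, 1, 10, filter)
    zwrite result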

If you prefer to write more code :o), you can just look up the entities in the one document and compare them against those in the others through the %iKnow.Objects.EntityInSourceDetails SQL projection.
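Something along these lines could do (a sketch only: the table and column names follow my reading of the class reference for %iKnow.Objects.EntityInSourceDetails, so verify them in your instance):

    // entities occurring in the seed document but not in the other one
    set sql = "SELECT e.EntUniId FROM %iKnow_Objects.EntityInSourceDetails e "
    set sql = sql _ "WHERE e.DomainId = ? AND e.SourceId = ? AND e.EntUniId NOT IN "
    set sql = sql _ "(SELECT EntUniId FROM %iKnow_Objects.EntityInSourceDetails "
    set sql = sql _ "WHERE DomainId = ? AND SourceId = ?)"
    set rs = ##class(%SQL.Statement).%ExecDirect(, sql, domainId, seedId, domainId, otherId)
    while rs.%Next() { write rs.%Get("EntUniId"), ! }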

Regards,

benjamin

I sometimes like to call myself Kyle on Thursdays that feel like Mondays ;o) 

 

Like any predicate whose name starts with a %, these are InterSystems-specific extensions to the SQL standard, each added for a specific purpose. %CONTAINS has been part of Caché for a long while and indeed offers a simple means to refer to your %Text field directly. So you don't need to know the name of your index, but you do need to make sure your text field is of type %Text, so there's still a minimum of table metadata you need to be aware of.
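For example (a sketch with hypothetical table and column names; MyText would have to be of type %Text):

    // find documents whose %Text field contains both terms
    set rs = ##class(%SQL.Statement).%ExecDirect(, "SELECT ID, MyText FROM MyApp.Document WHERE MyText %CONTAINS ('interesting', 'term')")
    while rs.%Next() { write rs.%Get("ID"), ! }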

In the case of %FIND, we're actually leveraging the %SQL.AbstractFind infrastructure, a much more recent addition that allows plugging code-based filters into SQL; here, the code-based filter implementation is provided by the %iFind infrastructure. Because this AbstractFind interface is, what's in a name, abstract, it needs a bit of direction to wire it to the right implementation, which in our case comes in the form of the index name. And as the AbstractFind interface is expected to return a bitmap (for fast filtering), it also needs to be applied to an integer field, typically the IDKey of your class. So, while it offers a lot of programmatic flexibility to implement optimized filtering code (cf. this article on spatial indexing), it does require this possibly odd-looking construct on the SQL side. But that's easily hidden behind a user interface, of course.
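Schematically, that construct looks like this (a hypothetical class MyApp.Document with an iFind index named TextIdx on the MyText field):

    // %FIND takes the index name to wire up the right AbstractFind
    // implementation, and is applied to the integer IDKey (%ID)
    set rs = ##class(%SQL.Statement).%ExecDirect(, "SELECT ID FROM MyApp.Document WHERE %ID %FIND search_index(TextIdx, 'cocktail')")
    while rs.%Next() { write rs.%Get("ID"), ! }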

A bit of a long read, but a very nice illustration of how iKnow's bottom-up approach allows you to work with the full concepts as coined by the author, rather than a top-down approach relying on predefined lists of terms. If this is only their bachelor thesis, I'm looking forward to seeing their master's thesis :-)

Thanks for sharing Otto!

What is the attribute of your texts that you want to categorize on? When building TC models on top of iKnow domains, as Chip indicated, the easiest way is to do this based on a metadata field, using the %LoadMetadataCategories method on your builder object. The more manual method is using those %AddCategory calls you're using, but they require a filter specification as the second argument (you passed "") to identify which records correspond to the category you're adding. That's what makes it a training set: a set of texts for which the outcome (category) is known, so the TC infrastructure can try to find correlations to exploit.
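Schematically, the two routes look like this (the domain, field, and category names are hypothetical, and the method names follow my reading of the %iKnow.Classification class reference):

    // build a TC model on an existing iKnow domain, using the metadata
    // field "Category" as the known outcome for the training set
    set builder = ##class(%iKnow.Classification.IKnowBuilder).%New("MyDomain")
    set sc = builder.%LoadMetadataCategories("Category")
    // the manual alternative: a category name plus a filter selecting
    // the training records that belong to it
    // set filter = ##class(%iKnow.Filters.SimpleMetadataFilter).%New(domainId, "Category", "=", "Hardware")
    // set sc = builder.%AddCategory("Hardware", filter)
    do builder.%PopulateTerms(100)  // pick candidate terms automatically
    set sc = builder.%CreateClassifierClass("My.Classifier")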

Separately, 11 records is a very small training set, and I would not expect to find much statistically relevant information to exploit there. Even when you're building a rule-based model manually, it'll barely be enough to validate your model. So it's probably worth trying to get hold of (much) more data to start with.

I've uploaded a tutorial we prepared for a Global Summit academy session on the subject in a separate thread (I could not attach it to an existing one). That should give you a better idea of what the infrastructure can be used for, and it also takes you through the GUI, which may be more practical than the code-based approach you're referring to.