Hi Steve,

hadn't seen this question until just now, but I have to admit we're a bit storage-hungry with iKnow. If you generate the default full set of indices, for a moderately sized domain you'll need up to 25x the size of the original dataset, measured as raw text, to fit everything. This can drop to roughly half that (12x) if you forgo all non-essential indices, but that will prevent a number of queries from running smoothly or, in some cases, disable them completely.

For iFind, the numbers depend on the type of index. Count on factors of 2x, 7x and 15x for Basic, Semantic and Analytic indices, respectively. Of course there's a difference in functionality between all these options, so it's best to start from a set of functional requirements and then look at which particular approach covers those.
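
To make that concrete with a purely illustrative calculation: for 1 GB of raw text, count on up to 25 GB for the full iKnow index set (or around 12 GB in the minimal setup), and roughly 2 GB, 7 GB or 15 GB for an iFind Basic, Semantic or Analytic index, respectively.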

These numbers are somewhat conservative maximums and, as Eduard already suggested, you may see different (lower) numbers depending on the nature of your data. A more detailed sizing guide is available on request.

Thanks,
benjamin

Thanks John,

indeed, you'd need a proper license in order to work with iKnow. If the method referred to above returns 0, please contact your sales representative to request a temporary trial license and appropriate assistance for implementing your use case.

Also, iKnow doesn't come as a separate namespace. You can create (regular) namespaces as you prefer and use them to store iKnow domain data. You may need to enable your web application for iKnow, which is disabled by default for security reasons, in the same way DeepSee is. See this paragraph for more details.
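
If you prefer to flip that switch in code rather than through the Management Portal, the sketch below is roughly what it takes. Treat it as an assumption-laden sketch: the iKnowEnabled property name on Security.Applications is from memory, so verify it against the class reference in the %SYS namespace.

// run this in the %SYS namespace
zn "%SYS"
// fetch the current settings of the web application ("/csp/myapp" is just an example path)
set sc = ##class(Security.Applications).Get("/csp/myapp", .props)
if 'sc { do $system.Status.DisplayError(sc) }
// enable iKnow for this web application (property name assumed; check the class ref)
set props("iKnowEnabled") = 1
set sc = ##class(Security.Applications).Modify("/csp/myapp", .props)
if 'sc { do $system.Status.DisplayError(sc) }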

Hi Benjamin,

The (patented) magic of iKnow is the way it identifies concepts in sentences. That part happens in a library shipped as a binary, which we refer to as the iKnow engine and which is used by both the iKnow APIs and iFind indices. Most of what happens with that engine's output is not nearly as much rocket science and, as Eduard indicated, its COS source code can usually be consulted for clues on how it works, if you're adventurous.

The two options of the GetSimilar() query both work by looking at the top concepts of the reference source and then looking for other sources that contain them as well, using frequency and dominance, respectively, to weight in-source relevance. So not much rocket science, and only support for full matches at this point.

This said, iKnow offers you the building blocks to build much more advanced things, quite possibly inspired by your academic research, leveraging the concept level that is unique to iKnow in identifying what a text is really about. For example, you can build vectors containing entity frequency or dominance and look for cosine similarity in this vector space, or you can leverage topic modelling. Many of these techniques require quite a bit of computation, though, and actual result quality may depend on the nature of the texts you're dealing with, which is why we chose to stick to very simple things in the kit for now.
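
To illustrate that first idea, here's a rough sketch of entity-frequency cosine similarity between two sources, built on the SQL projections. The column names (EntUniId, SourceId, Frequency) are assumptions from memory, so verify them against the %iKnow.Objects.EntityInSourceDetails class reference for your version.

ClassMethod GetCosineSimilarity(domainId As %Integer, src1 As %Integer, src2 As %Integer) As %Double
{
	// load the entity frequency vector of each source into a local array
	// (column names are assumptions; check the class reference for your version)
	set sql = "SELECT EntUniId, Frequency FROM %iKnow_Objects.EntityInSourceDetails WHERE DomainId = ? AND SourceId = ?"
	set stmt = ##class(%SQL.Statement).%New()
	do stmt.%Prepare(sql)
	for srcId = src1, src2 {
		set rs = stmt.%Execute(domainId, srcId)
		while rs.%Next() {
			set vec(srcId, rs.%GetData(1)) = rs.%GetData(2)
		}
	}
	// cosine similarity = dot(v1,v2) / (|v1| * |v2|)
	set (dot, norm1, norm2) = 0
	set ent = ""
	for {
		set ent = $order(vec(src1, ent), 1, f1)  quit:ent=""
		set norm1 = norm1 + (f1 * f1)
		if $data(vec(src2, ent), f2) set dot = dot + (f1 * f2)
	}
	set ent = ""
	for {
		set ent = $order(vec(src2, ent), 1, f2)  quit:ent=""
		set norm2 = norm2 + (f2 * f2)
	}
	quit $select(dot=0:0, 1:dot / $zsqr(norm1 * norm2))
}

Replacing Frequency by Dominance gives you the dominance-weighted variant at no extra cost.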

However, you can find two (slightly) more advanced options in demos we have published online:

  • The iKnow Investigator demo is part of your SAMPLES namespace and offers, in addition to the SourceAPI:GetSimilar() option, an implementation that builds a dictionary from your reference source, matches it against your target sources and looks for the ones with the highest aggregated match score, which accounts for partial matches as well.
  • In the iFind Search Portal demo, the bottom of the record viewer popup displays a list of similar records as well. This one is populated based on a bit of iFind juggling implemented in the Demo.SearchPortal.Utils class, leveraging full entity matches only, but it is easy to extend to weigh in words as well.

In both cases, there's a myriad of options to refine these algorithms, but all at a certain compute cost, given the high dimensionality introduced by the iKnow entity (and actually even word) level. If you have further ideas or, better yet, sample code to achieve better similar document lists, we'd be thrilled to read about it here on the community ;o)

Thanks,
benjamin

Hi Edward,

the thing that comes closest here would be the %iKnow.Queries.SourceAPI:GetSimilar() query, which, for a given seed document, looks for the most similar ones in a domain, optionally constrained by a filter object. The results of that query include a figure like the one you're looking for, expressing how many entities in the seed document were new compared to the corpus it's matched against. Although that particular calculation isn't available as an atomic function, a simple way to get to what you want would be to use the %iKnow.Filters.SourceIdFilter and just compare against an individual document.
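
A minimal sketch of that approach follows. The argument order of GetSimilar() and the constructor arguments of SourceIdFilter are from memory, so double-check them against the class reference for your version.

// compare the seed document against one specific other document only
set filter = ##class(%iKnow.Filters.SourceIdFilter).%New(domainId, otherSrcId)
do ##class(%iKnow.Queries.SourceAPI).GetSimilar(.result, domainId, seedSrcId, 1, 10, filter)
// each result row is a $list; dump them to inspect the overlap/novelty figures
set i = ""
for {
	set i = $order(result(i), 1, row)  quit:i=""
	write $listtostring(row), !
}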

If you prefer to write more code :o), you can just look up the entities in the one document and compare them against those in the others through the %iKnow.Objects.EntityInSourceDetails SQL projection.
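
For example, a self-join on that projection along these lines would list the entities two documents have in common (column names again to be verified against the class reference for your version):

SELECT e1.EntUniId
FROM %iKnow_Objects.EntityInSourceDetails e1
JOIN %iKnow_Objects.EntityInSourceDetails e2
  ON e2.DomainId = e1.DomainId AND e2.EntUniId = e1.EntUniId
WHERE e1.DomainId = ? AND e1.SourceId = ? AND e2.SourceId = ?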

Regards,

benjamin

Hi Terri,

[a little late perhaps (and we spoke briefly in Phoenix as well), but for the sake of completeness, here's your response]

The iKnow Architect page only shows domains based on domain definitions, i.e. subclasses of %iKnow.DomainDefinition. Domains created "the old way" (programmatically, through the %iKnow.Domain class) don't carry the declarative information about where to load data from, hence the majority of the Architect GUI wouldn't be available for those.
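
For reference, a minimal domain definition looks something like this. It's only a sketch: the <files> attributes are from memory, so check the %iKnow.Model documentation for the exact element and attribute names.

Class Demo.MyDomain Extends %iKnow.DomainDefinition
{

XData Domain [ XMLNamespace = "http://www.intersystems.com/iknow" ]
{
<domain name="MyDomain">
<data dropBeforeBuild="true">
<!-- attributes below are illustrative; see the class documentation for the full list -->
<files path="C:\data\texts\" recursive="true" extensions="txt" />
</data>
</domain>
}

}

Because all the loading information is declared in the XData block, the Architect can display it, and building the domain boils down to calling ##class(Demo.MyDomain).%Build().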

As to the metadata loading question: I'm not sure which part of the script you're referring to was taking care of that metadata loading, but maybe this comes back to your question in a separate thread.

regards,

benjamin

Hi Orion,

I haven't heard of index data simply disappearing. There should be no reason for that other than calls to %PurgeIndices() or manually dropping the globals containing the data (including the ^ISC.IF* ones containing entities and words shared across the namespace). Another potential issue may arise when importing (at the global level) only the index data or only the shared data, leaving the two out of sync. Any chance any of that could have happened?

As for the missing class, that might be due to a class import as well, after which not all related classes were recompiled to reflect the changes.

Which version are you working on? Some of those recompiling issues may have been addressed in recent versions.

Perhaps the WRC is a better place to get the appropriate follow-up for specific issues like this one. 

regards,
benjamin

Hi Benjamin,

the default algorithm indeed won't return scores for each record, but will only make the calculation for records that contain at least a decent number of entities that are relevant in the source document. You can simply approximate the other documents' scores by taking 0.

For your specific use case, you may want to take a look at the text categorization infrastructure. I've posted a tutorial on the topic here.

regards,
benjamin

Hi Jack,

there's no need to normalize your search strings, as that's taken care of automatically as part of executing your search, when appropriate.

When you use DELETE FROM in SQL, or ##class(Your.Table).%DeleteExtent() in COS, the associated iFind indices' data will be erased as well. To drop just the index data, use ##class(Your.Table).%PurgeIndices() (cf the class reference for refinements). Note that, unless you are using index-local storage (a new feature in 2016.1), the words and entities tables will not be wiped, as they are shared between all iFind indices in your namespace (somewhat conserving space and indexing efficiency).
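
In code, that boils down to the following (Your.Table and MyIFindIndex are placeholders, and the $list argument to %PurgeIndices() is optional; omit it to purge all indices):

// wipe all records, along with all index data (including iFind indices)
do ##class(Your.Table).%DeleteExtent()

// or: drop only the data of a specific index, keeping the records themselves
do ##class(Your.Table).%PurgeIndices($listbuild("MyIFindIndex"))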

iFind can calculate a score representing how well a record satisfies a search string, largely based on TF-IDF (although it'll leverage the more refined dominance scores for entities when it can). This is also new in 2016.1. See https://community.intersystems.com/code/ifind-search-portal for an example.

regards,
benjamin

Hi Benjamin (sounds like a conversation amongst just Benjamins now!),

The knowledge portal demonstration interface you find in the %iKnow.UI package (which gets a significant visual overhaul in 2016.3) is written using InterSystems' Zen technology, a web development framework that helps you combine client-side JavaScript and server-side Caché ObjectScript to build web applications. If you're good with PHP and/or JavaScript, there's no strict need to dig into Zen to build an iKnow-powered application. You can either use ODBC to connect to Caché and use SQL as in the above examples to query an iKnow domain, or you can build a simple REST service on top of iKnow (in Caché ObjectScript) and query that from your PHP/JavaScript code. We'll be releasing an out-of-the-box REST interface with 2016.3, but it's no rocket science to build one that fits your needs on earlier versions. If you already have an ISC sales engineering contact (none of them called Benjamin, unfortunately ;o) ), we can work together to get you up and running.
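
To give you an idea, here's a very minimal sketch of such a REST service, exposing the top entities of a domain as JSON. %CSP.REST and EntityAPI:GetTop() are real, but the layout of the result rows is from memory ($lb of entity ID, entity value, frequency and spread), and $zconvert's JSON output mode assumes a sufficiently recent Caché version, so verify both against the class reference.

Class Demo.iKnowREST Extends %CSP.REST
{

XData UrlMap
{
<Routes>
<Route Url="/domain/:domainId/top" Method="GET" Call="GetTop" />
</Routes>
}

ClassMethod GetTop(domainId As %Integer) As %Status
{
	set %response.ContentType = "application/json"
	do ##class(%iKnow.Queries.EntityAPI).GetTop(.result, domainId, 1, 10)
	set (sep, i) = ""
	write "["
	for {
		set i = $order(result(i), 1, row)  quit:i=""
		// row layout assumed: $lb(entUniId, entity, frequency, spread)
		write sep, "{""entity"":""", $zconvert($listget(row, 2), "O", "JSON"), """"
		write ",""frequency"":", +$listget(row, 3), "}"
		set sep = ","
	}
	write "]"
	quit $$$OK
}

}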

FYI, this github repo contains a simple iKnow demo application written with AngularJS and a REST interface. It's technically speaking a CSP page (yet another ISC web technology at a lower level than Zen), but could have been a straight HTML page.

Regards,

Benjamin

Hi Jack,

you can enable stemming by setting the INDEXOPTION index parameter to 1 (or by leveraging the more flexible TRANSFORMATIONSPEC index parameter if you are on 2016.1 or later).

Class ThePackage.MyClass Extends %Persistent
{
	Property MyStringProperty As %String;
	
	Index MyBasicIndex On (MyStringProperty) As %iFind.Index.Basic(INDEXOPTION=1);
}

The class reference for %iFind.Index.Basic also explains how you can toggle between stemmed and normal search by using the search mode argument:

SELECT * FROM ThePackage.MyClass WHERE %ID %FIND search_index(MyBasicIndex, 'interesting')

for normal search, versus passing search option 1 for stemmed search:

SELECT * FROM ThePackage.MyClass WHERE %ID %FIND search_index(MyBasicIndex, 'interesting', 1)

We do not discard stop words in iFind, in order to ensure you can query for any literal word sequence afterwards. If you start looking at the projections for entities (cf the %iFind.Index.Analytic class reference), you'll see how iKnow offers you a more insightful view of what a sentence is about through the "entity" level, where classic search tech may only offer you the words of a sentence minus the stop words.

regards,

benjamin