How do I calculate the difference between two texts using iKnow?

Question

Eduard Lebedyuk · Dec 21, 2016

#InterSystems Natural Language Processing (NLP, iKnow)

I'm in a process of acquiring a corpus of documents on educational courses.

For example there is an educational course called "OOP" and it can have documents from 2008, 2009, ... 2016 etc.
There are a lot of these courses, each one (hopefully) with programs from different years.

So one document is one program of one course for one year.

I want to calculate how much a course changes from year to year.

Here's an example of information I want to get:

Can I get it via iKnow? How?

Benjamin De Boe · Dec 22, 2016

Hi Edward,

The thing that comes closest here is the %iKnow.Queries.SourceAPI:GetSimilar() query, which, for a given seed document, looks for the most similar ones in a domain, optionally constrained by a filter object. The results of that query include a figure like the one you're looking for, expressing how many entities were new in the seed document vs. the corpus it's comparing against. Although that particular calculation isn't available as an atomic function, a simple way to get what you want would be to use the %iKnow.Filters.SourceIdFilter and just compare against an individual document.

If you prefer to write more code :o), you can just look up the entities in the one document and compare them against those in the others through the %iKnow.Objects.EntityInSourceDetails SQL projection.
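To illustrate the "compare the entities yourself" idea, here is a minimal sketch in Python rather than ObjectScript. It assumes the entities of each yearly program have already been extracted into plain sets (for example from the SQL projection mentioned above); the function name `entity_change` and the sample data are hypothetical and not part of iKnow.

```python
def entity_change(previous, current):
    """Compare two collections of entities and report what changed."""
    previous, current = set(previous), set(current)
    added = current - previous      # entities only in the newer program
    removed = previous - current    # entities only in the older program
    return {
        # share of the newer program's entities that are new
        "share_new": len(added) / len(current) if current else 0.0,
        # share of the older program's entities that disappeared
        "share_dropped": len(removed) / len(previous) if previous else 0.0,
        "added": sorted(added),
        "removed": sorted(removed),
    }


# Hypothetical entity sets for one course, keyed by year.
programs = {
    2014: {"classes", "inheritance", "polymorphism", "pascal"},
    2015: {"classes", "inheritance", "polymorphism", "java"},
    2016: {"classes", "inheritance", "java", "design patterns", "unit testing"},
}

# Compare each year with the one before it.
years = sorted(programs)
changes = {}
for prev_year, year in zip(years, years[1:]):
    changes[year] = entity_change(programs[prev_year], programs[year])
    print(year, changes[year]["share_new"], changes[year]["added"])
```

Comparing consecutive years gives one "how much changed" figure per year, which is the shape of result the question asks for.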

Regards,

benjamin

score 0 · Answer 1 · 2017-01-08T07:15:00-05:00

Hello.

Thank you for this information. I started testing it, and %iKnow.Queries.SourceAPI:GetSimilar() returned the following in the result local:

result(1)=$lb(890,":SQL:2002:20020308X00320",.4737,.9606,57,27,686,.4737)

The list is formed from these values:

$lb(srcId, externalId, percentageMatched, percentageNew, nbOfTgtsInRefSrc, nbOfTgtsInCommon, nbOfTgtsInSimSrc, matchScore)

What does that mean?

srcId - source ID of the similar document
externalId - external source ID of the similar document
percentageMatched - number of targets common to the source and similar documents, divided by the number of targets in the source document
percentageNew - number of targets in the similar document that are not present in the source document, divided by the total number of targets in the similar document
nbOfTgtsInRefSrc - number of targets in the source document
nbOfTgtsInCommon - number of targets common to the source and similar documents
nbOfTgtsInSimSrc - number of targets in the similar document
matchScore - seems to be equal to percentageMatched

Is that correct? Is there documentation on that?
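As a sanity check on these definitions, the two percentages in the sample row can be reproduced from the three counts it contains. This is a plain Python sketch of the arithmetic, not iKnow code; the function name `similarity_figures` is hypothetical.

```python
def similarity_figures(n_ref, n_common, n_sim):
    """Recompute the two percentages from the three target counts.

    n_ref    - nbOfTgtsInRefSrc (targets in the source document)
    n_common - nbOfTgtsInCommon (targets shared by both documents)
    n_sim    - nbOfTgtsInSimSrc (targets in the similar document)
    """
    percentage_matched = n_common / n_ref
    percentage_new = (n_sim - n_common) / n_sim
    return round(percentage_matched, 4), round(percentage_new, 4)


# Counts taken from result(1) above: 57, 27, 686.
matched, new = similarity_figures(n_ref=57, n_common=27, n_sim=686)
print(matched, new)  # prints: 0.4737 0.9606
```

Both values match the .4737 and .9606 in the sample row, which is consistent with the interpretation listed above.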

score 1 · Answer 2 · 2017-01-09T03:39:59-05:00

you are entirely correct.

The separate MatchScore column is there to accommodate methods where the score is more refined than the pure count-based one used with $$$SIMSRCSIMPLE. With $$$SIMSRCDOMENTS, dominance is accounted for in this metric, and you'll see that it differs from percentageMatched.

score 0 · Answer 3 · 2017-01-09T13:31:00-05:00

you are entirely correct.

Good to hear.

$$$SIMSRCDOMENTS

If I change the algorithm to $$$SIMSRCDOMENTS, I don't receive any results (the Results local is undefined).

If I choose $$$SIMSRCEQUIVS or $$$SIMSRCSIMPLE I get Results local as expected.

What may be the reason? I didn't modify the domain between runs.

The method returns $$$OK and %objlasterror is empty with any algorithm.

score 1 · Answer 4 · 2017-01-10T08:01:11-05:00

The $$$SIMSRCDOMENTS algorithm is much more restrictive and may not yield any results if your domain is small and the sources are too far apart. I see results when trying it on the Aviation demo dataset. Note that you can loosen it by setting the "strict" parameter to 0, as described in the class reference.

That third alternative you quoted has been deprecated and does not add anything to the regular $$$SIMSRCSIMPLE option. You dug too deep in the code ;o)

Regards,
benjamin

score 0 · Answer 5 · 2021-06-24T08:06:00-04:00

Eduard Lebedyuk · Jun 24, 2021

Is there a way to do the same but for the query text and documents, and not document and documents?

score 0 · Answer 6 · 2016-12-22T05:00:39-05:00

Alexander Koblov · Dec 22, 2016

I would look into some kind of string distance measures. For example, Levenshtein distance.
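For reference, a minimal Python sketch of Levenshtein distance using the standard dynamic-programming approach. The helper `normalized_change`, which scales the distance by the longer input to get a value between 0 and 1, is an assumption added for illustration and not something suggested in the thread.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_change(a, b):
    """Distance scaled to 0..1 by the length of the longer input."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # prints: 3
# Works on word lists too, which may suit long documents better.
print(levenshtein("intro to oop".split(), "intro to java".split()))  # prints: 1
```

Note that this measures character-level or word-level edits, so it reflects wording changes rather than changes in the concepts a program covers.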
