How do I calculate the difference between two texts using iKnow?
I'm in the process of acquiring a corpus of documents on educational courses.
For example, there is an educational course called "OOP", and it can have documents from 2008, 2009, ... 2016, etc.
There are a lot of these courses, each one with programs from different years (hopefully).
So one document is one program of one course for one year.
I want to calculate how much a course changes per year.
Here's an example of information I want to get:
Can I get it via iKnow? How?
Hi Edward,
the thing that comes closest here would be the %iKnow.Queries.SourceAPI:GetSimilar() query, which, for a given seed document, looks for the most similar ones in a domain, optionally constrained by a filter object. The results of that query include a figure like the one you're looking for, expressing how many entities were new in the seed document vs. the corpus it's being compared against. Although that particular calculation isn't available as an atomic function, a simple way to get what you want would be to use the %iKnow.Filters.SourceIdFilter and just compare against an individual document.
If you prefer to write more code :o), you can just look up the entities in the one document and compare them against those in the others through the %iKnow.Objects.EntityInSourceDetails SQL projection.
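To make the entity-comparison idea concrete, here is a minimal Python sketch of the underlying arithmetic. It is not iKnow code and does not reproduce what GetSimilar() calculates internally: it assumes you have already pulled each document's entities out into plain lists (for example via the SQL projection above), and the function name, dictionary keys, and sample entities are all made up for illustration.

```python
def entity_change(seed_entities, reference_entities):
    """Compare the distinct entities of a seed document with a reference one.

    Hypothetical helper: reports how many of the seed's entities also occur
    in the reference document and how many are new.
    """
    seed = set(seed_entities)
    reference = set(reference_entities)
    if not seed:
        # An empty seed document has nothing to compare.
        return {"matched": 0, "new": 0,
                "percentage_matched": 0.0, "percentage_new": 0.0}
    matched = seed & reference   # entities present in both documents
    new = seed - reference       # entities only present in the seed
    return {
        "matched": len(matched),
        "new": len(new),
        "percentage_matched": len(matched) / len(seed),
        "percentage_new": len(new) / len(seed),
    }


# Made-up entity lists for two years of the same course program.
oop_2015 = ["object", "class", "inheritance", "polymorphism"]
oop_2016 = ["object", "class", "inheritance", "design pattern", "unit test"]

result = entity_change(oop_2016, oop_2015)
print(result)
```

Running this for each pair of consecutive years would give a per-year change figure for a course. This is a pure count-based measure; it ignores entity frequency and dominance.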
Regards,
benjamin
Hello.
Thank you for this information. I started testing it and %iKnow.Queries.SourceAPI:GetSimilar() returned the following as a result local:
The list is formed from these values:
What does that mean?
Is that correct? Is there documentation on that?
you are entirely correct.
The separate MatchScore column is to accommodate methods where the score is more refined than the pure count-based one with $$$SIMSRCSIMPLE. With $$$SIMSRCDOMENTS, dominance is accounted for in this metric and you'll see it'll differ from percentageMatched.
Good to hear.
If I change the algorithm to $$$SIMSRCDOMENTS, I don't receive any results (the Results local is undefined).
If I choose $$$SIMSRCEQUIVS or $$$SIMSRCSIMPLE I get Results local as expected.
What may be the reason? I didn't modify the domain between runs.
The method returns $$$OK and %objlasterror is empty with any algorithm.
The $$$SIMSRCDOMENTS algorithm is much more restrictive and may not yield any results if your domain is small and the sources are too far apart. I see results when trying it on the Aviation demo dataset. Note that you can loosen it by setting the "strict" parameter to 0, as described in the class reference.
That third alternative you quoted has been deprecated and does not add anything to the regular $$$SIMSRCSIMPLE option. You dug too deep in the code ;o)
Regards,
benjamin
Is there a way to do the same, but comparing a query text against documents rather than a document against documents?
I would look into some kind of string distance measure, for example Levenshtein distance.
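As a rough illustration of that suggestion, here is a minimal Python sketch of Levenshtein distance using the standard dynamic-programming approach. It is independent of iKnow; the function names are my own, and the normalization by the longer length is just one common convention, assumed here so that texts of different sizes can be compared.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Distance scaled to the range 0..1 by the longer sequence length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3
# The same function works on lists of words instead of characters.
print(levenshtein("intro to oop".split(), "intro to java".split()))  # 1
```

For whole course programs, comparing word lists rather than raw characters is usually cheaper and less sensitive to small spelling differences, though the quadratic cost still grows quickly with document length.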