Mar 22, 2016

GetSimilar only returns some terms, what about the others?


We use iKnow's GetSimilar for decision making. Right now we have a domain containing both good and bad documents, and we use GetSimilar to see whether a given document is more similar to the good ones or to the bad ones. To do this, we simply compare the weighted average of the scores GetSimilar returns for the good documents with the weighted average for the bad documents.

The problem is that GetSimilar doesn't always return a score for every other document. Assuming we have 50 documents, I would expect the following result:

DO ##class(%iKnow.Queries.SourceAPI).GetSimilar(.sim,domId,id,1,200,"",$$$SIMSRCSIMPLE, $LB("ent"))

sim(1)=$lb(1,":FILE:c:\file2.txt", 0.4239, 0.8615, 184, 78, 563, 0.4239)
sim(2)=$lb(2,":FILE:c:\file3.txt", 0.3967, 0.7704, 184, 73, 318, 0.3967)
sim(49)=$lb(49,":FILE:c:\file49.txt", 0.3967, 0.7704, 184, 73, 318, 0.3967)


But for some documents the result has fewer than 49 lines. Does that mean the score for the missing documents is 0?

Is there any way to force it to return the score even if the score is "too low"?



Hi Benjamin,

the default algorithm indeed doesn't return a score for every record; it only runs the calculation for records that contain at least a decent number of the entities that are relevant in the source document. You can simply approximate the score of the other documents as 0.
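As a rough sketch of that approximation (in Python purely to illustrate the arithmetic, not iKnow code): assume the rows returned by GetSimilar have been copied into a dict that maps source ID to score, and that you keep your own lists of good and bad source IDs. The names `mean_similarity`, `closer_group`, `good_ids` and `bad_ids` are hypothetical, the good/bad labels are made up, and a plain mean stands in for whatever weighting you actually use. Any source that GetSimilar did not return is counted as 0.

```python
def mean_similarity(returned_scores, group_ids):
    """Mean score over group_ids; sources missing from returned_scores count as 0."""
    if not group_ids:
        return 0.0
    total = sum(returned_scores.get(src_id, 0.0) for src_id in group_ids)
    return total / len(group_ids)


def closer_group(returned_scores, good_ids, bad_ids):
    """Return "good", "bad" or "tie" depending on which group has the higher mean."""
    good = mean_similarity(returned_scores, good_ids)
    bad = mean_similarity(returned_scores, bad_ids)
    if good == bad:
        return "tie"
    return "good" if good > bad else "bad"


# Scores for sources 1 and 2 are taken from the sample output in the question;
# sources 3, 4 and 5 were (hypothetically) not returned by GetSimilar.
returned_scores = {1: 0.4239, 2: 0.3967}
good_ids = [1, 2, 3]  # assumed labels, for illustration only
bad_ids = [4, 5]      # assumed labels, for illustration only

print(mean_similarity(returned_scores, good_ids))
print(mean_similarity(returned_scores, bad_ids))
print(closer_group(returned_scores, good_ids, bad_ids))
```

With these made-up labels the good group averages roughly 0.27 (source 3 was not returned and counts as 0) and the bad group averages 0, so the sketch reports "good". Dividing by the full group size rather than by the number of returned rows is what makes the missing documents count as zeros.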

For your specific use case, you may want to take a look at the text categorization infrastructure. I've posted a tutorial on the topic here.