Question Timothy Leavitt · Jul 28, 2021

%SIMILARITY variations (Okapi BM25+) and iKnow/iFind

#InterSystems Natural Language Processing (NLP, iKnow) #InterSystems IRIS #iFind #SQL

I'm working on an application that uses %SIMILARITY to find matches among a set of documents that vary greatly in length. It generally works well, but I've noticed that it sometimes ranks short, partially matching documents above longer documents that match the search string entirely.

Reading up on the Okapi BM25 ranking function (which is what %SIMILARITY / the %Text package use) at https://en.wikipedia.org/wiki/Okapi_BM25 I see mention of the BM25+ modification, which "was developed to address one deficiency of the standard BM25 in which the component of term frequency normalization by document length is not properly lower-bounded; as a result of this deficiency, long documents which do match the query term can often be scored unfairly by BM25 as having a similar relevancy to shorter documents that do not contain the query term at all." This seems like exactly what I need.
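To make the quoted deficiency concrete, here is a minimal Python sketch of the per-term score with and without the BM25+ lower bound. It follows the formulas on the Wikipedia page rather than the actual %SIMILARITY / %Text implementation, which may differ. The IDF variant, the parameter defaults (k1 = 1.2, b = 0.75, delta = 1.0) and all sample numbers are assumptions chosen for illustration only.

```python
import math


def idf(n_docs, doc_freq):
    # Non-negative IDF variant (an assumption; %SIMILARITY may use another form).
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


def term_score(tf, doc_len, avg_len, k1=1.2, b=0.75, delta=0.0):
    # delta = 0.0 gives plain BM25; delta > 0 gives BM25+.
    # The delta bonus applies only to terms that actually occur in the document.
    if tf == 0:
        return 0.0
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + norm) + delta


def score(query_terms, doc_tokens, doc_freqs, n_docs, avg_len, delta=0.0):
    # Sum of IDF-weighted term scores over the query terms.
    total = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        weight = idf(n_docs, doc_freqs.get(term, 0))
        total += weight * term_score(tf, len(doc_tokens), avg_len, delta=delta)
    return total


# Hypothetical corpus statistics and documents, made up for this example.
N_DOCS = 1000
AVG_LEN = 20
DOC_FREQS = {"okapi": 50, "ranking": 50}
QUERY = ["okapi", "ranking"]

short_doc = ["okapi", "notes"]                      # matches one query term
long_doc = ["okapi", "ranking"] + ["filler"] * 98   # matches both query terms

for label, delta in (("BM25", 0.0), ("BM25+", 1.0)):
    s_short = score(QUERY, short_doc, DOC_FREQS, N_DOCS, AVG_LEN, delta=delta)
    s_long = score(QUERY, long_doc, DOC_FREQS, N_DOCS, AVG_LEN, delta=delta)
    print(label, "short:", round(s_short, 3), "long:", round(s_long, 3))
```

With these made-up numbers, plain BM25 puts the short partial match ahead of the long full match, while delta = 1.0 reverses the order, because every matched term then contributes at least delta times its IDF no matter how long the document is.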

I'm likely to go down the rabbit hole of implementing BM25+ in a %Text.English subclass and will share my results in an article when I do... but before I do that, I'm curious whether iFind has some new-and-improved equivalent to %SIMILARITY, ideally one that would be a drop-in replacement for it. Has anyone worked with this sort of thing before?

Product version: IRIS 2021.1

Discussion (3)

Eduard Lebedyuk · Jul 28, 2021

Might be relevant.

Benjamin De Boe · Jul 29, 2021

The iFind search portal demo includes a simple class query to find similar documents within a single iFind index. It's pretty basic and somewhat picky (it assumes the demo setup), building on the dominance score for each entity, and it may not guard against the length-difference issue you're seeing with BM25. There is a similar method in iKnow if your data is already in an iKnow domain.

There would indeed be value in providing %SIMILARITY support for iFind indexed fields, leveraging the standard/enhanced algorithm on top of word tokens. I'll log that as an enhancement request and we can follow up internally. Obviously, I'm interested in the experiences or advice of other DC members here.

Timothy Leavitt · Jul 29, 2021 to Benjamin De Boe

Thank you!
