Article · Jan 12 · 7m read

Textual Similarity Comparison using IRIS, Python, and Sentence Transformers

With the advent of Embedded Python, a myriad of use cases are now possible from within IRIS directly using Python libraries for more complex operations. One such operation is the use of natural language processing tools such as textual similarity comparison.


Setting up Embedded Python to Use the Sentence Transformers Library

Note: For this article, I will be using a Linux system with IRIS installed. Some of the steps for using Embedded Python, such as installing libraries, differ between Linux and Windows, so please refer to the IRIS documentation for the proper procedures on Windows. Once the libraries are installed, however, the IRIS code and setup are the same on both platforms. This article uses Red Hat Enterprise Linux 9 and IRIS 2023.1.

The Python library used for this article will be Sentence Transformers (https://github.com/UKPLab/sentence-transformers). To be able to use any Python library from within IRIS, it must be installed in such a way that IRIS can make use of it. On Linux, this can be done with the standard Python library installation command, but with a target of the IRIS instance's Python directory. Note that for Sentence Transformers, your system must also already have a Rust compiler installed as a prerequisite for one of the dependencies installed alongside sentence-transformers (https://www.rust-lang.org/tools/install).

sudo python -m pip install sentence-transformers --target [IRIS install directory]/sys/mgr/python

Once installed, Sentence Transformers can be used directly within the IRIS Python shell.


Text Comparison Using the Sentence Transformers Library

In order to compare two separate texts, we first need to generate an embedding for each text block. An embedding is a numerical representation of the text, based on its linguistic construction under the given language model. Sentence Transformers offers several different pre-trained language models (https://www.sbert.net/docs/pretrained_models.html). In this case, we'll use the "all-MiniLM-L6-v2" model, which is listed as an "All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs." and should be sufficient for this article. There are many other pre-trained models to choose from, however, including multilingual models that can handle many different languages, so choose the model that works best for your specific use case.

Calling model.encode generates the embedding, converting the given string of text into a tensor object (a form of Python vector) that can then be used in the comparison function.

Once the embeddings for each text block have been created, they can be passed to the cos_sim comparison function, which produces a score between -1 and 1; for typical natural-language text the score lands between 0 and 1, with higher values indicating greater similarity between the given texts.

In the example I ran in the shell, the two text strings scored roughly 0.64, meaning they are about 64% similar based on their linguistic structure.


Creating and Storing Text Embeddings using IRIS Embedded Python and Sentence Transformers

We can incorporate this same process into an IRIS class and store the embeddings for future use, such as comparing an existing text to a newly entered text.

Create a new class in IRIS that extends %Persistent so we can store data against the subsequent table.

Create two string properties on the class: one for the text itself and another to store the embedding data. It is important to include MAXLEN = 100000 on the Embedding property; otherwise the embedding data will not fit in the field and the comparison will fail. The MAXLEN on the Text property will depend on how much text your use case may need to store.

Class TextSimilarity.TextSimilarity Extends %Persistent
{
    Property Text As %String(MAXLEN = 100000);
    Property Embedding As %String(MAXLEN = 100000);
}

Note: In the current version of IRIS as of writing this article (IRIS 2023), the embedding data must be converted from the tensor object, a form of Python vector, to a string for storage since no current native datatypes are compatible with the Python vector. In the near future (expected IRIS 2024.1 as of writing this article), a native vector datatype is being added to IRIS that will allow storage of Python vectors directly within IRIS without the need for string conversion. 

Create a new ClassMethod that takes in a string, creates the embedding, converts the embedding to a string value, and then returns the resulting string. Use the [ Language = python ] method keyword as shown below to indicate to IRIS that this method is written in native Python rather than ObjectScript.

ClassMethod GetTextEmbedding(text As %String) As %String [ Language = python ]
{
    from sentence_transformers import SentenceTransformer
    from pickle import dumps

    # Generate the embedding tensor for the given text
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embedding = model.encode(text, convert_to_tensor=True)

    # Convert tensor -> list -> pickled bytes (protocol 0) -> UTF-8 string
    embeddingList = embedding.tolist()
    pickledObj = dumps(embeddingList, 0)
    pickledStr = pickledObj.decode('utf-8')
    return pickledStr
}

In the method above, the first part of the code is similar to what was done in the Python shell to create the embedding. Once the tensor object has been created from model.encode, the conversion to a string proceeds as:

  1. Convert the tensor object obtained from model.encode to a Python list object
  2. Convert the list object to a pickled object using the dumps function from the built-in pickle library
     Note: Make sure to include the 0 parameter as shown above; it forces dumps to use the oldest, ASCII-based pickle protocol, which survives the UTF-8 string conversion
  3. Decode the pickled object to a UTF-8 encoded string

Now, the tensor object has been fully converted to a string and can be returned to the calling function.
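The round trip itself can be sketched in plain Python, independent of IRIS and the model (the short list below stands in for a real embedding):

```python
from pickle import dumps, loads

# A stand-in for an embedding converted to a list
embeddingList = [0.12, -0.34, 0.56]

# Protocol 0 produces ASCII-safe bytes that survive a UTF-8 round trip
pickledStr = dumps(embeddingList, 0).decode('utf-8')

# Reversing the process: string -> bytes -> original list
recovered = loads(bytes(pickledStr, 'utf-8'))
print(recovered == embeddingList)  # True
```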

Next, create a ClassMethod in ObjectScript that takes in the text, sets up a new object for database storage, calls the GetTextEmbedding method, and then saves the text and embedding to the database.

ClassMethod NewText(text As %String)
{
    set newTextObj = ..%New()
    set newTextObj.Text = text
    set newTextObj.Embedding = ..GetTextEmbedding(text)
    set status = newTextObj.%Save()
    write status
}

Now we have a way to send a text string into the class, have it create the embeddings, and store both to the database.


Comparing Text Using IRIS Embedded Python and Sentence Transformers

Moving on to the comparison, I will again create a Python class method to convert the embedding strings back into tensors and run the comparison, along with an ObjectScript class method to retrieve the data from the database and call the Python similarity method.

First, the Python similarity method.

ClassMethod GetSimilarity(embed1 As %String, embed2 As %String) As %String [ Language = python ]
{
    from sentence_transformers import util
    from pickle import loads
    from torch import tensor

    # Reverse the storage conversion: string -> bytes -> list
    embed1Bytes = bytes(embed1, 'utf-8')
    embed1List = loads(embed1Bytes)

    embed2Bytes = bytes(embed2, 'utf-8')
    embed2List = loads(embed2Bytes)

    # Rebuild the tensor objects expected by cos_sim
    tensor1 = tensor(embed1List)
    tensor2 = tensor(embed2List)

    # cos_sim returns a 1x1 tensor of similarity scores
    cosineScores = util.cos_sim(tensor1, tensor2)

    return str(abs(cosineScores.tolist()[0][0]))
}

In the above method, each embedding string is converted back to a byte string using the bytes function (again with UTF-8, since that is what was used to decode them during the storage conversion). The loads function from the pickle library then converts the byte string back into a Python list object. Finally, the tensor function from the torch library completes the conversion by turning the list into a proper tensor object, ready to be used in the cosine similarity comparison.

Then, the cos_sim function from the util module of Sentence Transformers is called to compare the tensors and return the similarity score. This return value is itself formatted as a tensor, so it needs to be converted to a list and indexed to its first and only element. Then, because cosine similarity is geometric in nature and can be negative, we take the absolute value and finally convert the resulting decimal value to a string to pass back to the calling function.
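What cos_sim computes can be illustrated in plain Python: the dot product of the two vectors divided by the product of their magnitudes (the sample vectors here are hypothetical):

```python
from math import sqrt

def cosine_similarity(a, b):
    # Dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Product of the vector magnitudes
    mag = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / mag

# Parallel vectors score 1, orthogonal vectors score 0
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

For real sentence embeddings the score can also dip below zero, which is why the method takes the absolute value before returning.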

From here, we can then create the ObjectScript function to handle the database operations to retrieve the specific text objects and call the similarity function.

ClassMethod Compare(id1 As %Integer, id2 As %Integer)
{
    set obj1 = ..%OpenId(id1)
    set text1 = obj1.Text
    set embedding1 = obj1.Embedding

    set obj2 = ..%OpenId(id2)
    set text2 = obj2.Text
    set embedding2 = obj2.Embedding

    set sim = ..GetSimilarity(embedding1, embedding2)

    write !,"Text 1:",!,text1,!
    write !,"Text 2:",!,text2,!
    write !,"Similarity: ",sim
}

This method will open each object using the provided row ID, get the text and embedding data, pass the embeddings to the GetSimilarity method to get the similarity score, and then write out each text and the similarity.


Testing

To test, I will use the synopsis of the movie Hot Fuzz from two different sources (IMDB and Wikipedia) to see how similar they are. If everything is working as expected, the similarity score should be sufficiently high.

First, I'll store each synopsis in the database using the NewText method.

Once stored, each row holds the original text alongside its pickled embedding string. (In my case the IDs start at 3 because I had a bug in my initial code and had to redo the save.)

Now that our text data has been stored along with its associated embedding, we can call the Compare method with IDs 3 and 4 to get the final comparison score.

As a result, we see that the IMDB and Wikipedia synopses for Hot Fuzz are about 75% similar, which I would consider accurate given how much more text the IMDB synopsis contains.


Closing

This is just one of the many functions available in Sentence Transformers, plus there are numerous other NLP toolkits out there that can carry out various functions relating to text and language analysis. The point of this article is to show that if you can do it in Python, you can do it in IRIS as well using Embedded Python.
