How to tokenize a text using SentenceTransformer?
Hi all.
I'm trying to create an indexed table with a vector field so I can search by the vector value.
I've been investigating and found that, to get the vector value (embedding) for a given text, I can use an embedded Python method like the following:
ClassMethod TokenizeData(desc As %String) As %String [ Language = python ]
{
    import iris
    # Step 2: Generate Document Embeddings
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('/opt/irisbuild/all-MiniLM-L6-v2')
    # Generate embeddings for each document
    document_embeddings = model.encode(desc)
    return document_embeddings.tolist()
}
The model all-MiniLM-L6-v2 was downloaded from https://ollama.com/library/all-minilm and installed into my Docker instance.
When I tried to test this method (from Visual Studio), it threw the following error:
<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'OSError'>: It looks like the config file at '/opt/irisbuild/all-MiniLM-L6-v2/config.json' is not a valid JSON file.
Then I edited config.json to make it valid JSON (it now contains only an empty pair of curly braces) and repeated the test, but got a new error:
<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'safetensors_rust.SafetensorError'>: Error while deserializing header: HeaderTooSmall
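Both errors appear to concern the files inside the model folder rather than the method itself. The following is a minimal sketch, assuming the folder is expected to hold a Hugging Face style model (a populated config.json plus one or more .safetensors weight files), that inspects those files using only the standard library. check_model_dir is a hypothetical helper written for this post; it is not part of sentence_transformers or IRIS.

```python
import json
import struct
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of human-readable problems found in a local model folder."""
    problems = []
    root = Path(model_dir)

    # config.json must parse as a JSON object and should not be empty.
    config_path = root / "config.json"
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
        if not isinstance(config, dict) or not config:
            problems.append(f"{config_path}: valid JSON, but it holds no settings")
    except (OSError, ValueError) as exc:
        problems.append(f"{config_path}: {exc}")

    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by a JSON header of exactly that many bytes.
    for weights_path in sorted(root.glob("*.safetensors")):
        file_size = weights_path.stat().st_size
        with weights_path.open("rb") as fh:
            prefix = fh.read(8)
            if len(prefix) < 8:
                problems.append(f"{weights_path}: only {file_size} bytes, no header")
                continue
            (header_len,) = struct.unpack("<Q", prefix)
            if header_len > file_size - 8:
                problems.append(f"{weights_path}: header length exceeds file size")
                continue
            try:
                json.loads(fh.read(header_len))
            except ValueError:
                problems.append(f"{weights_path}: header is not valid JSON")

    return problems
```

Calling check_model_dir('/opt/irisbuild/all-MiniLM-L6-v2') inside the container returns an empty list when these basic checks pass. A weights file of only a few bytes, or a config.json that is empty or not JSON at all, would suggest the download did not produce the files that SentenceTransformer expects to load from a local path.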
Does anyone know how to fix this problem?
Is there any other way to create the vector value so I can index it?
Best regards.
Comments
Note: If I run the same code with Python 3 in Jupyter, it works.
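Since the same code behaves differently in the two places, it may help to compare the two Python environments. This is a rough sketch, assuming the relevant packages are sentence_transformers, transformers and safetensors; describe_environment is a hypothetical helper name. Running it once in Jupyter and once from an embedded Python method shows whether the interpreter or package versions differ.

```python
import sys
from importlib import metadata, util


def describe_environment(packages=("sentence_transformers", "transformers", "safetensors")):
    """Collect the interpreter path, Python version and package versions."""
    info = {"executable": sys.executable, "python": sys.version.split()[0]}
    for name in packages:
        if util.find_spec(name) is None:
            info[name] = "not installed"
            continue
        try:
            # Distribution names use hyphens where import names use underscores.
            info[name] = metadata.version(name.replace("_", "-"))
        except metadata.PackageNotFoundError:
            info[name] = "installed, version unknown"
    return info
```

It is also worth checking whether the Jupyter run loads the model from the same local folder, or from a model name that is downloaded on first use.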