How to tokenize a text using SentenceTransformer?

Question

Kurro Lopez · Mar 25

#Docker #JSON #Python #Vector Search #InterSystems IRIS

Hi all.

I'm trying to create an indexed table with a vector field so I can search by vector value.
From what I've found, to get the vector value for a piece of text (the token), you use a Python method like the following:

ClassMethod TokenizeData(desc As %String) As %String [ Language = python ]
{
    import iris
    # Step 2: Generate Document Embeddings
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('/opt/irisbuild/all-MiniLM-L6-v2')

    # Generate embeddings for each document
    document_embeddings = model.encode(desc)

    return document_embeddings.tolist()
}
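
The method above is declared `As %String` but returns a Python list, which may not convert to a string the way one would expect. A minimal sketch of one way to bridge that, assuming the value is later handed to something that accepts a comma-separated string of numbers (for example a SQL vector conversion function; that is an assumption, not something confirmed in this thread). The helper name `embedding_to_vector_string` is hypothetical:

```python
def embedding_to_vector_string(values):
    """Join numeric values into a comma-separated string such as '0.1,0.2,0.3'."""
    # float() normalizes ints and NumPy scalars; repr() keeps full precision.
    return ",".join(repr(float(v)) for v in values)
```

Inside the class method, the last line could then become `return embedding_to_vector_string(document_embeddings.tolist())`, so the declared return type and the actual value agree.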

I downloaded the all-MiniLM-L6-v2 model from https://ollama.com/library/all-minilm and installed it into my Docker instance.

When I tried to test this method (from Visual Studio), it threw the following error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'OSError'>: It looks like the config file at '/opt/irisbuild/all-MiniLM-L6-v2/config.json' is not a valid JSON file.

Then I changed the config.json file to make it valid JSON (I wrote only the curly braces) and repeated the test, but got a new error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'safetensors_rust.SafetensorError'>: Error while deserializing header: HeaderTooSmall

Does anyone know how to fix this problem?
Is there any other way to create the vector value so I can index it?

Best regards.

Product version: IRIS 2024.1

$ZV: IRIS for UNIX (Ubuntu Server LTS for x86-64 Containers) 2024.1.2 (Build 398U) Thu Oct 3 2024 14:20:43 EDT

Discussion (2)

Luis Angel Pére... · Mar 25

https://es.community.intersystems.com/post/%C2%BFc%C3%B3mo-tokenizar-un-...

score 0 · Answer 1 · 2025-03-25T08:21:53-04:00

Kurro Lopez · Mar 25

Note: If I run this method with Python 3 in Jupyter, it works.
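
Since the same code works outside the container, it may be worth checking whether the model files inside the container are intact. Below is a minimal diagnostic sketch using only the Python standard library. It assumes the folder holds the usual `.json` and `.safetensors` files, and that a safetensors file begins with an 8-byte little-endian header length followed by a JSON header. A file shorter than that would be consistent with the `HeaderTooSmall` error above. The function names `inspect_model_dir` and `check_safetensors_header` are hypothetical:

```python
import json
import struct
from pathlib import Path


def check_safetensors_header(path, size):
    """Return a short status string describing the safetensors header."""
    # Assumed layout: 8-byte little-endian length, then that many bytes of JSON.
    if size < 8:
        return "too small to hold a safetensors header"
    with open(path, "rb") as handle:
        (header_len,) = struct.unpack("<Q", handle.read(8))
        if header_len > size - 8:
            return "header length exceeds file size"
        try:
            json.loads(handle.read(header_len).decode("utf-8"))
        except ValueError:  # covers JSON and UTF-8 decoding errors
            return "header is not valid JSON"
    return "header looks plausible"


def inspect_model_dir(model_dir):
    """Return a list of (file name, size in bytes, status) for a model folder."""
    report = []
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        status = "not checked"
        if path.suffix == ".json":
            try:
                json.loads(path.read_text(encoding="utf-8"))
                status = "valid JSON"
            except ValueError:  # covers JSON and UTF-8 decoding errors
                status = "invalid JSON"
        elif path.suffix == ".safetensors":
            status = check_safetensors_header(path, size)
        report.append((path.name, size, status))
    return report
```

Running `inspect_model_dir('/opt/irisbuild/all-MiniLM-L6-v2')` inside the container and again on the folder used from Jupyter, then comparing file names, sizes, and statuses, should show whether the two copies of the model actually match.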
