Using vector search for duplicate patient detection

#Embedded Python #Vector Search #HealthShare

I recently had to refresh my knowledge of the HealthShare EMPI module, and since I've been tinkering with IRIS's vector storage and search functionality for a while, I just had to put two and two together.

For those of you who are not familiar with EMPI functionality, here's a short introduction:

Enterprise Master Patient Index

In general, all EMPIs work in a very similar way: they ingest information, normalize it and compare it with the data already present in the system. In HealthShare's EMPI, this process is known as NICE:

Normalization: all texts ingested from the interoperability production are normalized by removing special characters.
Indexing: indexes are generated from a selection of demographic data to speed up the search for matches.
Comparison: the candidates found through the indexes are compared field by field against the incoming demographic data, and weights are assigned according to how closely each field matches.
Evaluation: the sum of the weights obtained is used to decide whether the patients can be linked.
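
To make the four steps concrete, here is a minimal Python sketch of a weight-based comparison. It is not HealthShare's implementation: the field names, the weights, the index key and the threshold are all hypothetical values chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"name": 5.0, "birth_date": 4.0, "phone": 3.0}
LINK_THRESHOLD = 8.0


def normalize(text):
    """Normalization: upper-case, strip accents and special characters."""
    decomposed = unicodedata.normalize("NFKD", text.upper())
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^A-Z0-9 ]", "", without_accents).strip()


def index_key(patient):
    """Indexing: a coarse key that limits which records get compared."""
    return normalize(patient["name"])[:3] + patient["birth_date"][:4]


def compare(a, b):
    """Comparison: add the weight of every field that matches after normalization."""
    return sum(weight for field, weight in WEIGHTS.items()
               if normalize(a[field]) == normalize(b[field]))


def evaluate(a, b):
    """Evaluation: propose a link when the summed weights reach the threshold."""
    return index_key(a) == index_key(b) and compare(a, b) >= LINK_THRESHOLD
```

Choosing and adjusting values like `WEIGHTS` and `LINK_THRESHOLD` is the kind of tuning work that the vector-based approach below tries to avoid.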

If you want to know more about HealthShare Patient Index, you can review a series of articles that I wrote some time ago here.

What is the challenge?

While the setup process to obtain the possible links is not extremely complicated, I wondered: would it be possible to obtain similar results by using the vector storage and search functionality, and so skip the weight adjustment steps? Let's get to work!

What do I need to implement my idea?

I will need the following ingredients:

IRIS for Health to implement the functionality and make use of the interoperability engine for HL7 messaging ingestion.
A Python library to generate the vectors from the demographic data; in this case, sentence-transformers.
A model for generating the embeddings; after a quick search on Hugging Face I chose all-MiniLM-L6-v2.

Interoperability Production

The first step is to configure the production in charge of ingesting the HL7 messages, transforming them into messages with the patient's most relevant demographic data, generating the embeddings from that data and, finally, generating the response message with the possible matches.

Let's take a look at our production:

As you can see, it couldn't be simpler. A Business Service, HL7FileService, retrieves HL7 messages from a directory (/shared/in) and sends them to the Business Process EMPI.BP.FromHL7ToPatientRequestBPL, which builds a message with the patient's demographic data. That message is then sent to a second BP, EMPI.BP.VectorizationBP, which vectorizes the demographic information and runs the vector search, returning a message with all the possible duplicate patients.

As you can see, the BP FromHL7ToPatientRequestBPL is very simple:

We transform the HL7 message into a message that we have created to store the demographic data that we have considered most relevant.
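
As a rough illustration of that transformation, the following Python sketch extracts the same four strings from a PID segment. The field positions follow HL7 v2 (PID-5 name, PID-7 birth date, PID-8 sex, PID-11 address, PID-13 contact), but how the components are joined into a single string is an assumption, and `patient_from_pid` is a hypothetical helper rather than part of the production.

```python
def patient_from_pid(pid_segment):
    """Extract the four demographic strings from an HL7 v2 PID segment.

    Hypothetical helper: joining components with spaces is an assumption.
    """
    fields = pid_segment.split("|")

    def field(n):
        # Missing trailing fields are treated as empty.
        return fields[n] if n < len(fields) else ""

    def join_components(value):
        # Use only the first repetition and drop empty components.
        first_repetition = value.split("~")[0]
        return " ".join(part for part in first_repetition.split("^") if part)

    return {
        "Name": join_components(field(5)),
        "Address": join_components(field(11)),
        "Contact": join_components(field(13)),
        "BirthDateAndSex": " ".join(v for v in (field(7), field(8)) if v),
    }


pid = ("PID|||1220395631^^^SERMAS^SN~402413^^^HULP^PI||FERNÁNDEZ LÓPEZ^JOSÉ MARÍA^^^||"
       "19700611|M|||PASEO JUAN FERNÁNDEZ^183 2 A^LEGANÉS^CÁDIZ^28566^SPAIN||"
       "555749170^PRN^^JOSE-MARIA.FERNANDEZ@GMAIL.COM")
patient = patient_from_pid(pid)
print(patient["Name"])             # FERNÁNDEZ LÓPEZ JOSÉ MARÍA
print(patient["BirthDateAndSex"])  # 19700611 M
```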

Messages between components

We have created two specific message types:

EMPI.Message.PatientRequest

Class EMPI.Message.PatientRequest Extends (Ens.Request, %XML.Adaptor)
{

Property Patient As EMPI.Object.Patient;
}

EMPI.Message.PatientResponse

Class EMPI.Message.PatientResponse Extends (Ens.Response, %XML.Adaptor)
{

Property Patients As list Of EMPI.Object.Patient;
}

This message type contains a list of possible duplicate patients.

Let's see the definition of EMPI.Object.Patient class:

Class EMPI.Object.Patient Extends (%SerialObject, %XML.Adaptor)
{

Property Name As %String(MAXLEN = 1000);
Property Address As %String(MAXLEN = 1000);
Property Contact As %String(MAXLEN = 1000);
Property BirthDateAndSex As %String(MAXLEN = 100);
Property SimilarityName As %Double;
Property SimilarityAddress As %Double;
Property SimilarityContact As %Double;
Property SimilarityBirthDateAndSex As %Double;
}

Name, Address, Contact and BirthDateAndSex are the properties in which we store the most relevant patient demographic data. The four Similarity properties hold the per-field scores returned by the vector search.

Now for the magic: the embedding generation and the vector search in the production.

EMPI.BP.VectorizationBP

Embedding and vector search

Once the PatientRequest is received, we generate the embeddings using a Python method:

Method VectorizePatient(name As %String, address As %String, contact As %String, birthDateAndSex As %String) As %String [ Language = python ]
{
    import iris
    import os
    import sentence_transformers

    try:
        # Download the model only on the first call and keep a local copy for later calls.
        if not os.path.isdir("/iris-shared/model/"):
            model = sentence_transformers.SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
            model.save('/iris-shared/model/')
        model = sentence_transformers.SentenceTransformer("/iris-shared/model/")
        # One unit-length embedding per demographic field.
        embeddingName = model.encode(name, normalize_embeddings=True).tolist()
        embeddingAddress = model.encode(address, normalize_embeddings=True).tolist()
        embeddingContact = model.encode(contact, normalize_embeddings=True).tolist()
        embeddingBirthDateAndSex = model.encode(birthDateAndSex, normalize_embeddings=True).tolist()

        # Persist the original text together with its embeddings.
        stmt = iris.sql.prepare("INSERT INTO EMPI_Object.PatientInfo (Name, Address, Contact, BirthDateAndSex, VectorizedName, VectorizedAddress, VectorizedContact, VectorizedBirthDateAndSex) VALUES (?,?,?,?, TO_VECTOR(?,DECIMAL), TO_VECTOR(?,DECIMAL), TO_VECTOR(?,DECIMAL), TO_VECTOR(?,DECIMAL))")
        rs = stmt.execute(name, address, contact, birthDateAndSex, str(embeddingName), str(embeddingAddress), str(embeddingContact), str(embeddingBirthDateAndSex))
        return "1"
    except Exception as err:
        iris.cls("Ens.Util.Log").LogInfo("EMPI.BP.VectorizationBP", "VectorizePatient", repr(err))
        return "0"
}

Let's analyze the code:

With the sentence-transformers library we download the all-MiniLM-L6-v2 model and save it to the local machine (to avoid further connections over the Internet).
Once the model is loaded, we generate the embeddings for the demographic fields using the encode method.
Using the iris library we execute the insert query to persist the patient's data and embeddings.
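
Two details of this method are easy to miss: `normalize_embeddings=True` scales each embedding to unit length, and calling `str()` on the resulting Python list produces the bracketed, comma-separated text that is passed to `TO_VECTOR`. The sketch below reproduces both steps in plain Python on a toy 3-dimensional vector (the real model produces 384 dimensions); `l2_normalize` is a hypothetical helper used only for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length, as normalize_embeddings=True does."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


toy_embedding = l2_normalize([3.0, 0.0, 4.0])  # the norm of the input is 5
serialized = str(toy_embedding)                # text form handed to TO_VECTOR
print(serialized)                              # [0.6, 0.0, 0.8]
```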

Searching duplicated patients

Now that the patients are stored along with the embeddings generated from their demographic data, let's query them!

Method OnRequest(pInput As EMPI.Message.PatientRequest, Output pOutput As EMPI.Message.PatientResponse) As %Status
{
    try{
        set result = ..VectorizePatient(pInput.Patient.Name, pInput.Patient.Address, pInput.Patient.Contact, pInput.Patient.BirthDateAndSex)
        set pOutput = ##class(EMPI.Message.PatientResponse).%New()
        if (result = 1)
        {
            set sql = "SELECT * FROM (SELECT p1.Name, p1.Address, p1.Contact, p1.BirthDateAndSex, VECTOR_DOT_PRODUCT(p1.VectorizedName, p2.VectorizedName) as SimilarityName, VECTOR_DOT_PRODUCT(p1.VectorizedAddress, p2.VectorizedAddress) as SimilarityAddress, "_
                    "VECTOR_DOT_PRODUCT(p1.VectorizedContact, p2.VectorizedContact) as SimilarityContact, VECTOR_DOT_PRODUCT(p1.VectorizedBirthDateAndSex, p2.VectorizedBirthDateAndSex) as SimilarityBirthDateAndSex "_
                    "FROM EMPI_Object.PatientInfo p1, EMPI_Object.PatientInfo p2 WHERE p2.Name = ? AND p2.Address = ?  AND p2.Contact = ? AND p2.BirthDateAndSex = ?) "_
                    "WHERE SimilarityName > 0.8 AND SimilarityAddress > 0.8 AND SimilarityContact > 0.8 AND SimilarityBirthDateAndSex > 0.8"
            set statement = ##class(%SQL.Statement).%New(), statement.%ObjectSelectMode = 1
            set status = statement.%Prepare(sql)
            if ($$$ISOK(status)) {
                set resultSet = statement.%Execute(pInput.Patient.Name, pInput.Patient.Address, pInput.Patient.Contact, pInput.Patient.BirthDateAndSex)
                if (resultSet.%SQLCODE = 0) {
                    while (resultSet.%Next() '= 0) {
                        set patient = ##class(EMPI.Object.Patient).%New()
                        set patient.Name = resultSet.%GetData(1)
                        set patient.Address = resultSet.%GetData(2)
                        set patient.Contact = resultSet.%GetData(3)
                        set patient.BirthDateAndSex = resultSet.%GetData(4)
                        set patient.SimilarityName = resultSet.%GetData(5)
                        set patient.SimilarityAddress = resultSet.%GetData(6)
                        set patient.SimilarityContact = resultSet.%GetData(7)
                        set patient.SimilarityBirthDateAndSex = resultSet.%GetData(8)
                        do pOutput.Patients.Insert(patient)
                    }
                }
            }
        }
    }
    catch ex {
        do ex.Log()
    }
    return $$$OK
}

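The method above runs the query. For our example we require a similarity greater than 0.8 on every demographic field, but each threshold could be adjusted to tune the query. Since the embeddings were normalized when they were generated, the dot product is equivalent to the cosine similarity. Note that the incoming patient has just been inserted, so it always appears among its own matches.

Let's try it out by feeding in the file messagesa28.hl7, which contains HL7 messages like these: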
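
The filter applied by the query can be mirrored in a few lines of plain Python: a stored patient is kept only if each of the four per-field dot products exceeds the threshold. The toy 2-dimensional vectors and the `is_candidate` helper are made up for illustration and stand in for the real embeddings.

```python
FIELDS = ("Name", "Address", "Contact", "BirthDateAndSex")
THRESHOLD = 0.8


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def is_candidate(incoming, stored, threshold=THRESHOLD):
    """Keep a stored patient only if every field similarity exceeds the threshold."""
    return all(dot(incoming[f], stored[f]) > threshold for f in FIELDS)


# Toy unit-length vectors standing in for real embeddings.
incoming = {f: [1.0, 0.0] for f in FIELDS}
near_duplicate = {f: [0.96, 0.28] for f in FIELDS}            # similarity 0.96 on every field
different_address = dict(near_duplicate, Address=[0.0, 1.0])  # address similarity 0.0

print(is_candidate(incoming, near_duplicate))     # True
print(is_candidate(incoming, different_address))  # False
```

Requiring all four fields to pass is strict: a patient whose address has changed would be missed, which is one reason per-field thresholds may be worth tuning.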

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|269304|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1220395631^^^SERMAS^SN~402413^^^HULP^PI||FERNÁNDEZ LÓPEZ^JOSÉ MARÍA^^^||19700611|M|||PASEO JUAN FERNÁNDEZ^183 2 A^LEGANÉS^CÁDIZ^28566^SPAIN||555749170^PRN^^JOSE-MARIA.FERNANDEZ@GMAIL.COM|||||||||||||||||N|
PV1||N

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|570814|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1122730333^^^SERMAS^SN~018565^^^HULP^PI||GONZÁLEZ GARCÍA^MARÍA^^^||19660812|F|||CALLE JOSÉ MARÍA FERNÁNDEZ^281 8 IZQUIERDA^MADRID^BARCELONA^28057^SPAIN||555386663^PRN^^MARIA.GONZALEZ@GMAIL.COM|||||||||||||||||N|
PV1||N
DG1|1||T001^TRAUMATISMOS SUPERF AFECTAN TORAX CON ABDOMEN, REG LUMBOSACRA Y PELVIS^||20241120|||||||||||^CONTRERAS ÁLVAREZ^ENRIQUETA^^^Dr|

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|40613|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1007179467^^^SERMAS^SN~122688^^^HULP^PI||OLIVA OLIVA^JIMENA^^^||19620222|F|||CALLE ANTONIO ÁLVAREZ^51 3 D^MÉRIDA^MADRID^28253^SPAIN||555638305^PRN^^JIMENA.OLIVA@VODAFONE.COM|||||||||||||||||N|
PV1||N
DG1|1||Q059^ESPINA BIFIDA, NO ESPECIFICADA^||20241120|||||||||||^SANZ LÓPEZ^MARIO^^^Dr|

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|61768|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1498973060^^^SERMAS^SN~719939^^^HULP^PI||PÉREZ CABEZUELA^DIANA^^^||19820309|F|||AVENIDA JULIA ÁLVAREZ^253 1 A^PERELLONET^BADAJOZ^28872^SPAIN||555705148^PRN^^DIANA.PEREZ@YAHOO.COM|||||||||||||||||N|
PV1||N
AL1|1|MA|^Polen de gramineas^|SV^^||20340919051523

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|128316|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1632386689^^^SERMAS^SN~601379^^^HULP^PI||GARCÍA GARCÍA^MARIO^^^||19550603|M|||PASEO JOSÉ MARÍA TREVIÑO^153 3 D^JEREZ DE LA FRONTERA^MADRID^28533^SPAIN||555231628^PRN^^MARIO.GARCIA@GMAIL.COM|||||||||||||||||N|
PV1||N

In this file all the patients are different, so the result of the operation will look like this:

The only match is the patient himself. Now let's feed in the HL7 messages from messagesa28Duplicated.hl7, which contains duplicated patients:

As you can see, the code has detected the duplicated patient despite minor differences in the name (Maruchi is an affectionate nickname for María, and Mª is its abbreviation). This case is oversimplified, but it gives an idea of how vector search can detect duplicated data, not only for patients but for any other type of information.

Next steps...

For this example I used a general-purpose model to generate the embeddings, but the results would likely improve with a model fine-tuned on nicknames, monikers, etc.
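
One possible way to prepare data for such fine-tuning is to turn a table of known name variants into pairs of strings that should end up close together in the embedding space. The sketch below only builds those pairs; the variant table is a tiny hypothetical sample, `build_positive_pairs` is a made-up helper, and the training step itself is left out.

```python
from itertools import combinations

# Hypothetical sample of equivalent name forms; a real table would be far larger.
NAME_VARIANTS = {
    "MARÍA": ["MARUCHI", "Mª"],
    "JOSÉ": ["PEPE"],
}


def build_positive_pairs(variants):
    """Return every unordered pair of forms that refer to the same name."""
    pairs = []
    for canonical, aliases in variants.items():
        forms = [canonical, *aliases]
        pairs.extend(combinations(forms, 2))
    return pairs


pairs = build_positive_pairs(NAME_VARIANTS)
print(len(pairs))  # 4
```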

Thank you for your attention!