· Jun 7, 2023 15m read

DeDupe an InterSystems® FHIR® Server With FHIR SQL Builder and Zingg

This post backs the demonstration at Global Summit 2023 "Demos and Drinks" with details most likely lost in the noise of the event.

This is a demonstration on how to use the FHIR SQL Capabilities of InterSystems FHIR Server along side the Super Awesome Identity and Resolution Solution, to detect duplicate records in your FHIR repository, and the basic idea behind remediation of those resources with the under construction PID^TOO|| currently enrolled in the InterSystems Incubator program.  If you are into the "Compostable CDP" movement and want to master your FHIR Repository in place you may be in the right spot.


FHIR SQL Builder

This is an easy 3 step process for FHIR SQL.

  • Set up an Analysis
  • Set up a Transform
  • Set up a Projection

The documentation for Zingg is off the chain exhaustive and though extensible and scalable beyond this simple demo, here are the basics.

  • Find
  • Label
  • Train
  • Match
  • Link

Let's get on with it as they say...


We were awarded a full trial of the Health Connect Cloud suite for the duration of the incubator, included in that was the FHIR Server with FHIR SQL Enabled.

The FHIR SQL Builder (abbreviated “builder”) is a sophisticated projection tool to help developers create custom SQL schemas using data in their FHIR (Fast Healthcare Interoperability Resources) repository without moving the data to a separate SQL repository. The objective of the builder is to enable data analysts and business intelligence developers to work with FHIR using familiar analytic tools, such as ANSI SQL, Power BI, or Tableau, without having to learn a new query syntax. 

I know right? Super great, now our "builder" is going to go to work projecting the fields weed need to identify the duplicate the records with Zingg.

First step is the analysis, which you can hardly go wrong clicking this one into submission, simply point the analysis to "localhost" which is essentially the InterSystems FHIR Server underneath.

 The transforms are the critical piece to get right, you need to build these fields out to flatten the FHIR resources so they can be read to make some decisions over the SQL super server.  These most likely will need some thought as the more fields you transform to sql, the better your machine learning model will end up when prospecting for duplicates.

 Now, a data steward type should typically be building these, but I whooped up a few for the demo with questionable simplicity, this was done by using the "clown suit" by hitting the pencil to generate them, but considering the sophistication that may go into them, you can import and export them as well.  Pay special attention to the "Package" and "Name" as this will be the source table for your sql connection.

 This transform is essentially saying we want to use name and gender to detect duplicates.

Patient Transform example

Now, the last part is essentially scheduling the job to project the data to a target schema.  I think you will be presently surprised that as data is added to the FHIR Server, the projections fill automatically, phew.

Now to seal the deal with the setup of FHIR SQL, you can see the projection for Patient (PIDTOO.Patient) being visible, then you create another table to store the output from the dedupe run (PIDTOO.PatientDups).

Another step you will need to complete is enabling the Firewall so that external connections are enabled for your deployment, and you have allowed access for the source CIDR block connecting to the super server.


Mentally bookmark the overview page, as it has the connectivity information and credentials needed to connect before moving to the next step.

DeDupe with Zingg

Zingg is super powerful, OSS, runs on Spark and scales to your wallet when it comes to de-duplication of datasets large and small.  I don't want to oversimplify the task, but the reality is the documentation, functional container are enough to get up and running very quickly.  We will keep this to the brass tacks though to minimally point out what needs to be completed to execute your first de-duplication job fired off against an IRIS Database.


Clone the repo:

We also need the JDBC driver for Zingg to connect with IRIS. Download the IRIS JDBC driver and add the path of the driver to spark.jars property of zingg.conf... organize this jar in the `thirdParty/lib` directory.



Match Criteria

This step takes some thought and can be fun if you are into this stuff or enjoy listening to black box recordings of plane crashes on YouTube.  To demonstrate things, recall the fields we projected from "builder" to establish the match criteria.  All of this is done in our python implementation that declares the PySpark job, which will be revealed in its entirety down a ways.

# Pro Tip!
# These Fields are included in FHIRSQL Projections, but not specified in the Transform
fhirkey = FieldDefinition("Key", "string", MatchType.DONT_USE)
dbid = FieldDefinition("ID", "string", MatchType.DONT_USE)
# Actual Fields from the Projection
srcid = FieldDefinition("IdentifierValue", "string", MatchType.DONT_USE)
given = FieldDefinition("NameFamily", "string", MatchType.FUZZY)
family = FieldDefinition("NameGiven", "string", MatchType.FUZZY)
zip = FieldDefinition("AddressPostalCode", "string", MatchType.ONLY_ALPHABETS_FUZZY)
gender = FieldDefinition("Gender", "string", MatchType.FUZZY)

fieldDefs = [fhirkey, dbid, srcid, given, family,zip, gender]

So the fields match the attributes in our IRIS project and the MatchTypes we set for each field type.  You'll be delighted with what is available as you can immediately put them to good use with clear understanding.

Three common ones are here:

  • FUZZY: Generalized matching with strings and stuff
  • EXACT: No variations allowed, deterministic value, guards against domain conflicts sorta.
  • DONT_USE: Fields that have nothing to do with the matching, but needed in the remediation or understanding in the output.

Some other favorites of mine are here, as they seem to work on dirty data a little bit better and make sense of multiple emails.

  • EMAIL: Hacks off the domain name and the @, and uses the string.
  • TEXT: Things between two strings
  • ONLY_ALPHABETS_FUZZY: Omits integers and non-alphas where they clearly do not belong for match consideration

The full list is available here for the curious.


Create a folder to build your model... this one follows the standard in the repo, create folder `models/700`.

# Object MODEL
args = Arguments()


These values are represented in what we setup in the previous steps on "builder"

# "builder" Projected Object FIELDDEFS
InterSystemsFHIRSQL = Pipe("InterSystemsFHIRSQL", "jdbc")
InterSystemsFHIRSQL.addProperty("dbtable", "PIDTOO.Patient")
InterSystemsFHIRSQL.addProperty("driver", "com.intersystems.jdbc.IRISDriver")
# Use the same password that is on your luggage


Now this table is not a projected table by "builder", it is an empty table we created to house the results from Zingg.

# Zingg's Destination Object on IRIS
InterSystemsIRIS = Pipe("InterSystemsIRIS", "jdbc")
InterSystemsIRIS.addProperty("dbtable", "PIDTOO.PatientDups")
InterSystemsIRIS.addProperty("driver", "com.intersystems.jdbc.IRISDriver")
# Please use the same password as your luggage


If you are trying to understand the flow here, hopefully this will clarify things.

1. Zingg reads the projected data from builder (PIDTOO.Patient)
2. We do some "ML Shampoo" against the data.
3. Then we write the results back to builder (PIDTOO.PatientDups)

Thanks to @Sergei Shutov for the icons!


ML Shampoo

Now, Zingg is a supervised machine learning implementation, so you are going to have to train it up front, and at an interval to keep the model smart.  Its the "rinse and repeat" part of the analogy if you havent gotten the shampoo reference from above.

  • Find - Go get some data
  • Label - Prompt the human to help us out
  • Train - Once we have enough labelled data
  • Match - Zingg writes out the results
  • Link
bash scripts/ --properties-file config/zingg-iris.conf --run pidtoo-iris/ findTrainingData
bash scripts/ --properties-file config/zingg-iris.conf --run pidtoo-iris/ label
bash scripts/ --properties-file config/zingg-iris.conf --run pidtoo-iris/ train

For the find, you will get something a little bit like the below if things are working correctly.


Now, we train the Cylon with supervised learning, lets give it a go.


Now, do what the cylon says, and do this a lot, maybe during meetings or on the Red Line on your way or heading home from work (Get it?  Train).  You'll need enough labels for the train phase, where Zingg goes to town and works its magic finding duplicates.

bash scripts/ --properties-file config/zingg-iris.conf --run pidtoo-iris/ train

Ok, here we go, lets get our results:

bash scripts/ --properties-file config/zingg-iris.conf --run pidtoo-iris/ match

We now have some results back in the PIDTOO.PatientDups table that gets us to the point of things.  We are going to use @Dmitry Maslennikov 's sqlalchemy sorcery to connect via the notebook and inspect our results.

from sqlalchemy import create_engine
# FHIRSQL Builder Cloud Instance
engine = create_engine("iris://fhirsql:1234@")
conn = engine.connect()

query = '''
    TOP 20 z_cluster, z_maxScore, z_minScore, NameGiven, NameFamily, COUNT(*)
    COUNT(*) > 1
result = conn.exec_driver_sql(query)

It takes a little bit to interpret the results, but, here is the result of the brief training on loading the NC voters data into FHIR. ©

The output Zingg gave us is pretty great for the minimal effort I put in training things in between gas lighting.

z_cluster is the id Zingg assigns to the duplicates, I call it the "dupeid", just understand that is the identifier of the you want to query to examine the potential duplicates... Im accustomed to trusting a minScore of 0.00 and anything over 0.90 for a score for examination.

(189, 0.4677305247393828, 0.4677305247393828, 'latonya', 'beatty', 2)
(316, 0.8877195988867068, 0.7148998161578, 'wiloiam', 'adams', 5)
(321, 0.5646965557084127, 0.0, 'mar9aret', 'bridges', 3)
(326, 0.5707960437038071, 0.0, 'donnm', 'johnson', 6)
(328, 0.982044685998597, 0.40717509762282955, 'christina', 'davis', 4)
(333, 0.8879795543643093, 0.8879795543643093, 'tiffany', 'stamprr', 2)
(334, 0.808243240184001, 0.0, 'amanta', 'hall', 4)
(343, 0.6544295790716498, 0.0, 'margared', 'casey', 3)
(355, 0.7028336885619522, 0.7028336885619522, 'dammie', 'locklear', 2)
(357, 0.509141927875999, 0.509141927875999, 'albert', 'hardisfon', 2)
(362, 0.5054569794103886, 0.0, 'zarah', 'hll', 6)
(366, 0.4864567456390275, 0.4238040425261962, 'cara', 'matthews', 4)
(367, 0.5210329255531461, 0.5210329255531461, 'william', 'metcaif', 2)
(368, 0.6431091575056218, 0.6431091575056218, 'charles', 'sbarpe', 2)
(385, 0.5338624802449684, 0.0, 'marc', 'moodt', 3)
(393, 0.5640435106505274, 0.5640435106505274, 'marla', 'millrr', 2)
(403, 0.4687497402769476, 0.0, 'donsna', 'barnes', 3)
(407, 0.5801171648347092, 0.0, 'veronicc', 'collins', 35)
(410, 0.9543673811569922, 0.0, 'ann', 'mason', 7)
(414, 0.5355771790403805, 0.5355771790403805, 'serry', 'mccaray', 2)

Let's pick the "dupeid" 410 and see how we did, the results seem to think there are 7 duplicates.

 Ok, so there are the 7 records, with variable scores... Lets dial it in a little bit more and only report back a score of higher than .90.


So now, if you recall, we have the `MatchType.DONT_USE` for `Key` in our match criteria showing up in our output, but you know what?


These are the FHIR patient resource ids in the FHIR repository we have identified as duplicates and require remediation.

Discussion (0)1
Log in or sign up to continue