
OMOP Odyssey - Vibing Synthea Modules for OMOP

Vibe the Module, Not the Data


While working with the FHIR to OMOP Service, I've seen good synthetic FHIR data created with commercial LLMs, custom tailored for a ConditionOnset, with the typical amazement on return, but I also witnessed some questionable trust in it firsthand on a call. That approach also falls short when generating gigantic payloads, which I need so I can get back to my interests on the backend and ensure a smooth data transition.

So imposter syndrome quickly surfaced after a couple-day hiatus at the 2025 OHDSI Collaborator Showcase out in New Brunswick last October. A new approach to generating data was in order for any possibility of being invited to cocktail parties with these folks, so I leaned into the work of the pros over at MITRE Corporation who brought us Synthea.

I immediately noticed that a module for the complex Sickle Cell Disease did not exist in the modules folder of the Synthea repo. I have always known I was afforded the opportunity to write one, but this task would definitely need a different brain, one that the OHDSI community seems to have in abundance but I do not.

The Vibe

Not a huge fan of this term, but for lack of another it fits the distraction for sure... so given that Synthea modules generate data based on a "ConditionOnset", let's create a Sickle Cell Disease module and generate a 1M-population FHIR Bulk Export from it.

{
  "type": "ConditionOnset",
  "target_condition": "sickle cell disease"
}

Prompt #1 - Do My Job for Me

 
Quick Disease Profile for a first-pass SCD module
 
Synthea Module Design

Prompt #2 - Sure

 
Things that May be Weak, Race Incidence and Chronic Complications

The SCD Module

LGTM! The module that was created cited sources from the CDC almost exclusively. Here it is if you want to take a look at it, also visualized with the Synthea visualization utility.

🔗 https://github.com/sween/synthea/blob/43325b191185301a668062ed0bb75a2cf1... 


Run

Let's grab the generator and some associated cheat codes, load up our module, and rip the Synthetic Bulk FHIR Export to a zip file.

git clone https://github.com/synthetichealth/synthea
cd synthea

Now, let's steal @Dmitry Zasypkin's ndjson fixer utility from his repo. It patches the references in the generated ndjson files so they can be processed.

https://raw.githubusercontent.com/dmitry-zasypkin/synthea-ndjson/refs/heads/main/patch-synthea-ndjsons.sh
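One way to pull the script down next to the generator, sketched with curl. The URL is the one above; saving it into the current directory and marking it executable are my assumptions, not something the utility's repo prescribes.

```shell
# URL of the ndjson patch utility, as linked above.
PATCH_URL="https://raw.githubusercontent.com/dmitry-zasypkin/synthea-ndjson/refs/heads/main/patch-synthea-ndjsons.sh"
# Strip everything up to the last "/" to get the local file name.
PATCH_SCRIPT="${PATCH_URL##*/}"

# Download only if we do not already have it; report (but tolerate) a failed download.
if [ ! -f "$PATCH_SCRIPT" ]; then
  curl -fsSL -o "$PATCH_SCRIPT" "$PATCH_URL" && chmod +x "$PATCH_SCRIPT" \
    || echo "could not download $PATCH_SCRIPT" >&2
fi
```

Check the script's own usage notes for its arguments before running it.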

Enable bulk FHIR in the synthea.properties file.

It also helps to export only the FHIR resources relevant to the OMOP CDM.
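A minimal sketch of those two property changes, scripted so they are repeatable. It assumes a recent Synthea checkout where `exporter.fhir.bulk_data` and `exporter.fhir.included_resources` are available in `src/main/resources/synthea.properties`. The `set_prop` helper is hypothetical, and the resource list is only an illustration of what an OMOP mapping might care about, so trim it to whatever your transformation actually consumes.

```shell
# set_prop FILE KEY VALUE: rewrite an existing "key = value" line, or append one.
# (Hypothetical helper, not part of Synthea.)
set_prop() {
  file=$1; key=$2; value=$3
  if grep -q "^${key}[[:space:]]*=" "$file"; then
    # -i.bak works with both GNU and BSD sed and leaves a backup copy.
    sed -i.bak "s|^${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
  else
    printf '%s = %s\n' "$key" "$value" >> "$file"
  fi
}

PROPS="${PROPS:-src/main/resources/synthea.properties}"
if [ -f "$PROPS" ]; then
  # Emit one ndjson file per resource type instead of one Bundle per patient.
  set_prop "$PROPS" exporter.fhir.bulk_data true
  # Illustrative subset only (assumption): adjust to your OMOP mapping.
  set_prop "$PROPS" exporter.fhir.included_resources \
    "Patient,Encounter,Condition,Observation,Procedure,MedicationRequest"
fi
```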

Then drop the generated SCD module in the modules folder.


Now run a 1M-population (-p 1000000) synthetic generation for the State of Michigan for SCD.
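Roughly, the module drop and the run look like the sketch below, executed from the root of the synthea clone. The module file name `sickle_cell_disease.json` is a placeholder for whatever the generated module was saved as. `-p` takes the population as a plain integer, and the state name is a positional argument.

```shell
POPULATION=1000000
STATE="Michigan"
MODULE_FILE="sickle_cell_disease.json"     # hypothetical file name for the generated module
MODULES_DIR="src/main/resources/modules"   # where Synthea keeps its bundled modules

# Copy the generated module in with the stock ones, if it is sitting in the current directory.
if [ -f "$MODULE_FILE" ] && [ -d "$MODULES_DIR" ]; then
  cp "$MODULE_FILE" "$MODULES_DIR/"
fi

# Kick off generation; this is the long-running part.
if [ -x ./run_synthea ]; then
  ./run_synthea -p "$POPULATION" "$STATE"
else
  echo "run_synthea not found; run this from the root of the synthea clone" >&2
fi
```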

Somewhere in all the terminal noise and CPU fans, you should see that your module was loaded, and then it is off to generate the ndjsons.

In just under an hour generation is done, and we can now run patch-synthea-ndjsons.sh across the generated data...

And zip it all up into bulk FHIR export format...
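The zip step can be sketched as below. It assumes the patched ndjson files are still in Synthea's default `output/fhir` directory and that the service accepts a flat archive of `.ndjson` files. The `make_bulk_zip` helper and the archive name are hypothetical.

```shell
# make_bulk_zip NDJSON_DIR ARCHIVE: pack every .ndjson file in NDJSON_DIR
# into ARCHIVE, dropping directory prefixes (-j) so the archive is flat.
make_bulk_zip() {
  ndjson_dir=$1; archive=$2
  zip -q -j "$archive" "$ndjson_dir"/*.ndjson
}

OUT_DIR="${OUT_DIR:-output/fhir}"   # Synthea's default FHIR export directory
if [ -d "$OUT_DIR" ]; then
  make_bulk_zip "$OUT_DIR" scd-bulk-fhir.zip   # archive name is hypothetical
  ls -lh scd-bulk-fhir.zip
fi
```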

And here is what it looks like on disk, if you are curious about the sizes.

Load

Upload the bulk FHIR payload to the S3 bucket.

Let the OMOP service do its thing...

Attestation

Although this is generally hand-waving as validation, let's just see whether SCD concepts are present in the data after transformation.

Now let's see if anybody has Sickle Cell Disease in the synthetic data.

FAQ

Did you use AI for any of this?

I used my computer.

 Is the data accurate?

It's synthetic.

 Will you get invited to any cocktail parties at the next OHDSI Symposium?

Probably not. This is an oversimplification of a complicated observational dataset, but it is not meant to be offensive.

 Any closing statements?

Just vibing this module, even with only 3 prompts, I gained even further appreciation for the complex challenges the OHDSI community solves with this observational data.
