Vibe the Module, Not the Data
While working with the FHIR to OMOP Service, I've seen good synthetic FHIR data created with commercial LLMs, custom-tailored for ConditionOnset and met with the typical amazement on return, but I also witnessed some questionable trust in it firsthand on a call. That approach also falls short when generating gigantic payloads, which I need so I can go back to my interests on the backend and ensure a smooth data transition.
Impostor syndrome quickly surfaced after a couple-day hiatus at the 2025 OHDSI Collaborator Showcase out in New Brunswick last October, so a new approach to generating data was in order if I wanted any possibility of being invited to cocktail parties with these folks. I leaned into the work of the pros over at MITRE Corporation who brought us Synthea.
I immediately noticed that a module for the complex Sickle Cell Disease did not exist in the modules folder of the Synthea repo. I have always known I was afforded the opportunity to write one, but the task would definitely need a different brain, one the OHDSI community seems to have in abundance and I do not.
The Vibe
Not a huge fan of this term, but for lack of another one it fits the distraction for sure... so given that Synthea modules generate data based on a "ConditionOnset", let's create a Sickle Cell Disease module and generate a 1M-population FHIR Bulk Export from it.
{
"type": "ConditionOnset",
"target_condition": "sickle cell disease"
}
Prompt #1 - Do My Job for Me
Quick Disease Profile for a first-pass SCD module
Very simplified, simulation-oriented view (modeling assumptions only, not medical advice)
Onset & Genetics
- Sickle Cell Disease (SCD) is inherited and present from birth.
- Approximate U.S. birth prevalence:
  - 1 in 365 Black births
  - 1 in 16,300 Hispanic births
- Sickle Cell Trait is much more common (~1 in 13 Black births) but can be treated as “not SCD” for a first-pass model.
Key Clinical Features to Simulate
- Chronic condition: Sickle Cell Disease (ICD/SNOMED).
- Recurrent vaso-occlusive pain crises → ED visits and hospitalizations.
- Increased mortality risk (with simplified mortality logic).
Treatments to Represent
- Hydroxyurea (generic RxNorm entry).
- ED/Hospital pain management (generic opioid/NSAID medications).
Modeling Stance
- Oversample SCD so the synthetic population contains enough SCD patients.
- Tune probabilities later to align with CDC prevalence.
Implement as a single Synthea GMF module with these major state blocks:
State Blocks
- Initial: Decide if the patient has SCD at birth.
- No_SCD: Immediate Terminal state for non-SCD patients.
- SCD_Diagnosis: ConditionOnset for SCD.
- Chronic_SCD: Loop containing:
  - Delay between events
  - Probabilistic trigger for a Pain Crisis Encounter
- Pain_Crisis_Encounter:
  - ED encounter
  - Pain-management medications
- Hydroxyurea_Start (optional): Begin hydroxyurea therapy after first crisis.
- SCD_Death (optional): Elevated mortality risk.
- Terminal: End of module progression (not necessarily patient death).
Prompt #2 - Sure
Things that May be Weak, Race Incidence and Chronic Complications
Prevalence (Decide_SCD distributions). Use real rates if you’re simulating a realistic population, or keep them higher for test populations.
Crisis frequency (Chronic_SCD delay range and crisis probability).
Treatment uptake (Mark_First_Crisis → Hydroxyurea_Start distributions).
Mortality: wire Chronic_SCD or Post_Crisis_Recovery to SCD_Death with age- or crisis-count-based conditions.
If you’d like, next step I can:
Add race-based incidence logic (using Synthea’s built-in demographics), or
Add a chronic complications submodule for stroke / nephropathy and wire it to this core SCD module.
The SCD Module
LGTM! The module that was created cites the CDC almost exclusively as its source. Here it is if you want to take a look; it is also visualized with the Synthea visualization utility.
🔗 https://github.com/sween/synthea/blob/43325b191185301a668062ed0bb75a2cf1...

Run
Let's grab the generator and some associated cheat codes, load up our module, and rip the synthetic Bulk FHIR Export to a zip file.
git clone https://github.com/synthetichealth/synthea
cd synthea
Now, let's steal @Dmitry Zasypkin's ndjson fixer utility from his repo. It patches the references in the generated ndjson files so they can be processed.
https://raw.githubusercontent.com/dmitry-zasypkin/synthea-ndjson/refs/heads/main/patch-synthea-ndjsons.sh
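One way to pull the script down next to the Synthea checkout, as a minimal sketch. The `fetch_patch_script` helper name is made up for illustration, and only standard `curl` flags are used. How the script is invoked afterwards is covered in Dmitry's repo, so that part is deliberately left out here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: download the ndjson patch script and make it executable.
PATCH_URL="https://raw.githubusercontent.com/dmitry-zasypkin/synthea-ndjson/refs/heads/main/patch-synthea-ndjsons.sh"

fetch_patch_script() {
  # $1 = where to save the script (defaults to the current directory)
  local dest="${1:-patch-synthea-ndjsons.sh}"
  # -f: fail on HTTP errors, -sS: quiet but still print errors, -L: follow redirects
  curl -fsSL -o "$dest" "$PATCH_URL" && chmod +x "$dest"
}

# Example, from inside the synthea folder:
#   fetch_patch_script
```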
Enable Bulk FHIR export in the synthea.properties file.
It also helps to export only the FHIR resources relevant to the OMOP CDM.
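The two settings above can be scripted instead of hand-edited. This is a rough sketch with a few assumptions: `set_prop` is a hypothetical helper, `exporter.fhir.bulk_data` is the bulk export switch in synthea.properties, and `exporter.fhir.included_resources` is assumed to take a comma-separated list of resource types (check the comments in your copy of the file). The resource list in the example is only an illustrative guess at what maps to the OMOP CDM, not the list I actually used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set "key = value" in a Java-style properties file,
# replacing the existing line for that key or appending one if it is missing.
set_prop() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v k="$key" -v v="$value" '
    BEGIN { FS = "=" }
    {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim whitespace around the key
      if (name == k) { print k " = " v; found = 1 } else { print }
    }
    END { if (!found) print k " = " v }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example, from the root of the synthea checkout (resource list is an assumption):
#   PROPS=src/main/resources/synthea.properties
#   set_prop "$PROPS" exporter.fhir.bulk_data true
#   set_prop "$PROPS" exporter.fhir.included_resources \
#     "Patient,Encounter,Condition,Observation,Procedure,MedicationRequest"
```

Commented-out lines in the file are left alone, since a line starting with `#` never matches the bare key.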
Then drop the generated SCD module in the modules folder.
Now generate a synthetic population of 1 million (the -p option) for the State of Michigan for SCD.
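The last two steps, dropping in the module and kicking off the run, look roughly like this. It is a sketch only: the helper names and the module file name are hypothetical, `src/main/resources/modules/` is where the Synthea repo keeps its modules, and `./run_synthea -p <size> <state>` is the stock launcher with the population flag. The exact command line I ran is not reproduced here.

```shell
#!/usr/bin/env bash
# Copy a module JSON into the folder Synthea loads its modules from.
# $1 = module file, $2 = root of the synthea checkout (defaults to ".")
install_module() {
  cp "$1" "${2:-.}/src/main/resources/modules/"
}

# Generate a population: $1 = population size, $2 = state name.
# RUN_SYNTHEA may point at another launcher (or a stub when testing).
generate_population() {
  "${RUN_SYNTHEA:-./run_synthea}" -p "$1" "$2"
}

# Example, from the root of the synthea checkout (file name is hypothetical):
#   install_module sickle_cell_disease.json
#   generate_population 1000000 Michigan
```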
Somewhere in all the terminal noise and CPU fans, you should see that your module was loaded, and then it is off to generate the ndjsons.
In just under an hour the generation is done, and we can now run patch-synthea-ndjsons.sh across the generated data...
And zip it all up into Bulk FHIR export format...
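A minimal sketch of the zip step, assuming the ndjson files landed in Synthea's default `output/fhir` folder and that the loader is happy with the `.ndjson` files sitting at the root of the archive (that layout is my assumption, not something the service documentation is quoted on here). The function and archive names are hypothetical.

```shell
#!/usr/bin/env bash
# Bundle every .ndjson file in a folder into one zip archive.
# $1 = folder holding the ndjson files, $2 = path of the zip to create
zip_bulk_export() {
  local ndjson_dir="$1" zip_path="$2"
  # -j drops directory paths so the files sit at the archive root; -q keeps it quiet
  zip -j -q "$zip_path" "$ndjson_dir"/*.ndjson
}

# Example, from the root of the synthea checkout:
#   zip_bulk_export output/fhir scd-michigan-bulk.zip
```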
And here is what it looks like on disk, if you are curious about the sizes.
Load
Upload the Bulk FHIR payload to the S3 bucket.
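If you would rather script the upload than click through the console, the AWS CLI's `aws s3 cp` does the job. A sketch, where the helper name, the bucket, and the prefix are all placeholders; use whatever bucket and path your OMOP deployment is configured to watch.

```shell
#!/usr/bin/env bash
# Copy the zip to S3, keeping its file name.
# $1 = local zip, $2 = bucket name, $3 = optional key prefix ending in "/"
upload_bulk_export() {
  local zip_path="$1" bucket="$2" prefix="${3:-}"
  aws s3 cp "$zip_path" "s3://${bucket}/${prefix}$(basename "$zip_path")"
}

# Example (bucket and prefix are placeholders):
#   upload_bulk_export scd-michigan-bulk.zip my-omop-bucket incoming/
```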
Let the OMOP service do its thing...
Attestation
Although this is generally hand-waving as far as validating the data goes, let's just see whether SCD concepts are present in the data after transformation.
Now let's see if anybody has Sickle Cell Disease in the synthetic data.
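For anyone who wants to repeat both checks, here is a sketch that writes two queries to a file you can run with whatever SQL client talks to your CDM database. The table and column names come from the OMOP CDM (`condition_occurrence`, `concept`); the helper name and the default schema name are hypothetical, and matching on concept name is a coarse filter that will also pick up things like sickle cell trait.

```shell
#!/usr/bin/env bash
# Write two sanity-check queries to a .sql file.
# $1 = output file, $2 = schema holding the OMOP tables (hypothetical default)
write_scd_checks() {
  local out="$1" schema="${2:-omopcdm}"
  cat > "$out" <<SQL
-- 1) Which sickle cell concepts show up after transformation?
SELECT c.concept_id, c.concept_name, COUNT(*) AS occurrences
FROM ${schema}.condition_occurrence co
JOIN ${schema}.concept c ON c.concept_id = co.condition_concept_id
WHERE LOWER(c.concept_name) LIKE '%sickle cell%'
GROUP BY c.concept_id, c.concept_name
ORDER BY occurrences DESC;

-- 2) How many distinct people carry one of those conditions?
SELECT COUNT(DISTINCT co.person_id) AS scd_persons
FROM ${schema}.condition_occurrence co
JOIN ${schema}.concept c ON c.concept_id = co.condition_concept_id
WHERE LOWER(c.concept_name) LIKE '%sickle cell%';
SQL
}

# Example:
#   write_scd_checks scd_checks.sql omopcdm
```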
FAQ
Did you use AI for any of this?
I used my computer.
Is the data accurate?
It's synthetic.
Will you get invited to any cocktail parties at the next OHDSI Symposium?
Probably not. This is an oversimplification of a complicated observational dataset, but it is not meant to be offensive.
Any closing statements?
Just from vibing this module, even with the 3 prompts, I gained an even further appreciation for the complex challenges the OHDSI community solves with this observational data.