Introducing iris-synthetic-data-gen
Today I published a new OpenExchange package for generating synthetic data directly in IRIS.
Finding decent datasets for a demo app can be a frustrating process. Maybe the dataset itself doesn't matter that much, but you still want it to look somewhat genuine, with several linked tables that are usable directly within IRIS via the neat implicit -> joins. Or maybe you just want linked tables, easily installable with IPM, for benchmarking queries; this dataset generator would be perfect for that.
I have opted to create the datasets using Embedded Python, and each dataset is configurable via custom config files. A dataset is generated directly with a single IRIS class method, and can be scaled with a multiplier to create datasets as small or as large as you want without having to hand-tune the configs.
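To make the multiplier idea concrete, here is a minimal sketch in plain Python of how a single scaling factor can resize every table's base row count from a config. The table names, counts, and function are illustrative assumptions, not the package's actual config format or API:

```python
# Hypothetical base config: rows per table at multiplier = 1.
# These names and numbers are illustrative only.
BASE_CONFIG = {
    "stores": 10,
    "products": 500,
    "users": 2000,
}

def scaled_counts(config, multiplier):
    """Scale every base row count, keeping at least one row per table."""
    return {table: max(1, round(count * multiplier))
            for table, count in config.items()}

print(scaled_counts(BASE_CONFIG, 0.1))  # small demo-sized dataset
print(scaled_counts(BASE_CONFIG, 25))   # larger dataset for benchmarking
```

The point is that one number scales the whole linked schema together, so related tables stay proportioned sensibly.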
At the moment I have four datasets:
- Financial services (bank cards, accounts, transactions)
- Retail (stores, products, users, inventory)
- Supply chain (products, sales orders, inventory movement)
- Theme park management (parks, zones, rides, incidents)
I am not an expert in any of these domains, so I doubt they are super accurate. The data generation uses Python libraries like faker, plus statistically weighted generation with numpy, so it all feels a bit synthetic.
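For anyone curious what "statistically weighted generation" looks like in practice, here is a stdlib-only sketch of the idea. The package itself uses faker and numpy; the field names, categories, and weights below are made up for illustration:

```python
import random

# Illustrative categories and weights -- not the package's real config.
# Drawing values according to weights (rather than uniformly) is what
# makes the synthetic data look less flat: most rows are purchases,
# refunds are rare, amounts follow a skewed distribution.
TRANSACTION_TYPES = ["purchase", "refund", "transfer", "fee"]
WEIGHTS = [0.80, 0.05, 0.10, 0.05]

def fake_transactions(n, seed=42):
    rng = random.Random(seed)  # fixed seed -> reproducible dataset
    return [
        {
            "type": rng.choices(TRANSACTION_TYPES, weights=WEIGHTS)[0],
            "amount": round(rng.lognormvariate(3.0, 1.0), 2),  # skewed amounts
        }
        for _ in range(n)
    ]

txns = fake_transactions(1000)
print(sum(t["type"] == "purchase" for t in txns))  # roughly 800 of 1000
```

faker fills in the realistic-looking names, addresses, and card numbers on top of this kind of weighted skeleton.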
I will also be honest that, as a side-of-desk project I couldn't give a huge amount of time to, this was only made possible by AI. I used AI extensively for the design of the datasets and the generation of the code that creates them. I supervised, tested my personal use cases, and was very involved in the project design, but the code is all AI-generated and I have not carefully reviewed the dataset generation process.
For me, this project is a great use case for full "vibe coding", i.e. letting the agent handle the entire coding process. The consequences of bugs are low, as these datasets are not designed for any production use, and the code can largely be judged on the results it outputs, in the knowledge that the details and edge cases don't matter much.
It's also a good template for making new datasets. The first dataset took me a couple of hours of careful planning, discussion with agents, and iterating on how best to create it and add it to IRIS. For the last dataset, I could simply ask the agent to "Create a new dataset with retail tables that is configured and generated like the others here", and it did a pretty good job without any real oversight.
I hope this can be useful to some. Feel free to give feedback, contribute, or use it as a template to make your own synthetic datasets!