Hello Rodolfo, thank you. This is interesting; I'll try using a folder outside OneDrive.
Hi André, the response time really depends on many factors, such as model size, how you optimize your data, and hardware. For this project, passing only a limited number of FHIR records and converting them to YAML (instead of raw JSON) kept the prompt context small, which reduced processing time. Additionally, lightweight models (1B/2B parameters) with 4-bit quantization (the default in Ollama) require only between 0.6 GB and 1.6 GB of RAM, so they fit comfortably on a laptop with 8 GB of RAM. Beyond the amount of RAM, high memory bandwidth (e.g., DDR5) also matters, because it significantly reduces the time the CPU needs to read the model weights from memory. As a result, a small model paired with a tight context can comfortably generate 20 to 40 tokens per second on a standard CPU.
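To make the RAM and bandwidth reasoning a bit more concrete, here is a rough back-of-envelope sketch in Python. It assumes exactly 4 bits per weight and ignores the extra scale factors that real quantization formats store, as well as runtime memory such as the KV cache for the context, which is why real usage ends up somewhat above the raw weight size. The throughput figure is only an upper bound: it assumes each generated token requires reading all the weights from RAM once, and it uses theoretical peak bandwidth (transfer rate × 8 bytes per 64-bit channel × number of channels), which real systems don't sustain. The function names and the DDR4/DDR5 configurations are illustrative assumptions, not measurements from this project.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB (1 GB = 1e9 bytes).

    Ignores quantization metadata (scale factors) and runtime buffers such as the KV cache.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def peak_bandwidth_gb_s(mega_transfers_per_s: float, channels: int) -> float:
    """Theoretical peak DRAM bandwidth: MT/s x 8 bytes per 64-bit channel x channels."""
    return mega_transfers_per_s * 8 * channels / 1000


def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound for memory-bound decoding: all weights are streamed once per token."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Hypothetical memory configurations, chosen only for illustration.
    memories = {
        "DDR4-3200 dual-channel": peak_bandwidth_gb_s(3200, 2),
        "DDR5-5600 dual-channel": peak_bandwidth_gb_s(5600, 2),
    }
    for params in (1, 2):
        size = weight_size_gb(params, 4)
        print(f"{params}B model @ 4-bit: ~{size:.1f} GB of weights")
        for name, bw in memories.items():
            ceiling = max_tokens_per_second(size, bw)
            print(f"  {name} ({bw:.1f} GB/s): <= {ceiling:.0f} tokens/s")
```

Real-world throughput lands well below these ceilings (compute limits, cache effects, sustained vs. peak bandwidth), so the sketch is mainly useful for the proportionality: halving the model size or roughly doubling the memory bandwidth roughly doubles the ceiling on tokens per second.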
Hi, I'll be available for a coffee chat on June 23rd or 24th, between 18:15 and 20:00.
I'm based in Rome, Italy (currently GMT+2).
Feel free to reply if you'd like to talk about agentic AI and MCP servers.
See you 😃