Feeling the power of Caché
Some time back I had a quite motivating experience.
Business area: Web analysis
Subject: Build a DB of web pages and link them with their referencing pages + other attributes such as ref_count, ...
Condition: uniqueness of the pages (source + ref)
Target: Load + link/index 500 million source:target page pairs of web links
HW/OS: 64 GB RAM, 16 processors, a bunch of lousy RAID 5 disks, Red Hat
Customer tried MySQL first:
Stopped after 3 days, as the load was just slow and consumed too much disk space.
The next attempt was PostgreSQL:
Clearly better use of disk space, somewhat faster.
After 3 days of sequential loading it was easy to calculate that the 500 million links might take 47 years (without interruption).
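(Simple extrapolation: the links loaded in 3 days give the rate, and 500,000,000 divided by that rate gives the projected time; 47 years works out to roughly 500,000,000 / (47 × 365) ≈ 29,000 links per day.)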
The loader was modified to work in parallel, and over time the forecast dropped below 12 years.
Now Caché came in:
No SQL approach anymore.
Each piece of a URL got its own table with reverse references:
- protocol (http, https, ftp, ...)
DNS section:
- toplevel domains (.com, .uk, .de, .it, ...)
- domains (intersystems, google, ...)
- servers (www, wrc, mail, ....)
---
- pages ()
- url_params() [for source only]
The loader did the splitting of the URLs and wrote directly to the data + index globals.
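Just to give an idea (a minimal sketch with hypothetical class, global, and node names, not the real production code), the core of the loader boiled down to something like this in ObjectScript:

    Class Demo.LinkLoader [ Abstract ]
    {

    /// Store one source:target pair; every URL piece gets its own data + index global.
    ClassMethod LoadLink(source As %String, target As %String)
    {
        set srcId = ..UrlId(source)
        set tgtId = ..UrlId(target)
        // data global: one node per unique source:target pair
        if '$data(^LinkD(srcId,tgtId)) {
            set ^LinkD(srcId,tgtId) = ""
            // index global: reverse reference target -> source, plus a ref_count
            set ^LinkI("ref",tgtId,srcId) = ""
            set count = $increment(^LinkI("cnt",tgtId))
        }
        quit
    }

    /// Split a URL into protocol / host / page and return a composite id.
    ClassMethod UrlId(url As %String) As %String
    {
        set proto = $piece(url,"://",1)
        set rest  = $piece(url,"://",2)
        set host  = $piece(rest,"/",1)
        set page  = $piece(rest,"/",2,$length(rest,"/"))
        quit ..PieceId("proto",proto)_":"_..PieceId("host",host)_":"_..PieceId("page",page)
    }

    /// Uniqueness per piece: reuse the existing id or allocate a new one.
    ClassMethod PieceId(type As %String, value As %String) As %Integer
    {
        set id = $get(^PieceI(type,value))       // index global: value -> id
        if id'="" quit id
        set id = $increment(^PieceD(type))       // counter node per piece type
        set ^PieceD(type,id) = value             // data global: id -> value
        set ^PieceI(type,value) = id
        quit id
    }

    }

A parallel set of loader jobs then just calls do ##class(Demo.LinkLoader).LoadLink(src,tgt) per pair; in reality the host was split further into server / domain / top-level domain, and the source side also carried the url_params.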
To make it short: 500 million was reached after 4 days.
About 10 days later we had to stop at 1.7 billion links, as we ran out of data.
Consumed disk storage: ~30% of the size previously calculated for the competition.
The customer was deeply impressed, and I felt like an eagle.
Obviously it was a lot of work to get this moving,
but the dramatic distance to the competition paid off,
as did the distinct feeling that you are riding the hottest engine available.
Robert - thanks for posting, and it's great seeing that you're still doing awesome work in this space!
All the best,
Ben
Robert,
thank you for sharing your experience. I'm just curious:
- What was the final database size in GB?
- What tool will be used for data analysis? If a homemade one, will it be based on Caché?
Alexey,
- The final size after some design optimizations was
175 GB DataGlobals + 216 GB IndexGlobals; separated for backup considerations (on 1 single drive).
- Data analysis was all done with SQL + a bunch of "homemade" SQL procedures / ClassMethods running specific subqueries.
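For the curious, such a "homemade" procedure is just a class method projected to SQL. A trivial, purely illustrative example (hypothetical names again, walking the reverse-reference index global from the loader sketch above):

    Class Demo.LinkReport [ Abstract ]
    {

    /// Count the distinct sources referencing a given target id via the reverse index.
    ClassMethod RefCount(tgtId As %String) As %Integer [ SqlProc ]
    {
        set cnt = 0, srcId = ""
        for {
            set srcId = $order(^LinkI("ref",tgtId,srcId))
            quit:srcId=""
            set cnt = cnt + 1
        }
        quit cnt
    }

    }

Thanks to the [ SqlProc ] keyword it can be used directly inside SQL, e.g. as Demo.LinkReport_RefCount(TargetId) in a SELECT list or a subquery.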