Some time back I had a quite motivating experience.
Business area: Web analysis
Subject: Build a DB of webpages and link them with their referencing pages + other attributes such as ref_count, ...
Condition: uniqueness of the pages (source + ref)
Target: Load + link/index 500 million page pairs of source:target web links
HW/OS: 64 GB RAM, 16 processors, a bunch of lousy RAID 5 disks, Red Hat
Customer tried MySQL first:
Stopped after 3 days as the load was just slow and consumed too much disk space.
Next attempt PostgreSQL:
Clearly better use of disk space, somewhat faster.
After 3 days of sequential loading it was easy to calculate that the 500 million links might take 47 years (without interruptions).
The loader was modified to work in parallel, and over time the forecast dropped below 12 years.
Now Caché came in:
No SQL approach anymore.
Each piece got its own table with reverse references:
- protocol (http, https, ftp, ...)
- toplevel domains (.com, .uk, .de, .it, ...)
- domains (intersystems, google, ...)
- servers (www, wrc, mail, ....)
- pages
- url_params [for source only]
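To illustrate the decomposition above, here is a minimal sketch in Python (the original ran on Caché, not Python) of splitting a URL into the pieces that each got their own table. The function name `split_url` is hypothetical, and multi-label TLDs (e.g. `.co.uk`) are deliberately ignored to keep the sketch short.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Hypothetical sketch: split a URL into the components that each
    got a unique table (protocol, TLD, domain, server, page, params).
    Multi-label TLDs such as .co.uk are not handled here."""
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,                      # http, https, ftp, ...
        "tld": host_labels[-1],                        # com, uk, de, it, ...
        "domain": host_labels[-2],                     # intersystems, google, ...
        "server": ".".join(host_labels[:-2]) or None,  # www, wrc, mail, ...
        "page": parts.path,
        "params": parts.query or None,                 # kept for source URLs only
    }

print(split_url("https://www.intersystems.com/products/cache?ref=home"))
```

Each distinct value of each component is stored only once, which is where most of the disk-space savings over the row-per-URL SQL approach would come from.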
The loader did the splitting of URLs and wrote directly to data + index globals.
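The core idea of writing straight to data + index globals can be sketched in Python, with nested-key dicts standing in for Caché global nodes. This is an assumption-laden illustration, not the original loader: `intern` and `load_link` are hypothetical names, and the keyed node itself enforces the uniqueness of each source:target pair (duplicates just bump `ref_count`).

```python
data = {}      # data[("link", src_id, tgt_id)] = ref_count   (data "global")
index = {}     # index[("ref", tgt_id, src_id)] = ""          (reverse "global")
tables = {}    # one forward + one reverse lookup per component kind
counters = {}

def intern(kind, value):
    """Return a stable integer id for value, creating the forward entry
    and the reverse reference on first sight."""
    fwd = tables.setdefault(kind, {})
    if value in fwd:
        return fwd[value]
    counters[kind] = counters.get(kind, 0) + 1
    fwd[value] = counters[kind]
    tables.setdefault((kind, "rev"), {})[counters[kind]] = value
    return counters[kind]

def load_link(src, tgt):
    """Insert one source->target link; the keyed node makes the pair
    unique, repeated loads only increment ref_count."""
    s, t = intern("url", src), intern("url", tgt)
    key = ("link", s, t)
    data[key] = data.get(key, 0) + 1       # ref_count attribute
    index[("ref", t, s)] = ""              # reverse reference for lookups
    return data[key]
```

In Caché the same pattern would be a direct `Set` on subscripted globals, with no SQL layer in between, which is what made the bulk load so much faster.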
To make it short: 500 million links were reached after 4 days.
About 10 days later we had to stop at 1.7 billion links as we ran out of data.
Consumed disk storage was ~30% of the size previously calculated for the competition.
Customer was deeply impressed and I felt like an eagle.
Obviously it was a lot of work to get this moving,
but the dramatic distance to the competition paid off,
as did the distinct feeling that you are riding the hottest engine available.