Feeling the power of Caché

Some time back I had a quite motivating experience.

Business area: Web analysis
Subject: Build a DB of web pages and link them with their referencing pages + other attributes such as ref_count, ...
Condition: uniqueness of the pages (source + ref)
Target: Load + link/index 500 million page pairs of source:target web links
HW/OS: 64 GB RAM, 16 processors, a bunch of lousy RAID 5 disks, Red Hat

The customer tried MySQL first:
Stopped after 3 days as the load was just slow and consumed too much disk space.

Next attempt: PostgreSQL:
Clearly better use of disk space, somewhat faster.
After 3 days of sequential loading it was easy to calculate that the 500 million links might take 47 years (without interruptions).
The loader was modified to work in parallel, and over time the forecast decreased to below 12 years.

Now Caché came in:
No SQL approach anymore.
Each piece of the URL got its own table with reverse references:
- protocol (http, https, ftp, ...)
DNS section:
  - toplevel domains (.com, .uk, .de, .it, ...)
  - domains (intersystems, google, ...)
  - servers (www, wrc, mail, ....)
---
- pages
- url_params [for source only]

The loader did the splitting of the URLs and wrote directly to the data + index globals.
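
For illustration only, here is a minimal ObjectScript sketch of that idea, not the original loader: each URL piece gets a data global (id -> value) plus a reverse-reference index global (value -> id), and each unique source:target pair is stored once together with a ref_count. All class, method, and global names (Demo.LinkLoader, GetId, UrlKey, AddLink, ^ProtoD/^ProtoI, ^LinkD/^LinkI, ^RefCount) are invented for this sketch.

```
/// Minimal sketch, assuming invented names throughout: one data global
/// plus one reverse-reference index global per URL piece, and a pair of
/// globals for the links themselves. Not the original design.
Class Demo.LinkLoader
{

/// Return the id of a value in one piece "table", creating the
/// data entry (id -> value) and the reverse reference (value -> id)
/// the first time the value is seen.
ClassMethod GetId(piece As %String, value As %String) As %Integer
{
    Set dataGlo = "^" _ piece _ "D", idxGlo = "^" _ piece _ "I"
    Set id = $Get(@idxGlo@(value))
    If id = "" {
        Set id = $Increment(@dataGlo)
        Set @dataGlo@(id) = value        // data global
        Set @idxGlo@(value) = id         // index global (reverse reference)
    }
    Quit id
}

/// Split a URL into its pieces and return a compact key of piece ids.
ClassMethod UrlKey(url As %String) As %String
{
    Set proto  = $Piece(url, "://", 1)
    Set rest   = $Piece(url, "://", 2)
    Set host   = $Piece(rest, "/", 1)
    Set path   = $Piece(rest, "/", 2, $Length(rest, "/"))
    Set page   = $Piece(path, "?", 1)
    Set params = $Piece(path, "?", 2)
    Set np     = $Length(host, ".")
    Set tld    = $Piece(host, ".", np)              // com, uk, de, ...
    Set domain = $Piece(host, ".", np - 1)          // intersystems, google, ...
    Set server = $Piece(host, ".", 1, np - 2)       // www, wrc, mail, ...
    Set key = ..GetId("Proto", proto)
    Set key = key _ "," _ ..GetId("Tld", tld)
    Set key = key _ "," _ ..GetId("Domain", domain)
    Set key = key _ "," _ ..GetId("Server", server)
    Set key = key _ "," _ ..GetId("Page", page)
    Set key = key _ "," _ ..GetId("Param", params)
    Quit key
}

/// Store one source -> target link, enforcing uniqueness of the pair
/// and maintaining a ref_count per target.
ClassMethod AddLink(source As %String, target As %String)
{
    Set src = ..UrlKey(source), tgt = ..UrlKey(target)
    If '$Data(^LinkI(src, tgt)) {
        Set id = $Increment(^LinkD)
        Set ^LinkD(id) = $ListBuild(src, tgt)       // data global
        Set ^LinkI(src, tgt) = id                   // index global
        Set refs = $Increment(^RefCount(tgt))       // distinct sources referencing the target
    }
}

}
```

The loader would then simply call AddLink(source, target) for every pair in the input; since everything is a direct global set, there is no SQL or object layer in the hot path.
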
To make it short: 500 million was reached after 4 days.
About 10 days later we had to stop at 1.7 billion links as we ran out of data.
The consumed disk storage was ~30% of the size previously calculated for the competition.

The customer was deeply impressed and I felt like an eagle.

Obviously it was a lot of work to get this moving,
but the dramatic distance to the competition paid off,
as did the distinct feeling that you are riding the hottest engine available.


Comments

Robert - thanks for posting, and it's great seeing that you're still doing awesome work in this space!

All the best,

Ben

Robert,

thank you for sharing your experience. I'm just curious:
- What was the final database size in GB?
- What tool will be used for data analysis? If a homemade one, will it be based on Caché?


Alexey,
- The final size after some design optimizations was
175 GB data globals + 216 GB index globals; separated for backup considerations (on one single drive).

- Data analysis was all done with SQL + a bunch of "homemade" SQL procedures / ClassMethods running specific subqueries.
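
As an illustration of what such a "homemade" procedure can look like, here is a minimal sketch, assuming the link data was also projected to an SQL table: a ClassMethod with the SqlProc keyword becomes callable from SQL and can run its own embedded subquery. The class, method, and table names (Demo.Analysis, RefCount, Demo.Link and its Target column) are invented for this sketch, not the original code.

```
/// A hypothetical procedure for this sketch: a ClassMethod with the
/// SqlProc keyword is projected to SQL as a callable procedure/function.
Class Demo.Analysis
{

/// Count how many links point to a given target page.
/// Demo.Link and its Target column are assumptions for this sketch only.
ClassMethod RefCount(page As %String) As %Integer [ SqlProc ]
{
    &sql(SELECT COUNT(*) INTO :refs FROM Demo.Link WHERE Target = :page)
    Quit $Select(SQLCODE = 0: refs, 1: 0)
}

}
```

Once compiled, such a method is by default projected with the name Demo.Analysis_RefCount and can be used inside ordinary SQL statements, which is roughly what is meant by ClassMethods running specific subqueries.
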