Sharding evaluation #1

IRIS brought us a new  WOW feature - SHARDING !
Definitely a great thing!
But how can I find out if it suits my actual applications?
Is there a practical advantage to go for it with my well cooked transactional application?
Or is it just for new still to be designed applications?

I did and still do a series of investigations that I like to share with you.

First:
It means to move from an (almost) single server environment to an environment with multiple narrow bound and synchronized servers.
This means more and especially more qualified control over your instances.
Not necessarily a big deal but definitely something to consider by resource planning.
And of course also for all kind of availability management starting with backup up to failover with mirrors.

Second:
Sharding is only useful if your target tables have a significant size to benefit from parallel processing to compensate for the overhead in cross-server communications over ECP.
^GSIZE tells you where your research should start.
Next, you should find out if your typical result sets are large enough to visualize improvements.
Search for a few records is rather an exercise to apply a suitable index. Of course, you can do it with Sharding as well.
It just could be hard to identify and present the gain to your management.

Based on this ideas a table with 26.5 mio records was selected that typically had resultsets of 1000  to 10000 hits.
With this table, a set of significant queries was selected to compare the performance on a single server against a Sharded setup with 4 Sharding servers around a master.

Differently, to the production environment running on Windows, I decided to use for the 5 test servers Ubuntu for its smaller footprint.
As no isolated LAN for the ECP connections was available I ran most tests during out of business hours to avoid interference with other network traffic.

Surprise #1:
Moving from Win ( 12 GB buffer 8 cores) to Linux (400MB buffer 2 cores)  resulted in a reduction of runtime to 50% and even below even as this was a significantly smaller box. [ CORRECTION 75% see comment ]
I expected Linux to be faster. But the dimension I saw was incredible (and thus verified it several times).
And not being an expert on Linux at all this was just a simple server instance out of the box.

Next phase was to distribute data on 4 shards. I decided to use the default shard key.
Loading took some time but finished overnight.

Surprise #2:
Sharding was clear faster than the single server. (as expected)
Query run time for simple queries went down to 45 ... 80% from the single server.
Less attractive: a complex join took on Sharding at least double the time as on the single server. (the surprise)

So I went back to documentation Choose a Shard Key +  Evaluate Unique Constraints  
My lazy join worked exactly on a UNIQUE Constraint and was unrelated to Shard key !!

So I rebuild my Sharded table again using the main part of the UNIQUE constraint as Shard key.

Surprise #3:
- Load time reduced to ~60%
- Simple queries went down to 15...25% of the single server
- The cumbersome complex join now finished with ~ 50% runtime of the single server.

My personal learning:
- There are situations where reading AND understanding documentation first creates faster success.
- Defaults are not always the best solutions.

Summary:

  • be aware of the effort to maintain additional servers
  • select carefully if your table fits Sharding in size and structure
  • think twice if you select your Shard key 
  • run serious tests to align expectations with reality

My next steps will be evaluating the impact of the number of shards.
I started with the maximum available. Next, I will check to see how it works with 2 and 3 shards in use.

I'll keep you posted once I have collected meaningful figures.

Sharding evaluation #2
 


 

Comments

Great post, Robert! Thank you!

One question:

Moving from Win ( 12 GB buffer 8 cores) to Linux (400MB buffer 2 cores)  resulted in a reduction of runtime to 50% and even below even as this was a significantly smaller box.

Was this 50% reduction because of Sharding or just because of Windows to Ubuntu transition?

Hi Evegeny,

It was an unexpected by-product.
Simply WIN-> Ubuntu. 

It happened at the dry run of single server instance to get a bottom line.. Sharding results refer to this bottom line on Ubuntu.
I first thought I had missed some data. But they were complete.
The only difference is that in WIN there were much more other tables/globals. But no other other users than me active.

CORRECTION: 

The large WIN production config runs Cache for Windows (x86-64) 2016.2.2 

On a parallel version of IRIS with the same 400MB buffers, the figures were much better in favor of  WIN.
But there is still a 25% gap left compared to Ubuntu.

Hi Robert,

thanks for sharing your experiences. Your experiment indeed illustrates the importance of thinking schema design through in order to be able to benefit from fully cosharded JOINs. In that respect, you'll certainly appreciate what we have in stock for 2019.1, where we'll be introducing support for cosharding for any table, so also the ones with the default system-assigned shard key. 

And to be honest, seeing a 75% drop in performance on such a big table when going from 12GB to 400MB buffers doesn't seem like such a bad deal. I'd be curious to see what it would look like if you decrease the global buffers on that windows box to 400MB as well and compare purely on OS/release . We've recently seen a real customer's benchmark go 35% faster upgrading from 2017.2 to a pilot version of 2019.1, without any change to the code or schema (so still nonsharded). So a lot to look forward to in the new year :-) 

Thanks,
benjamin

Thank you, Ben.

It's great to have this vision of future performance. This makes the testing even more worth the effort.
The move to IRIS requires more work than an average upgrade. So the vision to take  a faster "train"
is a strong motivator.  And a strong argument to release the required investments and to see fast returns.

Hi Robert,

We have a partner will also load data into IRIS. I wonder in this test, are you using SQL to insert data into sharded table or using spark to load data? Thank you.

I used SQL INSERT with SELECT on IRIS.
I had a copy of CACHE.DAT from Caché 16.2   
renamed and mounted it on IRIS
mapped the relevant globals to use it as source table