Sharding evaluation #2
As I promised in my previous post Sharding evaluation #1 I continued investigating the impact f the number of shards.
To complete the overview I have also added the instances
on WIN (Server 2012 R2) 8 cores
- Cache for Windows (x86-64) 2016.2.2 - 12 GB global buffers
- IRIS for Windows (x86-64) 2018.1.1 - 400 MB global buffers, no sharding
on LINUX (Ubuntu 16.04 LTS) 2 cores
- IRIS for UNIX (Ubuntu Server LTS for x86-64) 2018.1.1 400MB global buffers
- no shards, 2 shards, 3 shards, 4 shards.
The table I used has 26.5 mio records and is identic on all instances with identic indexing
- SELECT $LISTLENGTH($LISTFROMSTRING(LIST(<property>))) FROM ... WHERE ... %STARTSWITH <value>
this was used with 2 different values resulting in 47000 and 19000 hits.
I refer to it as simple queries S47 and S19
- SELECT $LISTLENGTH($LISTFROMSTRING(LIST(>property1>))) FROM ... WHERE ... %STARTSWITH <value1>
AND <attribute> %INLIST ( SELECT $LISTFROMSTRING(LIST(<property2>)) FROM ... WHERE. .. %STARTSWITH <value2>)
resultig in ~3000 hits.
I refer to it as complex query CX3
- SELECT $LISTLENGTH($LISTFROMSTRING(LIST(>property1>))) FROM ... As A JOIN …… As B ON ……
WHERE ... %STARTSWITH <value1> and …. . .. %STARTSWITH <value2>
resultig in ~5500 hits.
I refer to it as complex query CJ5
This selection was closest to the real application. Others were also interesting in behavior but not so relevant to reality.
As expected and required all test queries delivered the same results across the different configurations
The baseline I refer to is Cache on WIN. Results are expressed by runtime aa percentage of this baseline.
For all values an average of the last 3 runs out of at least 5 was taken to exclude buffering impacts
I think the results don't need much explanation:
- The more shards you have the faster your queries run
- The simpler the query the bigger the performance gain.
- SQL artwork could be counterproductive and might require sophisticated shard key design.
A different observation regarding initial load:
I missed measuring the load times exactly as this wasn't in focus at the beginning.
Anyhow the more shards you have to fill the longer it takes.
Out of stomach, I feel that loading of 4 shards took at least double the time of loading 2 shards
and loading 3 shards was somewhat in between. It didn't surprise me as moving that
amount of date across ECP just can be neither fast nor impressive. But as you see it pays off.
Based on this learning I see 2 options to keep the table up to date:
- direct INSERT, UPDATE, DELETE over SQL
- using DSTIME class parameter to run an update in batch depending on your application.
HTH to decide on YOUR implementation of Sharding.