Question · Jun 3, 2016

Read I/O Performance On >50,000 IOPS

Hello

During some consulting work at a client's site I discovered something very interesting. It seems that the usual processing cycle as written in ObjectScript has trouble utilizing an SSD-based storage machine capable of five-digit IOPS. I have read some of the articles on the subject, including the one by Tony Pepper here, which is a good reference.

Tony Pepper's article discusses "Random Read I/O Storage", which was exactly what I was testing. Looking at its "Specifications and Targets" section, I can tell immediately what the difference is between performance as perceived by that article (and the others I've read so far) and the challenges we face on a highly capable machine such as an IBM V9000 or a Hitachi Data Systems G200 (and up) configured with SSDs.

In Tony Pepper's article, the storage system is a 24-disk machine with 10K RPM disks. Such a machine can generate about 2880 host IOPS (roughly 120 IOPS per 10K RPM spindle, times 24 disks). Under these conditions Caché excels in performance, because disk seek times are considerably slower than the computer's RAM and the CPU's ability to scan through data received from the storage machine.

A typical IT process, usually written by a programmer of medium skill, benefits greatly from being able to run a $ORDER-based loop over a global while performing the usual sequence of read, process, write tasks. Caché also takes into account that write operations are more expensive than reads by batching them into write bursts via the write daemon. Compared to a typical SQL Server client cursor process, Caché eliminates the network bottleneck and latency caused by having to exchange data with the client throughout the process as it reads data packets, processes them and performs atomic writes for each record.
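To make the pattern concrete, here is a minimal sketch of such a loop (the global name ^ORDERS and the "processing" step are hypothetical, just to illustrate the read/process/write cycle):

ProcessAll ; classic $ORDER read/process/write cycle
    set id=""
    for {
        set id=$order(^ORDERS(id)) quit:id=""
        set rec=^ORDERS(id)                ; read: may trigger a disk fetch
        set $piece(rec,"^",3)="PROCESSED"  ; process: some computation on the record
        set ^ORDERS(id)=rec                ; write: buffered, flushed later by the write daemon
    }
    quit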

The big difference appears when the storage system is much more capable than 2880 IOPS. New SSD-based systems typically provide 60,000 IOPS, and I have found that to be a game changer when trying to apply the same working practice as above ($O, read, process, write). In my test I took a 160GB global and wrote a simple process that starts at a random point in that global, spans forward with a $O run, advancing by about 200 records for every 10 records read (to force a new seek outside the current block), reads the node at the current iterator and optionally writes the same data back to it. My aim was to squeeze as many IOPS as I could out of the storage controller. I ran multiple jobs of that process, experimenting with the numbers by changing the mix of read-only and read/write processes.
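Reconstructed from memory, each test job looked roughly like this (the global name, record counts and random range are illustrative):

ReadJob(write) ; one test job: random start, skip ahead to defeat the block cache
    set sub=$random(100000000)              ; start at a random point in the global
    for count=1:1 {
        set sub=$order(^BIGDATA(sub)) quit:sub=""
        set x=^BIGDATA(sub)                 ; the read we are measuring
        set:write ^BIGDATA(sub)=x           ; optional write-back of the same data
        if count#10=0 {
            ; every 10 reads, jump ~200 records forward to seek out of the current block
            for skip=1:1:200 { set sub=$order(^BIGDATA(sub)) quit:sub="" }
            quit:sub=""
        }
    }
    quit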

The outcome was quite surprising. At first I got some disappointing results, and I couldn't see any resource being used to its maximum, neither on the storage nor on the host. So I increased the number of jobs, and then had to face some challenges keeping the virtual machine responsive. I later discovered that the problem was a loss of balance caused by the write daemon waking up for its delayed writes, which requires CPU time, while the machine's resources were almost exhausted after I had pushed the load to its maximum.

I then realized that the real issue here is with reads. One interpreted ObjectScript process on the test machine doing a simple $ORDER run with a simple read (Set X=^GLOBAL(SUBSCRIPT...)) could only drive the storage machine to a few hundred IOPS. The CPU overhead involved in executing the ObjectScript code makes the reads ineffective. Each line I optimized away raised the IOPS count. The best I got on the test machine was about 8000 IOPS. On that same machine, running an Integrity Check drives about 40,000 IOPS, and I could never reach that high with ObjectScript code.
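For what it's worth, the minimized inner loop ended up shaped more or less like this classic two-command idiom (a guess at the shape from memory, not the exact code):

    set sub=$random(100000000)
    for  set sub=$order(^BIGDATA(sub)) quit:sub=""  set x=^BIGDATA(sub)

Every additional command inside that loop is interpreter work charged against each read.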

Now back to the SQL Server scenario. In SQL Server, what the client usually does is ask for a result set. By asking for a result set it hints to the server what the full data criteria are going to be, and the server can then run its most efficient code to obtain the data in read-ahead chunks. The practice is quite efficient because the client process really does intend to read through all that data; this is not just a statistical mechanism assuming the client will want the next/adjacent block of data. The client-server protocol is also designed to transfer chunks of data, reducing network overhead. Writes can likewise be aggregated using server-specific techniques (certain kinds of transactions, delayed writes).

What I'm looking for is a similar mechanism where I can hint to the server what result set is required and expect a CPU-efficient process to manage the task. ObjectScript will perform differently depending on how strong the CPU is, but I estimate that even what is considered a high-end server today is going to be slow against 20,000+ IOPS.

This worries me when I think about the many programmers I know who maintain code migrated from [IMD]SM and still code in ObjectScript. On heavy-load processes, their programs usually do nested $ORDER loops performing random reads, processing and writing. For them, a storage upgrade used to be salvation. Now it seems these processes face a limitation that's going to be hard to defeat: random reads against an efficient storage controller require a fair amount of CPU time, on top of the CPU time needed for the processing the program wanted to do to the data in the first place. So we might be seeing some strange Caché servers consuming 100% CPU where the reason is not inefficient code, but the exact opposite: very efficient code with a minimum of lines between reads.

Of course, there could be a different direction I have not thought of, or I could be running the tests wrong, and I will appreciate any input.

One final note: Caché is super fast. Those SSD/60,000 IOPS high-end machines are at this point cutting-edge technology for very specific customers. What concerns me is how InterSystems is going to handle that kind of hardware if/when it becomes a commodity. I'm quite sure Caché will keep being super fast when that future becomes the present, just as it handles machines with 24+ CPUs today when it used to have issues with them in the past.

Discussion (6)

I did revert to doing read-only tests after understanding the issues I was having with the write daemon.

I'm not sure how global prefetching is going to help, because I'm trying to be as random as I can. I'm intentionally trying to defeat the global buffers and Caché's internal optimizations in order to utilize the storage as much as possible.

My tests were against a 1TB database, reading from a 160GB global. The test machine had 16GB of RAM.

At a fundamental level the worry that you attribute to ObjectScript is not really particular to ObjectScript or any other language, but rather an issue of parallel vs serial processing.  The fundamental issue you're raising here is that when programming at the level of individual database accesses ($order or random gets or whatever) one process is in a loop doing a single database operation, performing some (perhaps minimal) computation, and then doing another database operation.  Some of those database operations may require a disk read, but, especially in the $order case, many will not because the block is already cached.  When it does need to do a disk read, the process is not doing computation because, well, this is all serial; the next computation depends on what will be read.  Imagine the CPU portion of the loop could be magically minimized to zero; even then this process could only keep a single disk busy at a time.  However, the disk array you're talking about achieves 50,000 IOPS not from a single disk, but from multiple disks under some theoretical workload that would utilize them all simultaneously. 

Integrity check and the write daemons are able to drive more IOPS because they use multiple processes and/or asynchronous I/O to put multiple I/Os in flight simultaneously.

Where language, programming skill, and ObjectScript come into play is in how readily a program that wishes to put multiple I/Os in flight can do so. ObjectScript enables this primarily by giving you controls to start multiple jobs (with the JOB command) and good mechanisms for those jobs to cooperate. For a single process, ObjectScript provides $prefetchon to tell the Caché kernel to do disk prefetching asynchronously on behalf of that process, but that is restricted to sequential-access-type workloads.
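As a rough sketch of the JOB approach (the entry points, global name and partitioning scheme are made up for illustration, and assume numeric subscripts):

RunParallel(jobs) ; launch several cooperating reader processes
    for n=1:1:jobs job ReadSlice^MYTEST(n,jobs)
    quit

ReadSlice(n,jobs) ; each job scans only its own slice of the global
    set sub=""
    for {
        set sub=$order(^BIGDATA(sub)) quit:sub=""
        continue:(sub#jobs)'=(n-1)   ; crude modulo partitioning of subscripts
        set x=^BIGDATA(sub)          ; reads from different jobs now overlap in flight
    }
    quit

With N jobs there can be N disk reads in flight at once, which is what a 50,000 IOPS array needs in order to be kept busy.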

Programming constructs that work at a higher level of abstraction (higher than a single global access) may do some parallelization for you. Caché has these sorts of things in many different contexts; %PARALLEL in SQL and the work queue manager come to mind. (In SQL Server you are already programming at this higher level of abstraction, so it's not surprising that parallelization can happen without the programmer needing to be aware of it. Under the covers, though, this is undoubtedly implemented with the sorts of programming constructs I've described: multiple threads of execution and/or asynchronous I/O.)
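As an illustration of the work queue manager route, a scan might be split into subscript ranges along these lines (the class and method names are hypothetical; see the %SYSTEM.WorkMgr documentation for the exact API):

    set queue=$system.WorkMgr.Initialize(,.sc)
    for start=0:1000000:99000000 {
        ; each work unit scans one subscript range of ^BIGDATA in a worker process
        set sc=queue.Queue("##class(MyApp.Scanner).ScanRange",start,start+999999)
    }
    set sc=queue.WaitForComplete()  ; returns when all chunks are done; their I/O overlaps

The same idea applies to SQL: adding %PARALLEL to a SELECT asks the optimizer to split the query across processes for you.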

Of course, how readily a task can be adapted to parallelization is highly specific to what the task is doing, and that is application-specific. So there are tasks for which this disk array has far more capability than an application may use at a given user load. However, even an application in which no single task would ever utilize anywhere near this much disk capability may, when scaled up to tens of thousands of users, indeed want a disk array like this and make use of it quite naturally: not by virtue of the program being written to parallelize an individual task, but by having thousands of individual tasks running in parallel.

 

Well, my test was parallelized, and I have to say I did mention it:

In my test I took a 160GB global and wrote a simple process that starts at a random point in that global, spans forward with a $O run, advancing by about 200 records for every 10 records read (to force a new seek outside the current block), reads the node at the current iterator and optionally writes the same data back to it. My aim was to squeeze as many IOPS as I could out of the storage controller. I ran multiple jobs of that process, experimenting with the numbers by changing the mix of read-only and read/write processes.

The machine had 12 CPUs. I achieved the best performance (and the highest IOPS from the storage controller) with twice as many jobs as the machine had CPUs: 24 jobs. The controller reported roughly 8,000 IOPS while the virtual machine's CPU power was exhausted. That 8,000 IOPS figure was only reached after optimizing the loop down to as few lines as I could (I think the exact number was 6).

Only then did I realize that I can't really squeeze any more out of the storage controller with the given server (a 3-year-old IBM x3550 M4) using ObjectScript code. Multi-threaded C/C++ code will do that, as you also agreed with the Integrity Check example.

So the case is really a program that has to run over a huge global, much bigger than the server's global buffers, against a storage controller with very low latency and a high seek rate (because it uses SSDs). The problem as I see it is that even if one does split the workload into several processes, it would still be interpreted ObjectScript code executing it. It would be the same with SQL, because all queries are compiled into MAC routines (which are interpreted code) that execute the query.

This sounds very interesting.

I could not give any data-proven conclusion without looking into sar or mgstat data, but from your words it sounds like the bottleneck here is the ObjectScript VM, or the engine's inter-process locking implementation. This is hard to believe, taking into account that we are talking about an "I/O bound" experiment, but if you can show us sar metrics...

You might have noticed my reference to these tests in the past tense, as a thing that already happened.

I ran these tests during a POC, so I am not sure we still have the storage controllers to test against. I already spoke to the client; they will be deciding on the storage system to purchase quite soon.

This type of equipment costs quite a lot and we don't see many of these deployed at customers. The client is willing to postpone their deployment for a day or two. I can use that time to collect the data you require, or to hold a remote session where you or someone else can see it first hand. I think a remote session is better than exchanging data, because I have limited resources on this issue and only a day or two to exchange metrics.