
"What's taking so long?" - Process Sampling for Performance Analysis

When there's a performance issue, whether for all users on the system or for a single process, the shortest path to the root cause is usually to find out what the processes in question are spending their time doing.  Are they mostly using CPU to dutifully march through their algorithm (for better or worse), mostly reading database blocks from disk, or mostly waiting for something else, like LOCKs, ECP, or database block collisions?

Tools to help answer the questions above have always been available in various forms. You start with ^JOBEXAM or the Management Portal's Process view to see a process's Routine, its State, and other tidbits, refreshing frequently to get a sense of what is dominating the process's time. You might then use ^mgstat or ^GLOSTAT to measure total system throughput, or use ^LOCKTAB or ^BLKCOL to see if there are sources of LOCK conflicts or block collisions, though it's not always clear how observations at this level reflect on the processes in question.  Lower-level tools like 'iris stat' or OS-level profiling can provide more direct evidence, but involve making inferences about what's going on inside the database kernel. Debuggers and ^%SYS.MONLBL can surely answer a lot of these questions but usually aren't appropriate for use on working production systems.

I created ^PERFSAMPLE to make zeroing in on the root cause of performance issues in the wild quicker and more straightforward.  It's been available in InterSystems IRIS since version 2021.1.  PERFSAMPLE samples the state of a set of processes at high frequency, then sorts and counts the sampled data along various dimensions: the process's current routine and namespace, its state string (e.g. GSETW), whether that state indicates waiting or using CPU, the wait state within the database kernel (if any), and the PID being sampled (if multiple).  The UI then allows you to see the sorted values for each dimension and dig into them in an order of your choosing.

Using PERFSAMPLE doesn't change the behavior of the processes being sampled.  It samples information that each process always stores in shared memory, so it has no impact on their performance and is therefore safe to use on a live system.  The process running PERFSAMPLE itself does of course consume CPU - more as the sample rate or the number of processes to sample increases - but never more than a single CPU thread.
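
For concreteness, here's a minimal sketch of starting it.  Like ^JOBEXAM or ^GLOSTAT, it's an interactive utility run from the %SYS namespace; the exact prompts and menu options depend on your version, so treat the details as something to explore interactively:

 // Minimal sketch: launch the interactive utility from the %SYS namespace.
 // Its menus then let you choose what to sample (one PID, a set of
 // processes, or all), how long to sample, and the analysis dimensions.
 ZN "%SYS"
 DO ^PERFSAMPLE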

I'm hopeful that this tool might offer you a little more insight into the performance of your application and help you make the most of InterSystems IRIS.

A Simple Single-Process Example

Take a simple example of one process that is performing slowly.  We'll sample it and perhaps start by looking at which routines are seen most in the samples.  In other words, what routine is it spending the most time executing?  Is that expected for this application, or is it surprising?  Then we might look at the most common State (as ^JOBEXAM or %SYS.ProcessQuery would report).  Is it mostly doing global references (e.g. GGET or GSET), doing device IO (READ/WRITE), waiting on a lock (LOCKW), etc.?  Maybe it's mostly doing global references, so we can look at the Kernel Wait State to see whether it's mostly waiting or not, and if it is, for what: disk reads, block collisions, journal writes, another internal resource, etc.  ^PERFSAMPLE lets you aggregate these dimensions of analysis in a hierarchy you choose, like a pivot table.
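
As an aside, the same Routine and State that ^JOBEXAM shows can also be spot-checked programmatically through %SYS.ProcessQuery.  Here's a minimal sketch of that (the PID value 12345 is just a placeholder):

 // Minimal sketch: look up one process's Routine and State via SQL
 // against %SYS.ProcessQuery.  The PID value 12345 is a placeholder.
 SET stmt = ##class(%SQL.Statement).%New()
 DO stmt.%Prepare("SELECT Routine, State FROM %SYS.ProcessQuery WHERE Pid = ?")
 SET rs = stmt.%Execute("12345")
 IF rs.%Next() {
     WRITE "Routine: ", rs.Routine, !, "State: ", rs.State, !
 }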

Here's what PERFSAMPLE looks like for one process sampled while doing some application activity.  We'll look at it first in two dimensions: Using CPU? -> State.  Using CPU? says whether the sampled state indicates that the process is running - or at least could be, assuming CPU is available at the system level - as opposed to waiting for something else.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [100 %-total]
 > yes                [90.7 %-total]
   no                 [9.33 %-total] 

So this process was spending 90.7% of its time in states where we'd expect it to be using CPU (and indeed this matches the true measure of its CPU time at the operating system level).  Now, digging into that 90.7%, we find the following states.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [yes] -> Process State [90.7 %-total]
 > RUN                [67.0 %-total]
   GGET               [15.6 %-total]
   GDEF               [5.77 %-total]
   GORD               [1.82 %-total]
   LOCK               [0.509 %-total]

Here we see that some of its CPU time is spent accessing globals (getting values, $order, etc.), but it's mostly in other application logic (the general RUN state).  What about the time when it wasn't using CPU?  We go back and dig into Using CPU? [no].

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [no] -> Process State [9.33 %-total]
 > GDEF               [4.89 %-total]
   GGET               [4.42 %-total]
   GORD               [0.0226 %-total]

We see that the time when it wasn't using CPU was all in global accesses, but that doesn't tell us why, so we go back up to Using CPU? [no] and add the Kernel Wait State.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [no] -> Kernel Wait State [9.33 %-total]
 > diskio             [9.33 %-total]

Now we can see that this portion of its time was reading database blocks from disk.  

So, what's our conclusion in this simple example?  This process is spending roughly 10% of its time reading the database from disk, 20% of its time doing the rest of the work required to access globals, and 70% in other logic.  This particular example shows a reasonable mix that suggests it might be performing about as expected given the application algorithm.  If that's too slow, we'll need to understand the application code that it's running and perhaps look for opportunities for improvement or parallelization.  If, on the other hand, we had seen that this process was dominated by the diskio wait state, questions about global buffer configuration and underlying storage hardware would come to mind, along with opportunities for parallelization or $prefetchon in the application.
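
As a sketch of the parallelization option (not something PERFSAMPLE does for you), the work queue manager ($SYSTEM.WorkMgr) is one way to spread independent units of work across processes; the MyApp.Batch class and ProcessChunk method below are hypothetical placeholders for your own application code:

 // Minimal sketch of parallelizing independent work with the work queue
 // manager.  MyApp.Batch and ProcessChunk() are hypothetical placeholders
 // for your own application code.
 SET queue = $SYSTEM.WorkMgr.Initialize()
 FOR chunk = 1:1:8 {
     SET sc = queue.Queue("##class(MyApp.Batch).ProcessChunk", chunk)
 }
 SET sc = queue.WaitForComplete()  // wait for all queued chunks to finish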

In either case, the immediate next step in data collection might be to go back to ^JOBEXAM to see exactly which globals it's referencing, now better informed by the shape of its performance profile.  Or we might even decide to use ^TRACE (a new utility in 2021.2+) to follow the exact sequence of global references it makes and the lines of application code that make them.

Multiple Processes

PERFSAMPLE can sample multiple or all processes, and the PID from which each sample came is available as a dimension of analysis.  So, for example, choosing to analyze Using CPU? -> PID would show the highest CPU users, and Routine -> PID would show the top routines and then the top processes found running each of them.  Analyzing the dimensions in the opposite order, with PID first, lets you see the data for the other dimensions sorted separately for each individual process among the processes sampled.

Here's what PERFSAMPLE looks like after sampling all processes on the system under some particular application load.  I chose the option to Ignore samples where the process appears idle (READ, HANG, etc.) to filter out processes that aren't likely to be interesting.  As a result, and as highlighted below, the captured samples are only 27.4% of the total.  We'll start by looking at the Routines that were running in our samples.  At the same time, note the CPULoad metric, which simply expresses the average number of jobs that were Using CPU? [yes] across the samples; if the system had enough CPUs to schedule all these jobs and performance was fairly uniform across the sample period, this would closely match the number of CPU threads reported busy for IRIS processes at the OS level (e.g. if this system had 8 cores with hyperthreading, the OS might show about 25% utilization: ~4 CPU threads utilized on average out of 16).

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [27.4 %-total]
 > RECOVTEST          [25.0 %-total]
   JRNDMN             [2.38 %-total]
   c                  [0.0402 %-total]
   shell              [0.0280 %-total]
   %SYS.WorkQueueMgr  [0.00132 %-total]
   %SYS.Monitor.AbstractSensor.1          [0.000529 %-total]
   SYS.Database.1     [0.000529 %-total]
   %Library.ResultSet.1                   [0.000264 %-total]

When we're sampling every process on the system, the display above, expressed as percentages of all samples, isn't always the most helpful view.  The system could have a large number of largely idle processes, with the bulk of the application activity being only a small percentage of the total.  We can press 'c' to cycle the display so that the counts are shown as a percentage of the subset.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [27.4 %-total]
 > RECOVTEST          [91.1 %-subset]
   JRNDMN             [8.67 %-subset]
   c                  [0.147 %-subset]
   shell              [0.102 %-subset]
   %SYS.WorkQueueMgr  [0.00482 %-subset]
   %SYS.Monitor.AbstractSensor.1          [0.00193 %-subset]
   SYS.Database.1     [0.00193 %-subset]
   %Library.ResultSet.1                   [0.000965 %-subset]

Or, we can press 'c' again to display the counts in terms of the number of processes they represent.  It's simply the number of matching samples divided by the number of samples per process, but it's useful because, if performance was fairly uniform across the sample period, it can closely match the number of processes actually observed in that state at any one time.  Pressing 'c' can also cycle to raw counts.  Here's what the top two routines look like in those alternative displays.

Routine [27.4 %-total]
 > RECOVTEST          [9.98 jobs]
   JRNDMN             [0.951 jobs]
...
Routine [103655]
 > RECOVTEST          [94396]
   JRNDMN             [8991]
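
The arithmetic behind the 'jobs' figures is easy to check against the header above, which shows 9456 samples per job:

 // "jobs" = matching samples / samples per job (9456 in this run)
 WRITE 94396/9456,!   // -> roughly 9.98 jobs running RECOVTEST on average
 WRITE 8991/9456,!    // -> roughly 0.95 jobs, i.e. the single journal daemon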

Note that the JRNDMN data point is immediately interesting in this view.  We know there's only one journal daemon, and with the sampled job count very close to 1 (0.951), it was seen as non-idle in about 95% of the samples - remember, we told PERFSAMPLE to ignore samples that looked idle (if we hadn't, it would of course be exactly 1).  So we immediately learn that there was substantial journal activity.  While there are much more direct ways to measure journal activity if we were looking for it, it's the sort of detail that can jump out when we slice the samples in a certain way.

Now, let's focus on that RECOVTEST routine, which dominated over 90% of the non-idle samples.  In a real application, the routine names alone would be more telling and might immediately point you to an area of interest, but in my simple example the load I generated was indeed almost all from this one large routine, so we need to look further into what it's doing.  With the '>' cursor pointing at RECOVTEST, we'll press '+' and add the State dimension...

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [RECOVTEST] -> Process State [25.0 %-total]
 > GSETW              [55.1 %-subset]
   RUN                [23.1 %-subset]
   BSETW              [7.75 %-subset]
   GGETW              [6.82 %-subset]
   GSET               [3.76 %-subset]
   INCR               [1.71 %-subset]
   BSET               [0.767 %-subset]
   INCRW              [0.407 %-subset]
   GGET               [0.316 %-subset]
   LOCK               [0.279 %-subset]

In the above we see that 55% of this routine's time was spent in GSETW, which means it's doing a global SET, but the W means that it's sleeping, waiting for something (see the class reference for the State property of %SYS.ProcessQuery).  We press '+' again and add the Kernel Wait State.  Notice that we're still looking just under the samples of the RECOVTEST routine in the GSETW state; we might be interested in going back up to start a new analysis with Kernel Wait State as the top dimension, but for now we're looking only for the explanation of this one particular set of data points.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [RECOVTEST] -> Process State [GSETW] -> Kernel Wait State [13.8 %-total]
 > inusebufwt         [99.9 %-subset]
   resenqPer-BDB      [0.0577 %-subset]

The 'inusebufwt' state (see the ^PERFSAMPLE documentation) means that this process was waiting due to block collisions: the block that this process wanted to modify was momentarily being used by another process, so this one had to wait.  Either multiple processes are SETting, KILLing or fetching the same global variable name (global subscript) simultaneously, or there's a "false sharing" pattern in which different subscripts that are modified and fetched simultaneously happen to be colocated in the same block.  Returning to the top in PERFSAMPLE and starting a new analysis of Kernel Wait State -> Routine would show all routines that were found in the 'inusebufwt' state.  From there, inspection of the application code and use of ^BLKCOL or ^TRACE would identify the global references that were contending, while ^REPAIR would allow you to see which subscripts are colocated in the blocks in question.
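
To make the "false sharing" case concrete, here's a hypothetical sketch (the ^COUNTERS global is made up): several jobs each run this loop against their own subscript, so no two jobs ever touch the same node, yet because the small nodes are packed into the same database block, the concurrent SETs still collide on that block and show up as 'inusebufwt' time.

 // Hypothetical false-sharing pattern: each job updates only its own
 // subscript of ^COUNTERS (a made-up global), but the nodes are small
 // enough to be packed into the same block, so concurrent SETs from
 // different jobs still contend for that block.
 SET jobid = $JOB
 SET ^COUNTERS(jobid) = 0
 FOR i = 1:1:1000000 {
     SET ^COUNTERS(jobid) = ^COUNTERS(jobid) + 1
 }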

Sampling ECP Requests to the Data Server

PERFSAMPLE includes a special sampling mode for ECP Data Servers.  When run on an ECP Data Server with the Sample ECP Server Requests option, it samples the incoming ECP requests that the data server is currently processing, including the global or lock name and its subscripts.  This can be very helpful in understanding which application activity contributes the most to the load placed on the data server by the ECP Application Servers.  It also samples the process state of the ECP server daemon processing the request, so the State and Kernel Wait State are available just as in the examples above.
