Integrity Check: Speeding it Up or Slowing it Down


While the integrity of Caché and InterSystems IRIS databases is completely protected from the consequences of system failure, physical storage devices do fail in ways that corrupt the data they store.  For that reason, many sites choose to run regular database integrity checks, particularly in coordination with backups to validate that a given backup could be relied upon in a disaster.  Integrity check may also be acutely needed by the system administrator in response to a disaster involving storage corruption.  Integrity check must read every block of the globals being checked (if not already in buffers), and in an order dictated by the global structure. This takes substantial time, but integrity check is capable of reading as fast as the storage subsystem can sustain.  In some situations, it is desirable to run it in that manner to get results as quickly as possible.  In other situations, integrity check needs to be more conservative to avoid consuming too much of the storage subsystem’s bandwidth. 

Plan of Attack

The following outline covers most situations.  The detailed discussion in the remainder of this article provides the necessary information to act on any of these, or to derive other courses of action. 

  1. If using Linux and integrity check is slow, see the information below on enabling Asynchronous I/O. 
  2. If integrity check must complete as fast as possible - running in an isolated environment, or because results are needed urgently - use Multi-Process Integrity Check to check multiple globals or databases in parallel.  The number of processes times the number of concurrent asynchronous reads that each process will perform (8 by default, or 1 if using Linux with asynchronous I/O disabled) is the limit on the number of concurrent reads in flight.  Consider that the average may be half that and then compare to the capabilities of the storage subsystem.  For example, with storage striped across 20 drives and the default 8 concurrent reads per process, five or more processes may be needed to capture the full capacity of the storage subsystem (5*8/2=20).
  3. When balancing integrity check speed against its impact on production, first adjust the number of processes in the Multi-Process Integrity Check, then if needed, see the SetAsyncReadBuffers tunable.  See Isolating Integrity Check below for a longer-term solution (and for eliminating false positives).
  4. If already confined to a single process (e.g. there’s one extremely large global or other external constraints) and the speed of integrity check needs adjustment up or down, see the SetAsyncReadBuffers tunable below.
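
The sizing arithmetic in item 2 can be sketched as a few lines of ObjectScript in a routine. The variable names are illustrative, the 20-drive figure is the one from the example above, and halving the limit is the rule of thumb from the text rather than a measured value.

```objectscript
 ; Rough sizing sketch: estimate the average number of reads in flight per process count.
 set drives=20               ; drives the storage is striped across (example figure)
 set readsPerProcess=8       ; default; 1 on Linux with asynchronous I/O disabled
 for processes=1:1:6 {
     set limit=processes*readsPerProcess   ; upper bound on concurrent reads in flight
     set average=limit/2                   ; rule of thumb: about half the limit
     write processes," processes: limit ",limit,", average about ",average
     write $select(average>=drives:"  (enough to cover "_drives_" drives)",1:""),!
 }
```

With these figures, the estimate first covers the 20 drives at five processes, matching the 5*8/2=20 calculation in item 2.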

Multi-Process Integrity Check

The general solution to get an integrity check to complete faster (using system resources at a higher rate) is to divide the work among multiple parallel processes.  Some of the integrity check user interfaces and APIs do so, while others use a single process.  Assignment to processes is on a per-global basis, so checking a single global is always done by just one process (versions prior to Caché 2018.1 divided the work by database instead of by global).

The principal API for multi-process integrity check is CheckList^Integrity (see documentation for details). It collects the results in a temporary global to be displayed by Display^Integrity. The following is an example checking three databases using five processes. Omitting the database list parameter here checks all databases.

set dblist=$listbuild("/data/db1/","/data/db2/","/data/db3/")
set sc=$$CheckList^Integrity(,dblist,,,5)
do Display^Integrity()
kill ^IRIS.TempIntegrityOutput(+$job)

/* Note: evaluating ‘sc’ above isn’t needed just to display the results, but...
   $system.Status.IsOK(sc) - ran successfully and found no errors
   $system.Status.GetErrorCodes(sc)=$$$ERRORCODE($$$IntegrityCheckErrors) // 267
                           - ran successfully, but found errors.
   Else - a problem may have prevented some portion from running, ‘sc’ may have 
          multiple error codes, one of which may be $$$IntegrityCheckErrors. */
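
The three outcomes listed in the note above can be turned into a branch on 'sc'. This is a minimal sketch for use in a routine: the messages are illustrative, and the literal 267 stands in for $$$ERRORCODE($$$IntegrityCheckErrors) so that no include files are needed.

```objectscript
 ; Sketch: act on the status returned by $$CheckList^Integrity.
 if $system.Status.IsOK(sc) {
     write "Integrity check ran and found no errors",!
 } elseif $system.Status.GetErrorCodes(sc)=267 {
     ; 267 = $$$IntegrityCheckErrors: the check ran, but reported errors
     write "Integrity check ran and found errors; review the Display^Integrity output",!
 } else {
     ; Something may have prevented part of the check from running
     do $system.Status.DisplayError(sc)
 }
```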

Using CheckList^Integrity like this is the most straightforward way to achieve the level of control that is of interest to us.  The Management Portal interface and the Integrity Check Task (built-in but not scheduled) use multiple processes, but may not offer sufficient control for our purposes.*

Other integrity check interfaces, notably the terminal user interface, ^INTEGRIT or ^Integrity, as well as Silent^Integrity, perform integrity check in a single process. These interfaces, therefore, do not complete the check as fast as it's possible to achieve, and they use fewer resources.  An advantage, though, is that their results are visible, logged to a file or output to the terminal, as each global is checked, and in a well-defined order.

Asynchronous I/O

An integrity check process walks through each pointer block of a global, one at a time, validating each against the contents of the data blocks it points to.  The data blocks are read with asynchronous I/O to keep a number of read requests in flight for the storage subsystem to process, and the validation is performed as each read completes. 

On Linux only, asynchronous I/O is effective only in combination with direct I/O, which was not enabled by default until InterSystems IRIS 2020.3.  This accounts for a large number of cases where integrity check takes too long on Linux.  Fortunately, it can be enabled on Caché 2018.1, IRIS 2019.1 and later by setting wduseasyncio=1 in the [config] section of the .cpf file and restarting.  This parameter is recommended in general for I/O scalability on busy systems and has been the default on non-Linux platforms since Caché 2015.2.  Before enabling it, make sure that you’ve configured sufficient memory for database cache (global buffers), because with direct I/O the databases will no longer be (redundantly) cached by Linux.  When it is not enabled, reads done by integrity check complete synchronously and it cannot utilize the storage efficiently. 

On all platforms, the number of reads that an integrity check process will put in flight at one time is set to 8 by default.  If you must alter the rate at which a single integrity check process reads from disk, this parameter can be tuned: up to get a single process to complete faster, down to use less storage bandwidth.  Bear in mind that:

  • This parameter applies to each integrity check process.  When multiple processes are used, the number of processes multiplies this number of in-flight reads.  Changing the number of parallel integrity check processes has a much larger impact and therefore is usually the first thing to do.  Each process is also limited by computational time (among other things), so increasing the value of this parameter is limited in its benefit.
  • This only works within the storage subsystem’s capacity to process concurrent reads. Higher values have no benefit if databases are stored on a single local drive, whereas a storage array with striping across dozens of drives can process dozens of reads concurrently.

To adjust this parameter from the %SYS namespace, do SetAsyncReadBuffers^Integrity(value). To see the current value, write $$GetAsyncReadBuffers^Integrity(). The change takes effect when the next global is checked.  The setting currently does not persist through a restart of the system, though it can be added to SYSTEM^%ZSTART.
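
As a minimal sketch of the SYSTEM^%ZSTART approach, the routine below reapplies the setting at each startup. It assumes no %ZSTART routine exists yet in %SYS (if one does, add the call to its existing SYSTEM entry point instead), and the value 4 is purely illustrative, not a recommendation.

```objectscript
%ZSTART ; user-defined startup routine, saved in the %SYS namespace
 quit
SYSTEM ; entry point run at system startup
 ; Reapply the integrity check read concurrency (4 is an illustrative value).
 try {
     do SetAsyncReadBuffers^Integrity(4)
 } catch ex {
     ; Ignore failures here so that a problem in this call cannot interfere with startup.
 }
 quit
```

Wrapping the call in try/catch is a design choice for this sketch: a tuning parameter is not worth risking an error during system startup.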

There is a similar parameter to control the maximum size of each read when blocks are contiguous on disk (or nearly so).  This parameter is less often needed, though systems with high storage latency or databases with larger block sizes could possibly benefit from fine tuning.  The value has units of 64KB, so a value of 1 is 64KB, 4 is 256KB, etc.  0 (the default) lets the system select, and it currently selects 1 (64KB).  The ^Integrity functions for this parameter, parallel to those mentioned above, are SetAsyncReadBufferSize and GetAsyncReadBufferSize.
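
A short sketch of adjusting the read size follows. It assumes, as the text says, that these functions are called in the same way as SetAsyncReadBuffers and GetAsyncReadBuffers; the value 4 is illustrative only.

```objectscript
 ; Run from the %SYS namespace. The value is in units of 64KB.
 set units=4                                   ; illustrative: 4 x 64KB = 256KB per read
 write "Requested maximum read size: ",units*64,"KB",!
 do SetAsyncReadBufferSize^Integrity(units)
 write "Current setting: ",$$GetAsyncReadBufferSize^Integrity(),!
 do SetAsyncReadBufferSize^Integrity(0)        ; 0 lets the system select again
```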

Isolating Integrity Check

Many sites run regular integrity checks directly on the production system. This is certainly the simplest to configure, but it’s not ideal.  In addition to concerns about integrity check’s impact on storage bandwidth, concurrent database update activity can sometimes lead to false positive errors (despite mitigations built into the checking algorithm).  As a result, errors reported from an integrity check run on production need to be evaluated and/or rechecked by an administrator.

Often, a better option exists.  A storage snapshot or backup image can be mounted on another host, where an isolated Caché or IRIS instance runs the integrity check.  Not only does this prevent any possibility of false positives, but if the storage is also isolated from production, integrity check can be run to fully utilize the storage bandwidth and complete much more quickly.  This approach fits well into the model where integrity check is used to validate backups; a validated backup effectively validates production as of the time the backup was made.  Cloud and virtualization platforms can also make it easier to establish a usable isolated environment from a snapshot.

 


* The Management Portal interface, the Integrity Check Task and the IntegrityCheck method of SYS.Database select a rather large number of processes (equal to the number of CPU cores), lacking the control that’s needed in many situations. The Management Portal and the task also perform a complete recheck of any global that reported an error, in an effort to identify false positives that may have occurred due to concurrent updates. This recheck occurs above and beyond the false positive mitigation built into the integrity check algorithms, and it may be unwanted in some situations due to the additional time it takes (the recheck runs in a single process and checks the entire global). This behavior may be changed in the future.

Replies

What would you recommend if the amount of data that has to be checked is quite large, let's say 100TB? Of course, it's not one solid database file; it's split across many files with sizes up to about 6TB. But most of the data is live.

I think integrity check isn't the primary driver of that architectural decision, but it might be part of the consideration.  Any single database is constrained to a max size of 2^32 blocks, so 32TB for standard 8KB block size.  There are practical reasons not to go anywhere near that high: backup/restore and other operational tasks on a single database may be more onerous, AIX/JFS2 has a 16TB file limit anyway, and integrity check has less ability to be parallelized if the huge database is also primarily a single global.  (If you're running older versions, there are also a couple of bugs involving databases that have more than 2^31 blocks, all fixed in the latest maintenance kits.)
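
As a quick check of that size limit, here is the arithmetic as a small ObjectScript sketch for use in a routine. The 16KB case is just the same calculation applied to a larger block size; it ignores file-system limits such as the JFS2 one mentioned above.

```objectscript
 ; Maximum database size = 2^32 blocks x block size.
 ; ObjectScript evaluates strictly left to right, hence the parentheses around the divisor.
 for blockKB=8,16 {
     set maxTB=2**32*blockKB/(1024**3)   ; blocks x KB per block, converted from KB to TB
     write blockKB,"KB blocks: ",maxTB,"TB maximum",!
 }
```

This gives 32TB for 8KB blocks and 64TB for 16KB blocks.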

Given these and other considerations, I believe most sites shoot for max database sizes somewhere between 2 and 10 TB.   So for 100TB we're talking about a few dozen databases.   You'd hope that much data, especially if it's largely in active use, is spread over a significant number of different globals (e.g. many tables and their indices).  Ideally you use global mappings in anticipation of such huge growth to organize the globals into databases and as much as possible avoid the need to use subscript level mapping (SLM) to manage growth of a single global across multiple databases.  If growth is unbounded though (i.e. this isn't the sort of data that can eventually be moved to some separate archive structure) then subscript level mapping to map across these dozen or more databases becomes inevitable. 

As for running integrity check on that much data, it will take some substantial time and you need to find the balance of how frequently you want to run it, how much storage bandwidth is reasonable for it to consume, and whether you can run it on an offline copy.  Since the other factors I mentioned already push you toward having a multitude of separate databases (with any giant globals spread over some number of them via SLM), integrity check will be able to be well parallelized.

In that particular case, we have 16k blocks, due to the past issues with caching big string blocks over the ECP. 

But I think there are a few ways integrity check could be improved for such cases. I see at least two reasons why we should check integrity periodically.

  • We don't have any errors in the database that may cause a system failure.
  • We don't have any issues in the database, and we ensure that our data is completely available.

I faced an issue where an error at the pointer level caused problems with the Write Daemon, and our system just died when the application tried to access the data. It took some time to figure out why it had happened, even though we did not have any issues with the database at all, only with ECP. That happened in version 2012.2. I would like to be able to set how deeply to scan blocks: say, not care about data blocks and scan only pointer blocks. I don't have the proportions, but I'm sure that in most cases we would have many more data blocks than pointer blocks. So it would let integrity check give some results faster.

I know quite well how the database looks inside. But I have not yet managed to look at how database backups work, particularly incremental backups, which are the most interesting. As far as I know, backup works with blocks, so maybe there is a way to do incremental integrity checks as well. It would not help find issues that occurred in unchanged blocks due to hardware problems, but it could tell us that recently changed data is OK.

From my perspective, the main reason to run integrity check is so that if you ever did have database degradation, you know that you have a backup that you can recover from.  I've seen too many disasters of the form that corruption is discovered that predates any available backup.  For use cases that would never recover from backup or mirrored copies or the like for disaster recovery, you might reasonably argue that integrity check isn't worth the effort/cost.   

(As a detail, just accessing a corrupted global won't  hang the system, but the system will hang if corruption causes a SET or KILL to fail in the middle of a multi-block update.)

Anyway, to your good thoughts about possible enhancements:

  • It turns out that one of my recent enhancements, as yet unreleased, did open up a possibility of a "pointer block only" check (as a side effect of a different goal).  However, I don't think it's very valuable because pointer blocks make up a very small fraction of all blocks.  For typical patterns of subscripts, there are in the neighborhood of 300-500 data blocks pointed to by a pointer block in 8KB databases, so you're talking about roughly 0.2-0.3% of all the blocks.  I don't think you'd draw any meaningful conclusion from a clean check that didn't include data blocks.  Don't be confused by the fact that most integrity check errors start with "Error while processing pointer block %d".  That's just the way integrity check works.  The vast majority of those are from a bottom pointer block and were found only because it read every data block under it to find the inconsistency.
  • We do actually have some protection against errors in the most recently written blocks (following a crash) via the Write Image Journal block comparison.  It's a totally different mechanism, but it is designed with the thought that when systems lose power, there's some history of drives dropping or corrupting the most recent writes, despite promises that they had already succeeded (via our very careful use of fsync() and similar mechanisms).
  • About piggy-backing on incremental change tracking, it's an interesting idea, but again I worry that many of the failure modes that lead to corruption wouldn't necessarily get uncovered, and so it doesn't give the guarantee you need from integrity check in order to know that a backup image could be relied upon in a disaster.  

Unfortunately, for various reasons some systems cannot use the latest versions of InterSystems products, although integrity check can in some cases still be used on older versions. On the other hand, for some systems reverting to a backup is only a last resort, because of the sensitivity of the stored data and the impossibility of restoring data written since the latest backup. So if I found database degradation, I would rather attempt to repair it (fortunately, I have the experience) and possibly lose some data, because the amount lost would be significantly less than if I restored from a backup. Around 100 GB of journals per day, with tens of terabytes of data to back up, makes restoring quickly impossible for a system that has to be available with no downtime.

 ...as much as possible avoid the need to use subscript level mapping (SLM) to manage growth of a single global across multiple databases. 

Ray, may I ask you: why should we avoid it?

I said that only because managing database size with SLM can be painful operationally: having to predict where the growth is going to be and coordinate a configuration change in advance of the new mapping range getting used by the application.  I did not mean to imply that anything bad happens when you do this.  In fact, if the growth of a global isn't bounded by some natural data lifespan, or some application-level archival process, then SLM is unavoidable with a sufficient rate of growth.  By planning in advance for the growth, though, and starting the largest expected globals mapped to their own databases, you might stave that off for a long time. 

Note: there's a little runtime cost to resolving SLM that doesn't exist for (whole) global mapping, but it's generally a noise-level cost unless you've generated a very complex set of mappings (more complex than you'd likely do as a manual configuration step)