Article
· Sep 2, 2020 7m read

Integrity Check: Speeding it Up or Slowing it Down

While the integrity of Caché and InterSystems IRIS databases is completely protected from the consequences of system failure, physical storage devices do fail in ways that corrupt the data they store.  For that reason, many sites choose to run regular database integrity checks, particularly in coordination with backups to validate that a given backup could be relied upon in a disaster.  Integrity check may also be acutely needed by the system administrator in response to a disaster involving storage corruption.  Integrity check must read every block of the globals being checked (if not already in buffers), and in an order dictated by the global structure. This takes substantial time, but integrity check is capable of reading as fast as the storage subsystem can sustain.  In some situations, it is desirable to run it in that manner to get results as quickly as possible.  In other situations, integrity check needs to be more conservative to avoid consuming too much of the storage subsystem’s bandwidth. 

Plan of Attack

The following outline covers most situations.  The detailed discussion in the remainder of this article provides the information needed to act on any of these steps, or to derive other courses of action.

  1. If using Linux and integrity check is slow, see the information below on enabling Asynchronous I/O. 
  2. If integrity check must complete as fast as possible - running in an isolated environment, or because results are needed urgently - use Multi-Process Integrity Check to check multiple globals or databases in parallel.  The number of processes times the number of concurrent asynchronous reads that each process will perform (8 by default, or 1 if using Linux with asynchronous I/O disabled) is the limit on the number of concurrent reads in flight.  Consider that the average may be half that and then compare to the capabilities of the storage subsystem.  For example, with storage striped across 20 drives and the default 8 concurrent reads per process, five or more processes may be needed to capture the full capacity of the storage subsystem (5*8/2=20).
  3. When balancing integrity check speed against its impact on production, first adjust the number of processes in the Multi-Process Integrity Check, then if needed, see the SetAsyncReadBuffers tunable.  See Isolating Integrity Check below for a longer-term solution (and for eliminating false positives).
  4. If already confined to a single process (e.g. there’s one extremely large global or other external constraints) and the speed of integrity check needs adjustment up or down, see the SetAsyncReadBuffers tunable below.
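The estimate in step 2 can be sketched as a small calculation. This is a back-of-the-envelope sketch in Python, not part of any integrity check API: the function name `processes_needed` and its parameters are hypothetical, and it assumes, as suggested above, that the average number of reads in flight is about half of the per-process limit.

```python
import math

def processes_needed(drives, reads_per_process=8):
    """Estimate integrity check processes needed to keep `drives` busy.

    Assumes the average number of reads in flight per process is about
    half of its limit (the rule of thumb from step 2), so each process
    contributes reads_per_process / 2 concurrent reads on average.
    """
    avg_reads_per_process = reads_per_process / 2
    return math.ceil(drives / avg_reads_per_process)

# 20 striped drives with the default of 8 reads per process: 5 * 8 / 2 = 20
print(processes_needed(20))
```

The result is only a starting point; the actual capability of the storage subsystem and the load from production determine the right number.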

Multi-Process Integrity Check

The general solution to get an integrity check to complete faster (using system resources at a higher rate) is to divide the work among multiple parallel processes.  Some of the integrity check user interfaces and APIs do so, while others use a single process.  Assignment to processes is on a per-global basis, so checking a single global is always done by just one process (versions prior to Caché 2018.1 divided the work by database instead of by global).

The principal API for multi-process integrity check is CheckList^Integrity (see documentation for details). It collects the results in a temporary global to be displayed by Display^Integrity. The following is an example checking three databases using five processes. Omitting the database list parameter here checks all databases.

set dblist=$listbuild("/data/db1/","/data/db2/","/data/db3/")
set sc=$$CheckList^Integrity(,dblist,,,5)
do Display^Integrity()
kill ^IRIS.TempIntegrityOutput(+$job)

/* Note: evaluating 'sc' above isn't needed just to display the results, but...
   $system.Status.IsOK(sc) - ran successfully and found no errors
   $system.Status.GetErrorCodes(sc)=$$$ERRORCODE($$$IntegrityCheckErrors) // 267
                           - ran successfully, but found errors.
   Else - a problem may have prevented some portion from running, 'sc' may have
          multiple error codes, one of which may be $$$IntegrityCheckErrors. */

Using CheckList^Integrity like this is the most straightforward way to achieve the level of control that is of interest to us.  The Management Portal interface and the Integrity Check Task (built-in but not scheduled) use multiple processes, but may not offer sufficient control for our purposes.*

Other integrity check interfaces, notably the terminal user interface, ^INTEGRIT or ^Integrity, as well as Silent^Integrity, perform integrity check in a single process. These interfaces, therefore, do not complete the check as fast as it's possible to achieve, and they use fewer resources.  An advantage, though, is that their results are visible, logged to a file or output to the terminal, as each global is checked, and in a well-defined order.

Asynchronous I/O

An integrity check process walks through each pointer block of a global, one at a time, validating each against the contents of the data blocks it points to.  The data blocks are read with asynchronous I/O to keep a number of read requests in flight for the storage subsystem to process, and the validation is performed as each read completes. 

On Linux only, async I/O is effective only in combination with direct I/O, which is not enabled by default until InterSystems IRIS 2020.3.  This accounts for a large number of cases where integrity check takes too long on Linux.  Fortunately, it can be enabled on Caché 2018.1, IRIS 2019.1 and later, by setting wduseasyncio=1 in the [config] section of the .cpf file and restarting.  This parameter is recommended in general for I/O scalability on busy systems and is the default on non-Linux platforms since Caché 2015.2.  Before enabling it, make sure that you've configured sufficient memory for database cache (global buffers) because with direct I/O, the databases will no longer be (redundantly) cached by Linux.  When it is not enabled, reads done by integrity check complete synchronously and it cannot utilize the storage efficiently.
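As a sketch of the change described above, the relevant part of the .cpf file would look roughly like this. Only the `wduseasyncio` line comes from this article; any other settings already present in the `[config]` section are omitted here and should be left as they are.

```
[config]
wduseasyncio=1
```

The setting takes effect after the instance is restarted.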

On all platforms, the number of reads that an integrity check process will put in flight at one time is set to 8 by default.  If you must alter the rate at which a single integrity check process reads from disk, this parameter can be tuned: up to get a single process to complete faster, down to use less storage bandwidth.  Bear in mind that:

  • This parameter applies to each integrity check process.  When multiple processes are used, the number of processes multiplies this number of in-flight reads.  Changing the number of parallel integrity check processes has a much larger impact and therefore is usually the first thing to do.  Each process is also limited by computational time (among other things), so increasing the value of this parameter is limited in its benefit.
  • This only works within the storage subsystem’s capacity to process concurrent reads. Higher values have no benefit if databases are stored on a single local drive, whereas a storage array with striping across dozens of drives can process dozens of reads concurrently.

To adjust this parameter from the %SYS namespace, do SetAsyncReadBuffers^Integrity(value). To see the current value, write $$GetAsyncReadBuffers^Integrity(). The change takes effect when the next global is checked.  The setting currently does not persist through a restart of the system, though it can be added to SYSTEM^%ZSTART.

There is a similar parameter to control the maximum size of each read when blocks are contiguous on disk (or nearly so).  This parameter is less often needed, though systems with high storage latency or databases with larger block sizes could possibly benefit from fine tuning.  The value has units of 64KB, so a value of 1 is 64KB, 4 is 256KB, etc.  0 (the default) lets the system select, and it currently selects 1 (64KB).  The ^Integrity functions for this parameter, parallel to those mentioned above, are SetAsyncReadBufferSize and GetAsyncReadBufferSize.
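The unit conversion above can be written out as a short calculation. This is an illustrative Python sketch only; the helper name `max_read_size_kb` is hypothetical and is not one of the ^Integrity functions, and the handling of 0 reflects what the system currently selects according to the text.

```python
UNIT_KB = 64  # one unit of the setting is 64KB, per the text above

def max_read_size_kb(setting):
    """Translate the buffer-size setting into a maximum read size in KB.

    A setting of 0 means the system selects the size; the article states
    that it currently selects 1 unit (64KB), which may change in future.
    """
    units = setting if setting > 0 else 1
    return units * UNIT_KB

for setting in (0, 1, 4):
    print(setting, max_read_size_kb(setting))
```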

Isolating Integrity Check

Many sites run regular integrity checks directly on the production system. This is certainly the simplest to configure, but it’s not ideal.  In addition to concerns about integrity check’s impact on storage bandwidth, concurrent database update activity can sometimes lead to false positive errors (despite mitigations built into the checking algorithm).  As a result, errors reported from an integrity check run on production need to be evaluated and/or rechecked by an administrator.

Often, a better option exists.  A storage snapshot or backup image can be mounted on another host, where an isolated Caché or IRIS instance runs the integrity check.  Not only does this prevent any possibility of false positives, but if the storage is also isolated from production, integrity check can be run to fully utilize the storage bandwidth and complete much more quickly.  This approach fits well into the model where integrity check is used to validate backups; a validated backup effectively validates production as of the time the backup was made.  Cloud and virtualization platforms can also make it easier to establish a usable isolated environment from a snapshot.

 


* The Management Portal interface, the Integrity Check Task and the IntegrityCheck method of SYS.Database select a rather large number of processes (equal to the number of CPU cores), lacking the control that's needed in many situations. The Management Portal and the task also perform a complete recheck of any global that reported an error, in an effort to identify false positives that may have occurred due to concurrent updates. This recheck occurs above and beyond the false positive mitigation built into the integrity check algorithms, and it may be unwanted in some situations due to the additional time it takes (the recheck runs in a single process and checks the entire global). This behavior may be changed in the future.

Article
· Aug 28, 2020 2m read

Effective use of Collection Indexing and Querying Collections through SQL

Triggered by a question posted recently by @Kurro Lopez,
I took a closer look at the indexing of collections.
My simple test setup is a serial class and a persistent class with a list of this serial class.

Class rcc.IC.serItem Extends (%SerialObject, %Populate)
{ Property Subject As %String [ Required ]; 
  Property Change As %TimeStamp [ Required ]; 
  Property Color As %String(COLLATION = "EXACT", 
     VALUELIST = ",red,white,blue,yellow,black,unknown") [ Required ];
}
Class rcc.IC.ItemList Extends (%Persistent, %Populate) [ Final ]
{ Property Company As %String [ Required ]; 
  Property Region As list Of %String(COLLATION = "EXACT", POPSPEC = ":4",
     VALUELIST = ",US,CD,MX,EU,JP,AU,ZA") [ Required ];
  Property Items As list Of rcc.IC.serItem(POPSPEC = ":4") [ Required ];
 
  Index xitm On Items(ELEMENTS);
  Index ycol On Items(ELEMENTS).Color;
}

Index xitm holds the complete serial element. !!
With some records generated by the %Populate utility, I could run this query:

Select ID,Company from rcc_IC.ItemList
Where FOR SOME %ELEMENT(rcc_IC.ItemList.Items) ($list(%Value,3) in ('blue','yellow'))

This works OK, but disassembling every serial object didn't look very promising in terms of performance.
So I followed a hint from @Dan Pasco that I saw in this forum a few days ago,
and, expecting better performance, I added

Index ycol On Items(ELEMENTS).Color;

The result was rather disappointing.
No improvement.
Investigation of the query plan showed that the new index was just ignored.


 After some trials, this query satisfied my needs

Select ID,Company 
from %IGNOREINDEX xitm rcc_IC.ItemList
Where FOR SOME %ELEMENT(rcc_IC.ItemList.Items) ('blue,yellow' [ %Value )

with

During the investigation with many variations I found this rule:

If you have more than one ELEMENT index on the same property, the
query generator always takes the alphabetically first index it finds.
You have to explicitly exclude a non-fitting index.

As  there is no hint in the documentation I would like to know:

Is this observation correct or is it just an accidental effect in my case?

As ELEMENT index was designed for List of %String  I understand that  having
more than one index was just an unlikely case at the time of design.

Job
· Aug 18, 2020

InterSystems Ensemble/Health Connect Developer Required

We have an immediate requirement for an experienced InterSystems Ensemble/Health Connect consultant to join our team, with a good grounding in OO programming and healthcare integration, and at least 2 years of experience with InterSystems Ensemble/HealthShare Health Connect.

Question
· Aug 17, 2020

Passing values to the read prompts via cache routine?

Hi Developers 

Is there any way to pass values to Read prompts from a Caché routine?

For example, we have a couple of reports/routines in our system that accept some inputs and then generate some data. Right now they have a proper UI where the user enters the values, and the routines contain Read statements that accept those inputs for further processing.

Now we want to schedule all these reports in the Task Manager on Caché without modifying the existing routines, so I wanted to check whether there is any way to pass the values to those Read statements from a Caché routine.

To make it clearer, the following is a test routine. If we run it in the terminal, it prompts for two inputs. If we want to trigger the same routine from another routine, how do we supply the values for these Read commands?

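TestRun
 ; prompt for two values on the current device
 read !,"Enter First Variable Name",a
 read !,"Enter Second Variable Name",b
 ; store both inputs, separated by a space
 set ^zparas=a_" "_b
 quit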

 

 

Thanks in Advance.

Question
· Aug 12, 2020

Is there a way to trigger system functions from Alerting in Ensemble

We have a vendor that every couple of days will just stop transmitting messages, but still hold the TCP/IP connection open. No matter how many times we troubleshoot and talk with them, they don't seem to think it's an issue with their system.  Normally, if I just restart the service, the data starts flowing again.

I know the ideal is for them to fix the issue, but in the meantime I have set up an inactivity timeout alert.  I was wondering whether, with the correct filtering, there is a way to have the alert trigger a restart of the service when the inactivity alert fires during the business day.

Thanks

Scott
