Article
· Apr 12 8m read

Determining Global and Table Sizes in InterSystems IRIS

Spoilers: Daily Integrity Checks are not only a best practice, but they also provide a snapshot of global sizes and density. 

Tracking the size of the data is one of the most important activities for understanding system health, tracking user activity, and for capacity planning ahead of your procurement process. InterSystems products store data in a tree-structure called globals. This article discusses how to determine global sizes – and therefore the size of your data. The focus is on balancing impact versus precision. 

SQL tables are simply a projection of underlying globals. Looking at table size currently means needing to look at the corresponding global sizes. A more efficient sampling-based mechanism is currently being developed. Understanding the relationship between tables and globals may require some additional steps, discussed below.

Data

The specific data that needs to be collected varies depending on the specific question you're trying to answer. There is a fundamental difference between the space "allocated" for a global and the space "used" by a global which is worth considering. In general, the allocated space is usually sufficient as it corresponds to the space used on disk. However, there are situations where the used space and packing data are essential -- e.g. when determining if a global is being stored efficiently following a large purge of data. 

  • Allocated Space - These are units of 8KB blocks. Generally, only one global can use one block. Therefore, even the smallest global occupies at least 8KB. This is also functionally the size on disk of the global. Determining allocated space only requires examining bottom-pointer blocks (and data-blocks which contain big-strings). Except in rare or contrived scenarios, there are typically multiple orders of magnitude fewer pointer blocks than data blocks.  This metric is usually sufficient to understand growth trends if collected on a regular basis.
  • Used Space – “Used” is the sum of the data stored within the global and the necessary overhead. Globals often allocate more space on disk than is actually “used” as a function of usage patterns and our block structure.
    • Packing: Calculating the actual space used will also provide information about the global “packing” – how densely the data is stored. It can sometimes be necessary or desirable to store the globals more efficiently -- especially if they are not frequently updated. For systems with random updates, inserts, or deletes, a packing of 70% is generally considered optimal for performance. This value fluctuates based on activity. Spareness most often correlates with deletions. 
    • IO Cost: Unfortunately, with great precision comes great IO requirements. Iterating 8KB block by 8KB block through a large database will not only take a long time, but it may also negatively impact performance on systems that are already close to their provisioned limits. This is much more expensive than determining if a block is allocated. This operation will take on the order of (# of parallel processes) / (read latency) * (database size – free space) to return an answer. 

InterSystems provides several tools for determining the size of globals within a particular database. Generally, both the global name and the full path of the underlying database directory need to be known in order to determine the size. For more complex deployments, math is required to determine the total size of a global spread across multiple databases via subscript level mapping.

Determining Global Names:

  • Use the Extent Manager to list the globals associated with a table:
    • SQL: Call %ExtentMgr.GlobalsUsed('Package.Class.cls')
  • Review the storage definition within the Management Portal, within VS Code (or Studio), or by querying %Dictionary.StorageDefinition.
    • SQL: SELECT DataLocation FROM %Dictionary.StorageDefinition WHERE parent = 'Package.ClassName'
    • ObjectScript: write ##class(%Dictionary.ClassDefinition).%OpenId("Package.ClassName").Storages.GetAt(1).DataLocation
  • Hashed Global Names are common when the tables are defined using DDLs, i.e. CREATE TABLE. This behavior can be modified by specifying USEEXTENTSET and DEFAULTGLOBAL. Using hashed global names and storing only one index per global have shown performance benefits. I use the following query to list the non-obvious globals in a namespace
    • SQL for All Classes:
      SELECT Parent, DataLocation, IndexLocation, StreamLocation 
      FROM %Dictionary.StorageDefinition
      WHERE Parent->System = 0 AND DataLocation IS NOT NULL
    • SQL for Specific Classes:
      CALL %ExtentMgr.GlobalsUsed('Package.Class.cls')

Determining Database Path:

  • For the simplest deployments where a namespace does not have additional global mappings for application data, it is often possible to substitute "." for the directory. That syntactic sugar will tell the API to look at the current directory for the current namespace.
  • For SQL oriented deployments, CREATE DATABASE follows our best practices and creates TWO databases -- one for code and one for data. It’s best to verify the default globals database for the given Namespace in the Management Portal or in the CPF.
  • It is possible to programmatically determine the destination directory for a particular global (or subscript) in the current namespace:
    • ObjectScript:
      set global = "globalNameHere" 
      set directory = $E(##class(%SYS.Namespace).GetGlobalDest($NAMESPACE, global),2,*)
  • For more complex deployments with many mappings, it may be necessary to iterate through Config.MapGlobals in the %SYS Namespace and sum the global sizes:
    • SQL: SELECT Namespace, Database, Name FROM Config.MapGlobals

Determining Global Sizes:

Once the name of the global and the destination database path are determined, it is possible to collect information on the global size. Here are a few options:  

  • Integrity Check – Nightly Integrity Checks are a good practice. An even better practice is to perform them against a restored backup to also verify the backup and restore process while offloading the IO to another system. This process verifies the physical integrity of the database blocks by reading each allocated block. It also tabulates both the allocated space of all the globals AND tracks the average packing of the blocks along the way.
    • See Ray’s great post on Integrity Check performance.
    • In IRIS 2022.1+, Integrity Checks can now even multi-process a single global.
    • Example Integrity Check Output: 
      Global: Ens.MessageHeaderD                            0 errors found   
      Top Pointer Level:    # of blocks=1      8kb (2% full)   
      Pointer Level:        # of blocks=25      200kb (19% full)   
      Bottom Pointer Level: # of blocks=3,257      25MB (79% full)   
      Data Level:           # of blocks=2,630,922      20,554MB (88% full)   
      Total:                # of blocks=2,634,205      20,579MB (88% full)   
      Elapsed Time = 238.4 seconds, Completed 01/17/2023 23:41:12  
  • %Library.GlobalEdit.GetGlobalSize – The following APIs can be used to quickly determine the allocated size of a single global. This may still take some time for multi-TB globals.
    • ObjectScript: w ##class(%Library.GlobalEdit).GetGlobalSize(directory, globalName, .allocated, .used, 1)
    • Embedded Python:
      import iris 
      allocated = iris.ref("")
      used =iris.ref("")
      fast=1
      directory = "/path/to/database"
      global = "globalName"
      iris.cls('%Library.GlobalEdit').GetGlobalSize(directory, global, allocated, used, fast)
      allocated.value
      used.value
  • %Library.GlobalEdit.GetGlobalSizeBySubscript – This is helpful for determining the size of subscript or subscript range. E.g. Determine the size of one index. It will include all descendants within the specified range. Warning: as of IRIS 2023.1 there is not a “fast” flag to only return the allocated size. It will read all of the data blocks within the range.
    • ObjectScript: ##class(%Library.GlobalEdit).GetGlobalSizeBySubscript(directory, startingNode, EndingNode, .size)
  • %SYS.GlobalQuery.Size – This API is helpful for surveying multiple globals within a database, with or without filters. A SQL Stored Procedure available for customers that primarily interact with IRIS via SQL.
    • SQL: CALL %SYS.GlobalQuery_Size('database directory here', '','*',,,1)
  • ^%GSIZE – Executing this legacy utility and choosing to “show details” will read each data block to determine the size of the data. Without filtering the list of globals, it may read through almost the entire database block by block with a single thread.
    • Running ^%GSIZE with details is the slowest option for determining global sizes. It much slower than our heavily optimized Integrity Checks! 
    • There is an additional entry point that will return the allocated size for a particular global – including when scoped to a subscript. Unfortunately, it does not work on subscript ranges.
    • ObjectScript: write $$AllocatedSize^%GSIZE("global(""subscript"")")
  • Database Size – The easiest case for determining global size is when there is only a single global within a single database. Simply subtract the total free space within the database from the total size of the database. The database size is available from the OS or via SYS.Database. I often use a variation of this approach to determine the size of a disproportionately large global by subtracting the sum of all the other globals in the database.
    • ObjectScript: ##class(%SYS.DatabaseQuery).GetDatabaseFreeSpace(directory, .FreeSpace)
    • SQL: call %SYS.DatabaseQuery_FreeSpace()
    • Embedded Python:
    • import iris
      freeSpace = iris.ref("")
      directory = "/path/to/database"
      iris.cls('%SYS.DatabaseQuery').GetDatabaseFreeSpace(directory, freeSpace)
      freeSpace.value  
  • Process Private Globals - PPGs are special process-scoped globals stored within IRISTEMP. They are often not enumerated by the other tools. When IRISTEMP is expanding rapidly or reporting low freespace, PPGs are frequently the explanation. Consider examining the per process usage of PPGs via %SYS.ProcessQuery. 
    • SQL: SELECT PID, PrivateGlobalBlockCount FROM %SYS.ProcessQuery ORDER BY PrivateGlobalBlockCount DESC

Questions for the readers:

  1. How often do you track your global sizes?
  2. What do you do with global size information?  
  3. For SQL focused users, do you track the size of individual indices?
Discussion (4)2
Log in or sign up to continue

Chad, thank you for complete explanation of available options. As to you questions:

1. We have a TASKMGR task which calculates the size of each global in all databases. It's usually scheduled by our customers for daily run.
2. The main purpose of collecting such info is the ability to quickly answer the questions like this: "why my database is growing so fast?". Integrity Check is not used for the similar purpose because it can't be scheduled for daily run due to its relative slowness in our versions of Cache and IRIS.

Great article Chad!

FWIW, we're working on a faster version of ^%GSIZE (with an API more fitting the current century ;-) ) that uses stochastic sampling similar to the faster table stats gathering introduced in 2021.2? I'll also take the opportunity for a shameless plug of my SQL utilities package, which incorporates much of what's described in this article and will take advantage of that faster global size estimator as soon as it's released.