Question
· Aug 8

count() is too slow. Why?

I have a table with 5M rows, the table contains lab observation codes and display names, both columns have type varchar(2000) and both are indexed.

The query looks like:

select code_1_text, count(code_1_text)
  from demo.observation_lab
  group by code_1_text
  order by 2 desc

The table contains ~1000 distinct display names.

It takes 4 minutes for the query to complete on a VM with some pretty old Xeon, 4 cores, 32G RAM, NVME SSD and Linux on board.

Isn't it too slow? During the execution I see no active hardware resources consumption - CPU load is 12-25%, RAM is almost free, swap file is not used.

The execution plan is as follows:

Relative Cost = 3098072
• Call module F, which populates temp-file B.
• Call module J, which populates temp-file C.
• Read temp-file C, looping on count([value]) and a counter.
• For each row:
    - Output the row.
Module: F

• Divide master map demo.observation_lab.IDKEY into subranges of IDs.
• Call module A in parallel on each subrange, piping results into temp-file D.
• Read temp-file D, looping on a counter.
• For each row:
    - Check distinct values for %SQLUPPER(code_1_text) using temp-file B,
        subscripted by a hash of the %SQLUPPER(code_1_text).
    - For each distinct row:
        · Add a row to temp-file B, subscripted by the hash,
            with node data of %SQLUPPER(code_1_text).
    - Update the accumulated count([value]) in temp-file B,
        subscripted by the hash
Module: J

• Read temp-file B, looping on the hash subscript.
• For each row:
    - Add a row to temp-file C, subscripted by count([value]) and a counter,
        with node data of the uncollate expression.
Module: A

• Call module B, which populates temp-file A.
• Read temp-file A, looping on the hash subscript.
• For each row:
    - Add a row to temp-file D, subscripted by a counter, with node data of %SQLUPPER(code_1_text) and count([value]).
Module: B

• Read master map demo.observation_lab.IDKEY, looping on the subrange of ID.
• For each row:
    - Check distinct values for %SQLUPPER(code_1_text) using temp-file A,
        subscripted by a hash of the %SQLUPPER(code_1_text).
    - For each distinct row:
        · Add a row to temp-file A, subscripted by the hash,
            with node data of %SQLUPPER(code_1_text).
    - Update the accumulated count([value]) in temp-file A,
        subscripted by the hash

Why is it using temp files instead of RAM?

Also, if I rewrite the query to aggregate lab observation codes (which are much shorter than display names) the query starts fetching almost immediately.

Product version: IRIS 2025.1
Discussion (3)3
Log in or sign up to continue

The Temp file often times uses a process private global 

Process-private globals are written to the IRISTEMP database. In contrast to global variables, InterSystems IRIS does not treat a SET or KILL of a local variable or a process-private global as a journaled transaction event; rolling back the transaction has no effect on these operations.

That being said as Peter mentioned if you had an index on code_1_text you could greatly improve performance.  In fact I suspect this type of query would be completely index satisfiable, ie the query would only examine the index and would not have to go to the mastermap for even better performance.  

Depending on the legth of values for code_1_text if you do chose to add an index you might want to define the property as

Property code_!_text as %String(COLLATION="SQLUPPER(113)",MAXLEN=32000);

by setting the collation and length.  If the values exceed the maximum global subscript length and you do not do this you could encounter <subscript> errors.

A subscript has an illegal value or a global reference is too long. For further details, refer to $ZERROR. For more information on maximum length of global references, see “Determining the Maximum Length of a Subscript”.