Question
· Feb 12, 2023

How to optimize slow count distinct SQL

Hi guys,

I'm working on a project with my client.
They have a visit table which has about 7,000,000 records. The table is used in a random search page witch holds 20+ conditions to be combined.
The table is defined as below:

CREATE TABLE SQLUser.ST_SEARCH_VISIT (

HQ_ORG_CODE VARCHAR(32),

HQ_ORG_NAME VARCHAR(32),

TENANT_ID VARCHAR(32),

GROUP_PATIENT_SN VARCHAR(32),

ORG_PATIENT_SN VARCHAR(32),

NAME VARCHAR(64),

SEX_CODE VARCHAR(2),

SEX_NAME VARCHAR(16),

BIRTH_DATE DATE,

MARRY_CODE VARCHAR(2),

MARRY_NAME VARCHAR(16),

IDENTIFY_TYPE_CODE VARCHAR(5),

IDENTIFY_TYPE_NAME VARCHAR(32),

IDENTIFY_NUMBER VARCHAR(50),

BLOOD_TYPE_CODE VARCHAR(2),

BLOOD_TYPE_NAME VARCHAR(12),

MOBILE VARCHAR(50),

MAILING_ADDRESS VARCHAR(127),

VISIT_SERIAL_NO VARCHAR(64),

TABLE_FLAG VARCHAR(1),

VISIT_TYPE_CODE VARCHAR(24),

VISIT_TYPE_NAME VARCHAR(64),

VISIT_DEPT_CODE VARCHAR(32),

VISIT_DEPT_NAME VARCHAR(64),

IN_OUT_VISIT_NO VARCHAR(32),

VISIT_TIME TIMESTAMP,

DISCHARGE_TIME TIMESTAMP,

IN_HOSPITAL_TIME TIMESTAMP,

ICD_CODE VARCHAR(20),

ICD_NAME VARCHAR(64),

HEALTH_CARD_NO VARCHAR(32),

HEALTH_CARD_TYPE VARCHAR(64),

SURG_CODE VARCHAR(20),

SURG_NAME VARCHAR(64),

AGE VARCHAR(10),

DISEASE_CODE INTEGER,

DISEASE_NAME VARCHAR(64),

LastChangedTime TIMESTAMP,

LastCreateTime TIMESTAMP,

GROUPORG VARCHAR(50)

);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_BLOOD_TYPE_CODE ON SQLUser.ST_SEARCH_VISIT (BLOOD_TYPE_CODE);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_HQ_ORG_CODE ON SQLUser.ST_SEARCH_VISIT (HQ_ORG_CODE);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_IDENTIFY_TYPE_CODE ON SQLUser.ST_SEARCH_VISIT (IDENTIFY_TYPE_CODE);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_MARRY_CODE ON SQLUser.ST_SEARCH_VISIT (MARRY_CODE);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_SEX_CODE ON SQLUser.ST_SEARCH_VISIT (SEX_CODE);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_TABLE_FLAG ON SQLUser.ST_SEARCH_VISIT (TABLE_FLAG);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_VISIT_DEPT_CODE ON SQLUser.ST_SEARCH_VISIT (VISIT_DEPT_CODE);

CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_VISIT_TYPE_CODE ON SQLUser.ST_SEARCH_VISIT (VISIT_TYPE_CODE);

CREATE INDEX INDEX_PI_PATIENT_INFO_HQ_ORG_CODE ON SQLUser.ST_SEARCH_VISIT (HQ_ORG_CODE,GROUP_PATIENT_SN);

CREATE INDEX INDEX_ST_SEARCH_VISIT_BIRTH_DATE ON SQLUser.ST_SEARCH_VISIT (BIRTH_DATE);

CREATE INDEX INDEX_ST_SEARCH_VISIT_DISCHARGE_TIME ON SQLUser.ST_SEARCH_VISIT (DISCHARGE_TIME);

CREATE INDEX INDEX_ST_SEARCH_VISIT_GROUP_PATIENT_SN ON SQLUser.ST_SEARCH_VISIT (GROUP_PATIENT_SN);

CREATE INDEX INDEX_ST_SEARCH_VISIT_IDENTIFY_NUMBER ON SQLUser.ST_SEARCH_VISIT (IDENTIFY_NUMBER);

CREATE INDEX INDEX_ST_SEARCH_VISIT_IN_HOSPITAL_TIME ON SQLUser.ST_SEARCH_VISIT (IN_HOSPITAL_TIME);

CREATE INDEX INDEX_ST_SEARCH_VISIT_IN_OUT_VISIT_NO ON SQLUser.ST_SEARCH_VISIT (IN_OUT_VISIT_NO);

CREATE INDEX INDEX_ST_SEARCH_VISIT_MOBILE ON SQLUser.ST_SEARCH_VISIT (MOBILE);

CREATE INDEX INDEX_ST_SEARCH_VISIT_NAME ON SQLUser.ST_SEARCH_VISIT (NAME);

CREATE INDEX INDEX_ST_SEARCH_VISIT_ORG_PATIENT_SN ON SQLUser.ST_SEARCH_VISIT (ORG_PATIENT_SN);

CREATE INDEX INDEX_ST_SEARCH_VISIT_VISIT_SERIAL_NO ON SQLUser.ST_SEARCH_VISIT (VISIT_SERIAL_NO);

CREATE INDEX INDEX_ST_SEARCH_VISIT_VISIT_TIME ON SQLUser.ST_SEARCH_VISIT (VISIT_TIME);

CREATE INDEX IdxGpsnorg ON SQLUser.ST_SEARCH_VISIT (GROUP_PATIENT_SN,HQ_ORG_CODE);

CREATE INDEX IdxMapOrg ON SQLUser.ST_SEARCH_VISIT (HQ_ORG_CODE);

CREATE INDEX IdxVisitorg ON SQLUser.ST_SEARCH_VISIT (VISIT_SERIAL_NO,HQ_ORG_CODE);

CREATE INDEX LastChangedTimeIndex ON SQLUser.ST_SEARCH_VISIT (LastChangedTime);

CREATE INDEX LastCreateTimeIndex ON SQLUser.ST_SEARCH_VISIT (LastCreateTime);

CREATE INDEX idxGO ON SQLUser.ST_SEARCH_VISIT (GROUPORG);

 

Now we have a slow SQL as below:

select count(1) from

(select distinct by (b.VISIT_SERIAL_NO,b.HQ_ORG_CODE) b.VISIT_SERIAL_NO,b.HQ_ORG_CODE  from SQLUser.ST_SEARCH_VISIT b )

Which is used to count distinct visits in about 20 hospitals.

This SQL will take more than 10 seconds to finish on their server.

I did a test on my virtual machines with the same table but with populated data(6,000,000 records) then it will took about 16 seconds to finish the search.

I tried to optimize the sql with the followings

------about 12.5s-------

select count(1) from (
    select distinct by (b.VISIT_SERIAL_NO,b.HQ_ORG_CODE) b.VISIT_SERIAL_NO,b.HQ_ORG_CODE  from %PARALLEL ST_SEARCH_VISIT b
    group by b.HQ_ORG_CODE
)

------about 24s-------

select SUM(OrgSum) from (

select HQ_ORG_CODE,count(1) as OrgSum from %PARALLEL (

select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE

From %PARALLEL ST_SEARCH_VISIT

Group by HQ_ORG_CODE

)

group by HQ_ORG_CODE

)

 

So what optimizations could be done to make it faster? thanks.

Discussion (7)3
Log in or sign up to continue

For a start:

select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE
can be changed to
select distinct VISIT_SERIAL_NO,HQ_ORG_CODE

It will do the same.

Secondly:
Will you please remove the %parallel and click on "Show Plan". Post that plan here. It will help to determine where the query is slow. It might be using the wrong index. There are many.

Lastly:
Have you tuned the table and checked the result after that?

Hi Stefan,

If I change 

select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE

into select distinct VISIT_SERIAL_NO,HQ_ORG_CODE,

The sum number is quite different

actually the second sql is equal to 

it seems to be caused by the group by clause on a bitmap index, if I remove it, the result is then consistent

I tuned the table and rebuild the indices but the result is the same.

The followings are the query plans

If you want to use the group by, then you should probably do the count where the group by is being done, and use distinct by as you had it.

select distinct by(serial,hq) hq, count(1)
from thetable
group by hq

If you want it per that grouping.

There are no filters in it, so it is going to do a full table scan and compare and calculate values for each row. Taking the amount of time it takes, it is actually fast.

Maybe look into Bitslice indexes. It might help but at a cost of performance on insert and update:
InterSystems Documentation

There are a couple of points:

  • could you please publish the source code of the class, and not the SQL script? The fact is that the command
    CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_BLOOD_TYPE_CODE ON SQLUser.ST_SEARCH_VISIT (BLOOD_TYPE_CODE)
    creates a regular index, not a bitmap index as expected. I'm afraid my code may not match yours.
  • I created a test class with test data on Caché 2018.x and IRIS 2023.1FT and I have your query executed very quickly. Details below.
 
User.STSEARCHVISIT.cls

Generating 6000000 records

USER>##class(User.STSEARCHVISIT).Fill(6e6)
OK
explain select count(distinct by(VISIT_SERIAL_NO,HQ_ORG_CODE) 1) from ST_SEARCH_VISIT

<plans>
  <plan>
    <sql>
      select count(distinct by(VISIT_SERIAL_NO,HQ_ORG_CODE) 1) from ST_SEARCH_VISIT /*#OPTIONS {"DynamicSQL":1} */
    </sql>
    <cost value="29794316"/>
    Call module B. Output the row.
    <module name="B" top="1">
      Read index map ST_SEARCH_VISIT.IdxVisitorg, looping on %SQLUPPER(VISIT_SERIAL_NO), %SQLUPPER(HQ_ORG_CODE), and ID. For each row: Check distinct values for %SQLUPPER(HQ_ORG_CODE) and %SQLUPPER(VISIT_SERIAL_NO) using a temp-file. For each distinct row: Accumulate the count([value]).
    </module>
  </plan>
</plans>

select count(distinct by(VISIT_SERIAL_NO,HQ_ORG_CODE) 1) from ST_SEARCH_VISIT

Row count: 1 Performance: 2.8704 seconds 6003078 global references 36016391 lines executed

explain select count(1) from (select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE  from ST_SEARCH_VISIT)

<plans>
  <plan>
    <sql>
      select count(1) from (select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE from ST_SEARCH_VISIT) /*#OPTIONS {"DynamicSQL":1} */
    </sql>
    <cost value="21000"/>
    Call module B. Output the row.
    <module name="B" top="1">
      Read index map ST_SEARCH_VISIT.IdxVisitorg, looping on %SQLUPPER(VISIT_SERIAL_NO) and %SQLUPPER(HQ_ORG_CODE). For each row: Call module I. Accumulate the count([value]).
    </module>
    <module name="I" top="1">
      Read index map ST_SEARCH_VISIT.IdxVisitorg, using the given %SQLUPPER(VISIT_SERIAL_NO) and %SQLUPPER(HQ_ORG_CODE), and looping on ID. For each row: Accumulate the [value].
    </module>
  </plan>
</plans>

select count(1) from (select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE from ST_SEARCH_VISIT)

Row count: 1 Performance: 0.0269 seconds 3079 global references 17711 lines executed

Hi Vitaliy,

Yes Benjamin also find out that the exported DDL is not consistent to the class definition, but users source code is so far controlled by the end user thus we'll need to wait until they can share it with us.

But I'm not sure what happened in your test with the Fill() method as mentioned above.

In the clients env, there are more than 7 million records and the result of count distinct end is 40,000 less than the total number, which means only a very small portion of VISIT_SERIAL_NO,HQ_ORG_CODE is not unique. Thus I used populate facility to generate 6 million data and the result of count distinct is more than 5.99 million. Then the query is slow.

while the distinct by version is much better

Both was run after tune table and rebuild indecies

You can find the class file I used in the test as below, so far I only used too index to handle this case, if any other index is helpful, we may add it back

Class User.STSEARCHVISIT Extends (%Persistent, %Populate) [ ClassType = persistent, DdlAllowed, Final, Owner = {SuperUser}, ProcedureBlock, SqlRowIdPrivate, SqlTableName = ST_SEARCH_VISIT ]
{

Property HQORGCODE As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 2, SqlFieldName = HQ_ORG_CODE ];
Property HQORGNAME As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 3, SqlFieldName = HQ_ORG_NAME ];
Property TENANTID As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 4, SqlFieldName = TENANT_ID ];
Property GROUPPATIENTSN As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 5, SqlFieldName = GROUP_PATIENT_SN ];
Property ORGPATIENTSN As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 6, SqlFieldName = ORG_PATIENT_SN ];
Property NAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 7 ];
Property SEXCODE As %Library.String(MAXLEN = 2) [ SqlColumnNumber = 8, SqlFieldName = SEX_CODE ];
Property SEXNAME As %Library.String(MAXLEN = 16) [ SqlColumnNumber = 9, SqlFieldName = SEX_NAME ];
Property BIRTHDATE As %Library.Date [ SqlColumnNumber = 10, SqlFieldName = BIRTH_DATE ];
Property MARRYCODE As %Library.String(MAXLEN = 2) [ SqlColumnNumber = 11, SqlFieldName = MARRY_CODE ];
Property MARRYNAME As %Library.String(MAXLEN = 16) [ SqlColumnNumber = 12, SqlFieldName = MARRY_NAME ];
Property IDENTIFYTYPECODE As %Library.String(MAXLEN = 5) [ SqlColumnNumber = 13, SqlFieldName = IDENTIFY_TYPE_CODE ];
Property IDENTIFYTYPENAME As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 14, SqlFieldName = IDENTIFY_TYPE_NAME ];
Property IDENTIFYNUMBER As %Library.String(MAXLEN = 50) [ SqlColumnNumber = 15, SqlFieldName = IDENTIFY_NUMBER ];
Property BLOODTYPECODE As %Library.String(MAXLEN = 2) [ SqlColumnNumber = 16, SqlFieldName = BLOOD_TYPE_CODE ];
Property BLOODTYPENAME As %Library.String(MAXLEN = 12) [ SqlColumnNumber = 17, SqlFieldName = BLOOD_TYPE_NAME ];
Property MOBILE As %Library.String(MAXLEN = 50) [ SqlColumnNumber = 18 ];
Property MAILINGADDRESS As %Library.String(MAXLEN = 127) [ SqlColumnNumber = 19, SqlFieldName = MAILING_ADDRESS ];
Property VISITSERIALNO As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 20, SqlFieldName = VISIT_SERIAL_NO ];
Property TABLEFLAG As %Library.String(MAXLEN = 1) [ SqlColumnNumber = 21, SqlFieldName = TABLE_FLAG ];
Property VISITTYPECODE As %Library.String(MAXLEN = 24) [ SqlColumnNumber = 22, SqlFieldName = VISIT_TYPE_CODE ];
Property VISITTYPENAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 23, SqlFieldName = VISIT_TYPE_NAME ];
Property VISITDEPTCODE As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 24, SqlFieldName = VISIT_DEPT_CODE ];
Property VISITDEPTNAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 25, SqlFieldName = VISIT_DEPT_NAME ];
Property INOUTVISITNO As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 26, SqlFieldName = IN_OUT_VISIT_NO ];
Property VISITTIME As %Library.TimeStamp [ SqlColumnNumber = 27, SqlFieldName = VISIT_TIME ];
Property DISCHARGETIME As %Library.TimeStamp [ SqlColumnNumber = 28, SqlFieldName = DISCHARGE_TIME ];
Property INHOSPITALTIME As %Library.TimeStamp [ SqlColumnNumber = 29, SqlFieldName = IN_HOSPITAL_TIME ];
Property ICDCODE As %Library.String(MAXLEN = 20) [ SqlColumnNumber = 30, SqlFieldName = ICD_CODE ];
Property ICDNAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 31, SqlFieldName = ICD_NAME ];
Property HEALTHCARDNO As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 32, SqlFieldName = HEALTH_CARD_NO ];
Property HEALTHCARDTYPE As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 33, SqlFieldName = HEALTH_CARD_TYPE ];
Property SURGCODE As %Library.String(MAXLEN = 20) [ SqlColumnNumber = 34, SqlFieldName = SURG_CODE ];
Property SURGNAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 35, SqlFieldName = SURG_NAME ];
Property AGE As %Library.String(MAXLEN = 10) [ SqlColumnNumber = 36 ];
Property DISEASECODE As %Library.Integer(MAXVAL = 2147483647, MINVAL = -2147483648) [ SqlColumnNumber = 37, SqlFieldName = DISEASE_CODE ];
Property DISEASENAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 38, SqlFieldName = DISEASE_NAME ];
Property LastChangedTime As %Library.TimeStamp [ SqlColumnNumber = 39 ];
Property LastCreateTime As %Library.TimeStamp [ SqlColumnNumber = 40 ];
Property GROUPORG As %Library.String(MAXLEN = 50) [ SqlColumnNumber = 41 ];
Parameter USEEXTENTSET = 1;
/// Bitmap Extent Index auto-generated by DDL CREATE TABLE statement.  Do not edit the SqlName of this index.
Index DDLBEIndex [ Extent, SqlName = "%%DDLBEIndex", Type = bitmap ];
Index IdxMapOrg On HQORGCODE [ SqlName = IdxMapOrg, Type = index ];
Index IdxVisitorg On (VISITSERIALNO, HQORGCODE) [ SqlName = IdxVisitorg, Type = index ];
}

Thanks for the help from @Stefan Cronje and @Vitaliy Serdtsev 
I'm using the SQL

select count(1) from (

select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE

from %PARALLEL ST_SEARCH_VISIT

)

which ends in 2 seconds in my test env

Since the end user is expecting a query ends in 3 seconds, I'll test its performance with their other conditions combined.

Thanks again for the help.