How to optimize slow count distinct SQL
Hi guys,
I'm working on a project with my client.
They have a visit table which has about 7,000,000 records. The table is used by a search page on which 20+ conditions can be combined freely.
The table is defined as below:
CREATE TABLE SQLUser.ST_SEARCH_VISIT (
HQ_ORG_CODE VARCHAR(32),
HQ_ORG_NAME VARCHAR(32),
TENANT_ID VARCHAR(32),
GROUP_PATIENT_SN VARCHAR(32),
ORG_PATIENT_SN VARCHAR(32),
NAME VARCHAR(64),
SEX_CODE VARCHAR(2),
SEX_NAME VARCHAR(16),
BIRTH_DATE DATE,
MARRY_CODE VARCHAR(2),
MARRY_NAME VARCHAR(16),
IDENTIFY_TYPE_CODE VARCHAR(5),
IDENTIFY_TYPE_NAME VARCHAR(32),
IDENTIFY_NUMBER VARCHAR(50),
BLOOD_TYPE_CODE VARCHAR(2),
BLOOD_TYPE_NAME VARCHAR(12),
MOBILE VARCHAR(50),
MAILING_ADDRESS VARCHAR(127),
VISIT_SERIAL_NO VARCHAR(64),
TABLE_FLAG VARCHAR(1),
VISIT_TYPE_CODE VARCHAR(24),
VISIT_TYPE_NAME VARCHAR(64),
VISIT_DEPT_CODE VARCHAR(32),
VISIT_DEPT_NAME VARCHAR(64),
IN_OUT_VISIT_NO VARCHAR(32),
VISIT_TIME TIMESTAMP,
DISCHARGE_TIME TIMESTAMP,
IN_HOSPITAL_TIME TIMESTAMP,
ICD_CODE VARCHAR(20),
ICD_NAME VARCHAR(64),
HEALTH_CARD_NO VARCHAR(32),
HEALTH_CARD_TYPE VARCHAR(64),
SURG_CODE VARCHAR(20),
SURG_NAME VARCHAR(64),
AGE VARCHAR(10),
DISEASE_CODE INTEGER,
DISEASE_NAME VARCHAR(64),
LastChangedTime TIMESTAMP,
LastCreateTime TIMESTAMP,
GROUPORG VARCHAR(50)
);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_BLOOD_TYPE_CODE ON SQLUser.ST_SEARCH_VISIT (BLOOD_TYPE_CODE);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_HQ_ORG_CODE ON SQLUser.ST_SEARCH_VISIT (HQ_ORG_CODE);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_IDENTIFY_TYPE_CODE ON SQLUser.ST_SEARCH_VISIT (IDENTIFY_TYPE_CODE);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_MARRY_CODE ON SQLUser.ST_SEARCH_VISIT (MARRY_CODE);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_SEX_CODE ON SQLUser.ST_SEARCH_VISIT (SEX_CODE);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_TABLE_FLAG ON SQLUser.ST_SEARCH_VISIT (TABLE_FLAG);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_VISIT_DEPT_CODE ON SQLUser.ST_SEARCH_VISIT (VISIT_DEPT_CODE);
CREATE INDEX BITMAP_INDEX_ST_SEARCH_VISIT_VISIT_TYPE_CODE ON SQLUser.ST_SEARCH_VISIT (VISIT_TYPE_CODE);
CREATE INDEX INDEX_PI_PATIENT_INFO_HQ_ORG_CODE ON SQLUser.ST_SEARCH_VISIT (HQ_ORG_CODE,GROUP_PATIENT_SN);
CREATE INDEX INDEX_ST_SEARCH_VISIT_BIRTH_DATE ON SQLUser.ST_SEARCH_VISIT (BIRTH_DATE);
CREATE INDEX INDEX_ST_SEARCH_VISIT_DISCHARGE_TIME ON SQLUser.ST_SEARCH_VISIT (DISCHARGE_TIME);
CREATE INDEX INDEX_ST_SEARCH_VISIT_GROUP_PATIENT_SN ON SQLUser.ST_SEARCH_VISIT (GROUP_PATIENT_SN);
CREATE INDEX INDEX_ST_SEARCH_VISIT_IDENTIFY_NUMBER ON SQLUser.ST_SEARCH_VISIT (IDENTIFY_NUMBER);
CREATE INDEX INDEX_ST_SEARCH_VISIT_IN_HOSPITAL_TIME ON SQLUser.ST_SEARCH_VISIT (IN_HOSPITAL_TIME);
CREATE INDEX INDEX_ST_SEARCH_VISIT_IN_OUT_VISIT_NO ON SQLUser.ST_SEARCH_VISIT (IN_OUT_VISIT_NO);
CREATE INDEX INDEX_ST_SEARCH_VISIT_MOBILE ON SQLUser.ST_SEARCH_VISIT (MOBILE);
CREATE INDEX INDEX_ST_SEARCH_VISIT_NAME ON SQLUser.ST_SEARCH_VISIT (NAME);
CREATE INDEX INDEX_ST_SEARCH_VISIT_ORG_PATIENT_SN ON SQLUser.ST_SEARCH_VISIT (ORG_PATIENT_SN);
CREATE INDEX INDEX_ST_SEARCH_VISIT_VISIT_SERIAL_NO ON SQLUser.ST_SEARCH_VISIT (VISIT_SERIAL_NO);
CREATE INDEX INDEX_ST_SEARCH_VISIT_VISIT_TIME ON SQLUser.ST_SEARCH_VISIT (VISIT_TIME);
CREATE INDEX IdxGpsnorg ON SQLUser.ST_SEARCH_VISIT (GROUP_PATIENT_SN,HQ_ORG_CODE);
CREATE INDEX IdxMapOrg ON SQLUser.ST_SEARCH_VISIT (HQ_ORG_CODE);
CREATE INDEX IdxVisitorg ON SQLUser.ST_SEARCH_VISIT (VISIT_SERIAL_NO,HQ_ORG_CODE);
CREATE INDEX LastChangedTimeIndex ON SQLUser.ST_SEARCH_VISIT (LastChangedTime);
CREATE INDEX LastCreateTimeIndex ON SQLUser.ST_SEARCH_VISIT (LastCreateTime);
CREATE INDEX idxGO ON SQLUser.ST_SEARCH_VISIT (GROUPORG);
Now we have a slow SQL as below:
select count(1) from
(select distinct by (b.VISIT_SERIAL_NO,b.HQ_ORG_CODE) b.VISIT_SERIAL_NO,b.HQ_ORG_CODE from SQLUser.ST_SEARCH_VISIT b )
It is used to count the distinct visits across about 20 hospitals.
This query takes more than 10 seconds to finish on their server.
I ran a test on my virtual machine with the same table but with generated data (6,000,000 records), and there the query took about 16 seconds to finish.
I tried to optimize the SQL with the following variants:
------about 12.5s-------
select count(1) from (
select distinct by (b.VISIT_SERIAL_NO,b.HQ_ORG_CODE) b.VISIT_SERIAL_NO,b.HQ_ORG_CODE from %PARALLEL ST_SEARCH_VISIT b
group by b.HQ_ORG_CODE
)
------about 24s-------
select SUM(OrgSum) from (
select HQ_ORG_CODE,count(1) as OrgSum from %PARALLEL (
select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE
From %PARALLEL ST_SEARCH_VISIT
Group by HQ_ORG_CODE
)
group by HQ_ORG_CODE
)
So what optimizations could be done to make it faster? Thanks.
For a start:
select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE
can be changed to
select distinct VISIT_SERIAL_NO,HQ_ORG_CODE
It will do the same.
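As a rough illustration of what the plain DISTINCT form counts, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. It is not InterSystems IRIS (SQLite has no DISTINCT BY), and the table name visit and the sample rows are hypothetical. It only shows that, under standard SQL semantics, counting the rows of SELECT DISTINCT over the two columns gives the number of distinct (VISIT_SERIAL_NO, HQ_ORG_CODE) pairs. It says nothing about how IRIS behaves once a GROUP BY is added.

```python
import sqlite3

# Hypothetical miniature of ST_SEARCH_VISIT: only the two key columns.
rows = [
    ("V001", "H01"),
    ("V001", "H01"),  # duplicate row for the same visit in the same hospital
    ("V001", "H02"),  # same serial number, different hospital
    ("V002", "H01"),
    ("V003", "H02"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (VISIT_SERIAL_NO TEXT, HQ_ORG_CODE TEXT)")
conn.executemany("INSERT INTO visit VALUES (?, ?)", rows)

# Standard-SQL form: DISTINCT over exactly the columns being selected.
sql_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT DISTINCT VISIT_SERIAL_NO, HQ_ORG_CODE FROM visit"
    ")"
).fetchone()[0]

# Reference answer computed directly in Python: number of distinct pairs.
expected = len(set(rows))

print(sql_count, expected)
```

Both values agree on this sample, so the rewrite is safe in a standard-SQL engine. Whether IRIS returns identical counts for both spellings should still be checked on the real table.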
Secondly:
Could you please remove the %PARALLEL, click on "Show Plan", and post that plan here? It will help to determine where the query is slow. With so many indexes defined, it might be using the wrong one.
Lastly:
Have you tuned the table and checked the result after that?
Hi Stefan,
If I change
select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE
into
select distinct VISIT_SERIAL_NO,HQ_ORG_CODE
the resulting sum is quite different.
actually the second sql is equal to
It seems to be caused by the group by clause on a bitmap-indexed column; if I remove it, the result is consistent again.
I tuned the table and rebuilt the indices, but the result is the same.
These are the query plans:
If you want to use the group by, then you should probably do the count where the group by is being done, and use distinct by as you had it.
select distinct by(serial,hq) hq, count(1)
from thetable
group by hq
If you want it per that grouping.
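To make the "count where the group by is being done" idea concrete, here is a minimal sketch in Python with the built-in sqlite3 module as a stand-in. SQLite has no DISTINCT BY, so the sketch uses the standard COUNT(DISTINCT ...) with GROUP BY instead. The table name visit and the sample rows are hypothetical. It also shows that the per-hospital counts add up to the overall number of distinct (VISIT_SERIAL_NO, HQ_ORG_CODE) pairs, because the hospital code is part of the key.

```python
import sqlite3

# Hypothetical sample: visit serial numbers per hospital, with one duplicate row.
sample = [
    ("V001", "H01"),
    ("V001", "H01"),  # duplicate row, must be counted once
    ("V002", "H01"),
    ("V001", "H02"),  # same serial number in another hospital counts separately
    ("V003", "H02"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visit (VISIT_SERIAL_NO TEXT, HQ_ORG_CODE TEXT)")
db.executemany("INSERT INTO visit VALUES (?, ?)", sample)

# Distinct visits per hospital, computed in the grouped query itself.
per_org = db.execute(
    "SELECT HQ_ORG_CODE, COUNT(DISTINCT VISIT_SERIAL_NO) "
    "FROM visit GROUP BY HQ_ORG_CODE ORDER BY HQ_ORG_CODE"
).fetchall()

# Overall number of distinct (serial, hospital) pairs.
total_pairs = db.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT DISTINCT VISIT_SERIAL_NO, HQ_ORG_CODE FROM visit"
    ")"
).fetchone()[0]

print(per_org)
print(sum(n for _, n in per_org), total_pairs)
```

Summing the per-hospital counts is therefore equivalent to the single overall count, so the grouped form is only worth using when the per-hospital breakdown itself is needed.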
There are no filters in it, so it is going to do a full table scan and compare and calculate values for each row. Given the amount of work involved, the time it takes is actually fast.
Maybe look into bitslice indexes. They might help, but at the cost of performance on insert and update:
InterSystems Documentation
Hi Stefan.
It comes from a search page with 20+ possible conditions that can be combined freely. This case is one of the slowest queries we have found. In other cases, with more conditions acting as filters, the performance is OK.
I'll test whether a bitslice index works for this case.
Thanks.
There are a couple of points:
Generating 6000000 records
select count(distinct by(VISIT_SERIAL_NO,HQ_ORG_CODE) 1) from ST_SEARCH_VISIT
Row count: 1 Performance: 2.8704 seconds 6003078 global references 36016391 lines executed
select count(1) from (select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE from ST_SEARCH_VISIT)
Row count: 1 Performance: 0.0269 seconds 3079 global references 17711 lines executed
Hi Vitaliy,
Yes, Benjamin also found that the exported DDL is not consistent with the class definition, but the source code is so far controlled by the end user, so we'll need to wait until they can share it with us.
But I'm not sure what happened in your test with the Fill() method as mentioned above.
In the client's environment there are more than 7 million records, and the count distinct result is 40,000 less than the total number, which means only a very small portion of the (VISIT_SERIAL_NO, HQ_ORG_CODE) pairs is not unique. So I used the populate facility to generate 6 million rows, for which the count distinct result is more than 5.99 million. With that data the query is slow,
while the distinct by version is much better.
Both were run after tuning the table and rebuilding the indices.
You can find the class file I used in the test below. So far I have kept only two indexes to handle this case; if any other index is helpful, we can add it back.
Class User.STSEARCHVISIT Extends (%Persistent, %Populate) [ ClassType = persistent, DdlAllowed, Final, Owner = {SuperUser}, ProcedureBlock, SqlRowIdPrivate, SqlTableName = ST_SEARCH_VISIT ]
{
Property HQORGCODE As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 2, SqlFieldName = HQ_ORG_CODE ];
Property HQORGNAME As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 3, SqlFieldName = HQ_ORG_NAME ];
Property TENANTID As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 4, SqlFieldName = TENANT_ID ];
Property GROUPPATIENTSN As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 5, SqlFieldName = GROUP_PATIENT_SN ];
Property ORGPATIENTSN As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 6, SqlFieldName = ORG_PATIENT_SN ];
Property NAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 7 ];
Property SEXCODE As %Library.String(MAXLEN = 2) [ SqlColumnNumber = 8, SqlFieldName = SEX_CODE ];
Property SEXNAME As %Library.String(MAXLEN = 16) [ SqlColumnNumber = 9, SqlFieldName = SEX_NAME ];
Property BIRTHDATE As %Library.Date [ SqlColumnNumber = 10, SqlFieldName = BIRTH_DATE ];
Property MARRYCODE As %Library.String(MAXLEN = 2) [ SqlColumnNumber = 11, SqlFieldName = MARRY_CODE ];
Property MARRYNAME As %Library.String(MAXLEN = 16) [ SqlColumnNumber = 12, SqlFieldName = MARRY_NAME ];
Property IDENTIFYTYPECODE As %Library.String(MAXLEN = 5) [ SqlColumnNumber = 13, SqlFieldName = IDENTIFY_TYPE_CODE ];
Property IDENTIFYTYPENAME As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 14, SqlFieldName = IDENTIFY_TYPE_NAME ];
Property IDENTIFYNUMBER As %Library.String(MAXLEN = 50) [ SqlColumnNumber = 15, SqlFieldName = IDENTIFY_NUMBER ];
Property BLOODTYPECODE As %Library.String(MAXLEN = 2) [ SqlColumnNumber = 16, SqlFieldName = BLOOD_TYPE_CODE ];
Property BLOODTYPENAME As %Library.String(MAXLEN = 12) [ SqlColumnNumber = 17, SqlFieldName = BLOOD_TYPE_NAME ];
Property MOBILE As %Library.String(MAXLEN = 50) [ SqlColumnNumber = 18 ];
Property MAILINGADDRESS As %Library.String(MAXLEN = 127) [ SqlColumnNumber = 19, SqlFieldName = MAILING_ADDRESS ];
Property VISITSERIALNO As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 20, SqlFieldName = VISIT_SERIAL_NO ];
Property TABLEFLAG As %Library.String(MAXLEN = 1) [ SqlColumnNumber = 21, SqlFieldName = TABLE_FLAG ];
Property VISITTYPECODE As %Library.String(MAXLEN = 24) [ SqlColumnNumber = 22, SqlFieldName = VISIT_TYPE_CODE ];
Property VISITTYPENAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 23, SqlFieldName = VISIT_TYPE_NAME ];
Property VISITDEPTCODE As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 24, SqlFieldName = VISIT_DEPT_CODE ];
Property VISITDEPTNAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 25, SqlFieldName = VISIT_DEPT_NAME ];
Property INOUTVISITNO As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 26, SqlFieldName = IN_OUT_VISIT_NO ];
Property VISITTIME As %Library.TimeStamp [ SqlColumnNumber = 27, SqlFieldName = VISIT_TIME ];
Property DISCHARGETIME As %Library.TimeStamp [ SqlColumnNumber = 28, SqlFieldName = DISCHARGE_TIME ];
Property INHOSPITALTIME As %Library.TimeStamp [ SqlColumnNumber = 29, SqlFieldName = IN_HOSPITAL_TIME ];
Property ICDCODE As %Library.String(MAXLEN = 20) [ SqlColumnNumber = 30, SqlFieldName = ICD_CODE ];
Property ICDNAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 31, SqlFieldName = ICD_NAME ];
Property HEALTHCARDNO As %Library.String(MAXLEN = 32) [ SqlColumnNumber = 32, SqlFieldName = HEALTH_CARD_NO ];
Property HEALTHCARDTYPE As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 33, SqlFieldName = HEALTH_CARD_TYPE ];
Property SURGCODE As %Library.String(MAXLEN = 20) [ SqlColumnNumber = 34, SqlFieldName = SURG_CODE ];
Property SURGNAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 35, SqlFieldName = SURG_NAME ];
Property AGE As %Library.String(MAXLEN = 10) [ SqlColumnNumber = 36 ];
Property DISEASECODE As %Library.Integer(MAXVAL = 2147483647, MINVAL = -2147483648) [ SqlColumnNumber = 37, SqlFieldName = DISEASE_CODE ];
Property DISEASENAME As %Library.String(MAXLEN = 64) [ SqlColumnNumber = 38, SqlFieldName = DISEASE_NAME ];
Property LastChangedTime As %Library.TimeStamp [ SqlColumnNumber = 39 ];
Property LastCreateTime As %Library.TimeStamp [ SqlColumnNumber = 40 ];
Property GROUPORG As %Library.String(MAXLEN = 50) [ SqlColumnNumber = 41 ];
Parameter USEEXTENTSET = 1;
/// Bitmap Extent Index auto-generated by DDL CREATE TABLE statement. Do not edit the SqlName of this index.
Index DDLBEIndex [ Extent, SqlName = "%%DDLBEIndex", Type = bitmap ];
Index IdxMapOrg On HQORGCODE [ SqlName = IdxMapOrg, Type = index ];
Index IdxVisitorg On (VISITSERIALNO, HQORGCODE) [ SqlName = IdxVisitorg, Type = index ];
}
Thanks for the help from @Stefan Cronje and @Vitaliy Serdtsev
I'm using the SQL
select count(1) from (
select distinct by (VISIT_SERIAL_NO,HQ_ORG_CODE) VISIT_SERIAL_NO,HQ_ORG_CODE
from %PARALLEL ST_SEARCH_VISIT
)
which finishes in 2 seconds in my test environment.
Since the end user expects a query to finish within 3 seconds, I'll test its performance with their other conditions combined.
Thanks again for the help.