Improve SQL Performance for Date Range Queries

Primary tabs

Date range queries going too slow for you?  SQL Performance got you down?  I have one weird trick that might just help you out! (SQL Developers hate this!)*

If you have a class that records timestamps when the data is added, then that data will be in sequence with your IDKEY values - that is, TimeStamp< TimeStampif and only if ID1 < IDfor all IDs and TimeStamp values in table - then you can use this knowledge to increase performance for queries against TimeStamp ranges.  Consider the following table:

Class User.TSOrder extends %Persistent 
{ 

Property TS as %TimeStamp;

Property Data as %String (MAXLEN=100, MINLEN=200);

Index TSIdx on TS;

Index Extent [type=bitmap, extent];

}

Populating this with 30,000,000 random rows with dates over the last 30 days, will give you 1,000,000 rows per day.  Now if we want to query the information for a given day you might write the following

SELECT ID, TS, Data 
FROM TSOrder
WHERE 
     TS >= '2016-07-01 00:00:00.00000' AND 
     TS <= '2016-07-01 23:59:59.999999'

A reasonable query, to be sure.  On my system, however, this took 2,172,792 global references and 7.2 seconds.  But, knowing that the IDs and TimeStamps are in the same order, we can use the TimeStamps to get an ID range.  Consider the following query:

SELECT ID, TS, Data
FROM TSOrder
WHERE 
     ID >= (SELECT TOP 1 ID FROM TSOrder WHERE TS >='2016-07-01 00:00:00.00000' ORDER BY TS ASC) AND 
     ID <= (SELECT TOP 1 ID FROM TSOrder WHERE TS <='2016-07-01 23:59:59.999999' ORDER BY TS DESC)

The new query completes in 5.1 seconds and takes only 999,985 global references**!  

This technique can be applied more pragmatically to tables with more indexed fields and queries that have multiple WHERE clauses.  The ID range generated from the subqueries can be put into bitmap format, generating blazing speed when you get a multi-index solution.   The Ens.MessageHeader table is a great example where you can put this trick to work.

Let's be clear - this is an EXAMPLE of a win.  If you have many conditions in the WHERE clause in the same table (and they are indexed, duh!), then this technique can give you BIGGER wins!  Try it out on your queries!  

* SQL Developers don't really hate this, but if the internet has taught us anything is that catchy blurbs get more traffic.

** When testing queries that return so many rows, the SMP cannot handle it, and most of the time is taken in displaying the data.  The proper way to test is with embedded or dynamic SQL, running through the results, but not outputting them for time, and using the SQL Shell for your global counts.  You could also use SQL Stats for that.

  • + 16
  • 1
  • 7338
  • 7

Comments

+1

This is what the Ensemble message viewer does when the search criteria includes a date/time range.

Hi, Kyle! 

Thank you, great stuff!

The guys in Russian forum says there is a bug in the line:

Property Data as %String (MAXLEN=100MINLEN=200);

How can MINLEN be higher MAXLEN?

 

The premise is that when you have a timestamp property that’s set at the time of row insert, there will be a guarantee of those timestamps being “in order” with respect to Row IDs.

At first glance that sounds reasonable, and is probably almost always true, but I’m not sure it’s guaranteed.  Isn’t there a race condition around saving the timestamp and generating the new row ID?

That is, couldn’t you have a flow like this:

Process 1, Step 1:  Call %Save.  In %OnSave:  set ..CreatedAt = $zts  (let’s say this gives me 2018-06-01 16:00:00.000)

Process 2, Step 1:  Call %Save.  In %OnSave:  set ..CreatedAt = $zts   (let’s say this gives me 2018-06-01 16:00:00.010)  << +10ms

Process 2, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 1)

Process 1, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 2)

Is that likely?  Definitely not, but I don't think it's impossible.


Actually, it might be fairly likely in an ECP environment where multiple ECP Clients are inserting data into the same table, one reason being that system clocks could be out of sync by a few milliseconds.


Does that make sense, or am I missing something?  For example, would this all be okay unless I did something dumb with Concurrency?  If so, would that still be the case in an ECP environment?

Clayton,
I agree with you, especially for fast inserting multi-server environments.
In "slow" environments there is less risk.
It depends on where you set your timestamp. 
So %OnBeforeSave might provide the smallest possible gap.
 
 

Here is my solution, which has a number of advantages:

  1. there are no restrictions to the order of ID
  2. you can search in queries for the entire timestamp or for parts of it
  3. standard, reliable and proven Caché features are used
  4. this works very, very fast.

See my article for details: Indexing of non-atomic attributes

Class dc.TSOrder Extends (%Persistent%Populate)
{

Index Extent [ ExtentType = bitmap ];

Index iSmartTS On (TS(KEYS), TS(ELEMENTS));

Index iTS On TS;

Property TS As %TimeStamp(MAXVAL "2016-07-31 23:59:59.999999"MINVAL "2016-07-01 00:00:00.000000");

Property Data As %String(MAXLEN 200MINLEN 100);

ClassMethod TSBuildValueArray(
  value,
  ByRef arrayAs %Status
{
  value="" {
    array(0)=value
  }else{
    date=$p(value," ",1),
    time=$p(value," ",2),
    array("date")=date,
    array("time")=time,
    array("ddhh")=$p(date,"-",3)_"-"_$p(time,":",1)
  }
  q $$$OK
}

/// d ##class(dc.TSOrder).Fill()
ClassMethod Fill(= {30e6})
{
  ..%KillExtent(), ..Populate(N), $system.SQL.TuneTable($classname(),$$$YES), $system.OBJ.Compile($classname(),"cu-d")
}

}

Results from SMP:

select distinct null from dc.TSOrder where TS between {ts '2016-07-01 00:00:00.00000'} AND {ts '2016-07-01 23:59:59.999999'}

Performance: 2.339 seconds  2109755 global references 14768692 lines executed
(the number of selected records is 968476, used the normal index)

select distinct null from dc.TSOrder where for some %element(TS) (%key='date' and %value '2016-07-01')

Performance: 2.269 seconds  1936962 global references 15496098 lines executed
(the number of selected records is 968476, used the "smart" index)

select distinct null from dc.TSOrder where for some %element(TS) (%key='ddhh' and %value '01-13')

Performance: 0.096 seconds  80488 global references 644270 lines executed
(the number of selected records is 40239, used the "smart" index)

Maybe

Property TS As List Of %TimeStamp;

?

UPD. Nevermind, got it.