Jul 19, 2016 · 2m read

Improve SQL Performance for Date Range Queries

Date range queries going too slow for you?  SQL Performance got you down?  I have one weird trick that might just help you out! (SQL Developers hate this!)*

If you have a class that records a timestamp when each row is inserted, then that data will be in sequence with your IDKEY values - that is, TimeStamp1 < TimeStamp2 if and only if ID1 < ID2 for all IDs and TimeStamp values in the table - and you can use this knowledge to increase performance for queries against TimeStamp ranges.  Consider the following table:

Class User.TSOrder Extends %Persistent
{

Property TS As %TimeStamp;

Property Data As %String(MAXLEN = 200, MINLEN = 100);

Index TSIdx On TS;

Index Extent [ Extent, Type = bitmap ];

}


Populating this with 30,000,000 random rows with dates over the last 30 days will give you 1,000,000 rows per day.  Now, if we want to query the information for a given day, we might write the following:

     SELECT *
     FROM TSOrder
     WHERE TS >= '2016-07-01 00:00:00.000000' AND
           TS <= '2016-07-01 23:59:59.999999'

A reasonable query, to be sure.  On my system, however, this took 2,172,792 global references and 7.2 seconds.  But, knowing that the IDs and TimeStamps are in the same order, we can use the TimeStamps to get an ID range.  Consider the following query:

     SELECT *
     FROM TSOrder
     WHERE ID >= (SELECT TOP 1 ID FROM TSOrder WHERE TS >= '2016-07-01 00:00:00.000000' ORDER BY TS ASC) AND
           ID <= (SELECT TOP 1 ID FROM TSOrder WHERE TS <= '2016-07-01 23:59:59.999999' ORDER BY TS DESC)

The new query completes in 5.1 seconds and takes only 999,985 global references**!  
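The rewrite can be sketched outside of Caché with a stand-in table (Python's sqlite3, illustrative only: `LIMIT 1` replaces `TOP 1`, and the `INTEGER PRIMARY KEY` plays the role of the IDKEY): the two subqueries pin down the boundary IDs, and the outer query walks that ID range instead of scanning the timestamp range.

```python
import sqlite3

# Hypothetical stand-in for the Caché table: rows are inserted in timestamp
# order, so the "TS order == ID order" assumption holds by construction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TSOrder (ID INTEGER PRIMARY KEY, TS TEXT, Data TEXT)")
rows = [(f"2016-07-{day:02d} 12:00:{n:02d}.000000", f"row-{day}-{n}")
        for day in range(1, 4) for n in range(3)]
conn.executemany("INSERT INTO TSOrder (TS, Data) VALUES (?, ?)", rows)

day_start = "2016-07-02 00:00:00.000000"
day_end = "2016-07-02 23:59:59.999999"

# Baseline: a range condition on the timestamp column.
baseline = conn.execute(
    "SELECT ID FROM TSOrder WHERE TS >= ? AND TS <= ?",
    (day_start, day_end)).fetchall()

# ID-range rewrite: two pinpoint subqueries find the boundary IDs, and the
# outer query then reads the contiguous ID range directly.
rewritten = conn.execute("""
    SELECT ID FROM TSOrder
    WHERE ID >= (SELECT ID FROM TSOrder WHERE TS >= ? ORDER BY TS ASC  LIMIT 1)
      AND ID <= (SELECT ID FROM TSOrder WHERE TS <= ? ORDER BY TS DESC LIMIT 1)
""", (day_start, day_end)).fetchall()

assert baseline == rewritten  # same rows, cheaper access path
```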

This technique can be applied even more profitably to tables with more indexed fields and queries with multiple WHERE conditions.  The ID range generated from the subqueries can be put into bitmap format, generating blazing speed when you get a multi-index solution.  The Ens.MessageHeader table is a great example of where you can put this trick to work.

Let's be clear - this is an EXAMPLE of a win.  If you have many conditions in the WHERE clause in the same table (and they are indexed, duh!), then this technique can give you BIGGER wins!  Try it out on your queries!  

* SQL Developers don't really hate this, but if the internet has taught us anything, it's that catchy blurbs get more traffic.

** When testing queries that return this many rows, the SMP cannot handle it, and most of the time is spent displaying the data.  The proper way to test is with embedded or dynamic SQL: run through the results without outputting them for the timing, and use the SQL Shell for your global counts.  You could also use SQL Stats for that.

Discussion

The premise is that when you have a timestamp property that’s set at the time of row insert, there will be a guarantee of those timestamps being “in order” with respect to Row IDs.

At first glance that sounds reasonable, and is probably almost always true, but I’m not sure it’s guaranteed.  Isn’t there a race condition around saving the timestamp and generating the new row ID?

That is, couldn’t you have a flow like this:

Process 1, Step 1:  Call %Save.  In %OnSave:  set ..CreatedAt = $zts  (let’s say this gives me 2018-06-01 16:00:00.000)

Process 2, Step 1:  Call %Save.  In %OnSave:  set ..CreatedAt = $zts   (let’s say this gives me 2018-06-01 16:00:00.010)  << +10ms

Process 2, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 1)

Process 1, Step 2: Generate new Row ID using $increment, and complete %Save (let’s say this gives me RowID = 2)

Is that likely?  Definitely not, but I don't think it's impossible.

Actually, it might be fairly likely in an ECP environment where multiple ECP Clients are inserting data into the same table, one reason being that system clocks could be out of sync by a few milliseconds.

Does that make sense, or am I missing something?  For example, would this all be okay unless I did something dumb with Concurrency?  If so, would that still be the case in an ECP environment?
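The interleaving above can be replayed deterministically (plain Python, hypothetical names, no Caché semantics implied): because timestamping and ID generation are two separate steps, a second process can slip in between them and claim the lower ID with the later timestamp.

```python
# Deterministic replay of the race described above.
next_id = 0

def claim_id():
    """Stand-in for $increment on the table's ID counter."""
    global next_id
    next_id += 1
    return next_id

# Step 1 for both processes: read the clock (process 2 reads it 10 ms later).
p1_ts = "2018-06-01 16:00:00.000"
p2_ts = "2018-06-01 16:00:00.010"

# Step 2 happens in the opposite order: process 2 claims its ID first.
p2_id = claim_id()   # RowID = 1
p1_id = claim_id()   # RowID = 2

rows = sorted([(p1_id, p1_ts), (p2_id, p2_ts)])
# IDs ascend, but the timestamps do not: ID 1 carries the LATER timestamp.
assert rows[0][1] > rows[1][1]  # ordering assumption violated
```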

Here is my solution, which has a number of advantages:

  1. there are no restrictions on the order of IDs
  2. you can query against the entire timestamp or just parts of it
  3. only standard, reliable, and proven Caché features are used
  4. it works very, very fast

See my article for details: Indexing of non-atomic attributes

Class dc.TSOrder Extends (%Persistent, %Populate)
{

Index Extent [ Extent, Type = bitmap ];

Index iSmartTS On (TS(KEYS), TS(ELEMENTS));

Index iTS On TS;

Property TS As %TimeStamp(MAXVAL = "2016-07-31 23:59:59.999999", MINVAL = "2016-07-01 00:00:00.000000");

Property Data As %String(MAXLEN = 200, MINLEN = 100);

ClassMethod TSBuildValueArray(
  value = "",
  ByRef array) As %Status
{
  set array("date") = $p(value, " ", 1)
  set array("time") = $p(value, " ", 2)
  quit $$$OK
}

/// d ##class(dc.TSOrder).Fill()
ClassMethod Fill(N = {30e6})
{
  do ..%KillExtent()
  do ..Populate(N)
  do $system.SQL.TuneTable($classname(), $$$YES)
  do $system.OBJ.Compile($classname(), "cu-d")
}

}
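The idea behind TSBuildValueArray can be sketched outside of Caché (plain Python, illustrative only): each timestamp is decomposed into named parts, and an index over the resulting (key, value) pairs lets a query match on any single part, which is what the "smart" iSmartTS index does.

```python
from collections import defaultdict

def build_value_array(ts):
    """Mimics TSBuildValueArray: split a timestamp into named parts.
    (The 'date'/'time' keys mirror the class above; extra keys such as
    the 'ddhh' seen in the queries below would be built the same way.)"""
    date, time = ts.split(" ")
    return {"date": date, "time": time}

# A toy (key, value) -> row-id index, standing in for iSmartTS.
index = defaultdict(set)
table = {1: "2016-07-01 13:05:00",
         2: "2016-07-01 22:10:00",
         3: "2016-07-02 13:30:00"}
for row_id, ts in table.items():
    for key, value in build_value_array(ts).items():
        index[(key, value)].add(row_id)

# Analogue of: WHERE FOR SOME %ELEMENT(TS) (%KEY = 'date' AND %VALUE = '2016-07-01')
assert index[("date", "2016-07-01")] == {1, 2}
```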


Results from SMP:

select distinct null from dc.TSOrder where TS between {ts '2016-07-01 00:00:00.00000'} AND {ts '2016-07-01 23:59:59.999999'}

Performance: 2.339 seconds  2109755 global references 14768692 lines executed
(the number of selected records is 968476, used the normal index)

select distinct null from dc.TSOrder where for some %element(TS) (%key='date' and %value = '2016-07-01')

Performance: 2.269 seconds  1936962 global references 15496098 lines executed
(the number of selected records is 968476, used the "smart" index)

select distinct null from dc.TSOrder where for some %element(TS) (%key='ddhh' and %value = '01-13')

Performance: 0.096 seconds  80488 global references 644270 lines executed
(the number of selected records is 40239, used the "smart" index)

The purpose of the `order by` in the subqueries is to ensure you get the right ID back - otherwise you'll get any ID that matches the criteria.  Alternatively, you could change the query to use `MIN(ID)` / `MAX(ID)` instead of `TOP 1 ID`; then you can drop the `order by` clause (that may be faster too for older SQL versions... looking at the current version, it seems SQL generates the same query plan for both, so the engine's spotted this optimisation internally).

There are risks with this approach though - say you introduce an additional filter: you'd need to include it in both your main query and your subqueries to ensure things match up.

Additionally, you rely on the ID and Date always being in sync... Sometimes even if that should be true, it's not (e.g. maybe someone's imported data from another system, keeping the dates from the original system; but without updating all of the IDs, because they're just IDs and should have no functional meaning)...  Now your assumption's wrong, and you've got a really hard to trace bug for anyone unaware of this trick.

A cleaner approach may be to create an index over your date column (with additional columns included if they're also used in the query). The additional overhead of an index will generally be pretty minimal, especially if it's only on the date field, since new rows will always be added at the end.  The benefit will be significant when you're doing simple `>=`, `<=` or `between` type operations.  You may even find it performs better: whilst the ID is typically the clustered index (and so lives in the same file as the table's data), this custom index can be put on separate storage (i.e. a separate file to the table's data), so it may have less contention / better performance.
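The effect of adding that index can be observed with a stand-in table (Python's sqlite3, illustrative only; index and column names are made up): before the index, the range predicate forces a full scan, and afterwards the planner switches to an index search, with no reliance on IDs tracking timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, ts TEXT, data TEXT)")
conn.execute("INSERT INTO orders (ts, data) VALUES ('2016-07-01 10:00:00', 'x')")

query = "SELECT id FROM orders WHERE ts BETWEEN ? AND ?"
args = ("2016-07-01 00:00:00", "2016-07-01 23:59:59")

# Without the index: the plan is a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, args).fetchall()

# With a plain index on the date column, the same range predicate
# becomes an index search.
conn.execute("CREATE INDEX idx_orders_ts ON orders (ts)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, args).fetchall()

assert "SCAN" in plan_before[0][-1]          # full scan of orders
assert "idx_orders_ts" in plan_after[0][-1]  # range search via the index
```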