2021.2 SQL Feature Spotlight - Run Time Plan Choice
The 2021.2 release of the InterSystems IRIS Data Platform includes many exciting new features for fast, flexible and secure development of your mission-critical applications. Embedded Python definitely takes the limelight (and for good reason!), but in SQL we've also made a massive step forward towards a more adaptive engine that gathers detailed statistical information about your table data and exploits it to deliver the best query plans. In this brief series of articles, we'll take a closer at three elements that are new in 2021.2 and work together towards this goal, starting with Run Time Plan Choice.
It's hard to figure out the right order to talk about these (you can't imagine how often I've reshuffled them in writing this article!) because they fit together in such a nice way. As such, feel free to go on a limb and read these in random order .
About IRIS query processing
When you submit a statement to the IRIS SQL engine, it will parse the statement into a normalized form, substituting out all literals (query parameters) and then look at your table structure, indices and statistics about field values to figure out the most efficient execution strategy for the normalized query. This enables the engine to reuse the same plan and generated code when you wish to execute the query again, possibly using different query parameter values. For example, take the following query:
SELECT * FROM Customer WHERE CountryCode = 'US' AND ChannelCode = 'DIRECT'
This query will be normalized to a form like this:
SELECT * FROM Customer WHERE CountryCode = ? AND ChannelCode = ?
so that subsequent invocations for different combinations of country and channel can immediately pick up the same cached query class, saving the compute-heavy planning work. Let's assume we sell running shoes through half a dozen channels and sell all around the world; in other words the data is evenly distributed across the possible values for CountryCode and ChannelCode. When we have a regular index on both these fields, the most efficient plan for this normalized query will start with the most selective condition (CountryCode), using the corresponding index to access matching rows from the master map and then check the other condition (ChannelCode) for every row in the master map.
Now, say we are not a sports gear manufacturer, but a specialist vendor of moulds for Belgian chocolates (where'd I get that from? ). In this case, let's assume the majority of my customers (say 60%) are in Belgium, meaning 'BE' becomes an outlier value for the CountryCode field, representing a large number of the rows in my table. Suddenly that index on CountryCode gets a different use: if the query parameter for CountryCode we get at runtime is 'BE', using that index first would mean I'd be reading the majority of my master map, so starting with the index on ChannelCode would be the better thing to do. However, if the query parameter value for CountryCode is another value, that'd make the index on CountryCode much more worthwhile again, as all other countries divide up that remaining 40% non-Belgian customers.
This is an example where you'd want to choose a different plan at runtime; or, reshuffling those words: Run Time Plan Choice. RTPC is a mechanism that adds a small hook into the classic literal substitution and cached query lookup logic to spot for outliers such as the 'BE' value for our CountryCode column. If you're interested in a more detailed overview of IRIS SQL query processing, please check this VS2020 video.
IRIS SQL supported a very crude opt-in version of this in the past, but 2021.2 introduces an all-new RTPC infrastructure that is much more lightweight and able to kick in for a broader variety of predicate conditions. Having established the overhead of this runtime check is indeed minimal, we decided to switch it on by default, so there's nothing for you to do in order to benefit from it (other than unfreezing your query plans after upgrade as usual).
We've often seen how common outliers are in real-world datasets (and how rare a strictly uniform distribution is) and tests on a partner benchmark have shown a spectacular improvement in performance and I/O, as you can see in the chart below. We also included the benchmark results for 2020.1, just so you could also appreciate our continuous efforts (and results!) to improve performance release over release.
Mileage will vary, depending on the amount of outliers in your dataset and availability of indices, but we're pretty excited about the potential of this change and are very curious to hear your experiences.