SQL query optimization in IRIS

Hello to all the esteemed members of the InterSystems developer community in Spanish!
Although I often consult the developer community and have occasionally left a question here, this will be my first article, and what better occasion to do so than by participating in the 3rd technical article contest.
The topic I will cover in this article is the different tools that InterSystems IRIS provides us to optimize both the databases and the queries we run, as well as the new types of storage and data.
Optimizing our SQL queries
For this article, I have set up a small example in an InterSystems IRIS Community instance, importing a CSV file with 5 million records of hypothetical sales made over the years in different cities.
The table structure will be very simple, but sufficient for the tests we need. Please excuse the names, but you know that nothing is more permanent than a temporary solution.
Concurso.Ventas
| SumaVentasMensuales | CodigoCiudad | Anyo | Mes |
| --- | --- | --- | --- |
| 454323 | 1 | 1995 | 3 |
| ... | ... | ... | ... |
Concurso.Localidad
| CodigoCiudad | Ciudad | CodigoComunidad | Comunidad |
| --- | --- | --- | --- |
| 1 | Valladolid | 1 | Castilla y León |
| ... | ... | ... | ... |
We will try to extract the historical average sales for a city like Valladolid in a given year:
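For reference, here is a sketch of what that query might look like, assuming the table and column names shown above (the exact query in the screenshot may differ slightly):

```sql
-- Hypothetical reconstruction of the query: average monthly sales
-- for Valladolid in a given year, joining the two tables
SELECT AVG(V.SumaVentasMensuales)
FROM Concurso.Ventas V
JOIN Concurso.Localidad L ON V.CodigoCiudad = L.CodigoCiudad
WHERE L.Ciudad = 'Valladolid'
  AND V.Anyo = 1996
```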
Let's see how long our query takes to return the average monthly sales:

The result was a little over 8 seconds to return the average monthly sales value. As you can see, we have an option that allows us to see the query execution plan (Show plan). Let's go step by step through what our query will do:
Let's look at module H, which contains some quite interesting information:
• Divide master map Concurso.Localidad(L).IDKEY into subranges of IDs.
• Call module A in parallel on each subrange, piping results into temp-file D.
These two lines tell us that we will first divide the Concurso.Localidad table by ID ranges and call module A in parallel for each range obtained. This parallel computation, implemented by default in IRIS, will speed up our query without needing to specify it.
As you can see, our query is quite expensive. Why is that? It is traversing the entire Concurso.Ventas table with its 5 million entries and, for each row, applying the condition that the city is "Valladolid" and the year is a specific value. As you can imagine, this isn't the most efficient approach. How could we improve performance?
Adding indexes
As you may know, indexing a column improves the performance of queries that filter on that column, so let's try adding an index on the Anyo column of our Concurso.Ventas table:
We created our index as a BITMAP because the column contains a fairly small set of distinct values, so it will perform better than a standard index for our queries (you can read the associated documentation here).
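For reference, the index can be created via DDL; a sketch (the index and column names here are assumptions based on the article, which later shows the map as AñoIdx in the Storage section):

```sql
-- DDL sketch: a bitmap index on the year column
CREATE BITMAP INDEX AnyoIdx ON Concurso.Ventas (Anyo)
```

Equivalently, in the class definition: `Index AnyoIdx On Anyo [ Type = bitmap ];`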
With the index in place, let's run the query again and see the result:
As we can see, our performance has improved. Let's take a look at the query plan with the index included:
As you can see, in module B we no longer read the entire Concurso.Ventas table with its 5 million records, but instead, using the index, we first extract all those values with the Year 1996, thus excluding the non-matching ones from the reading.
A good improvement, isn't it? But let's not stop here, we can improve even more with the...
Query optimizer
InterSystems IRIS provides a default query optimizer that uses certain statistics to define the query plan; it is not the same to run a query directly against the table with 5 million sales records as it is to query the location table first and then the sales table, excluding the records that do not belong to the requested location.
The simplest way to optimize tables is by running a TUNE TABLE command on the table you want to optimize. This command will set the following parameters in your table's class definition:
- ExtentSize: the number of rows in the table. We can also set it manually to any value we like, as long as it keeps a realistic proportion relative to the other tables.
- Selectivity: the percentage of the total that each distinct value in the column represents. If a column has only 2 distinct values, its Selectivity will be 50%; with 64 distinct values, as in the case of the locations, it will be 1.56% (assuming no repeated values).
- BlockCount: the estimated number of map blocks that will be used for each SQL map, based on the ExtentSize.
Let's take a look at the Storage section of the Concurso.Ventas class before applying the tuning to the table:
Storage Default
{
<Data name="VentasDefaultData">
<Value name="1">
<Value>%%CLASSNAME</Value>
</Value>
<Value name="2">
<Value>SumaVentasMensuales</Value>
</Value>
<Value name="3">
<Value>%Source</Value>
</Value>
<Value name="4">
<Value>Mes</Value>
</Value>
<Value name="5">
<Value>Año</Value>
</Value>
<Value name="6">
<Value>CodigoCiudad</Value>
</Value>
</Data>
<DataLocation>^Concurso.VentasD</DataLocation>
<DefaultData>VentasDefaultData</DefaultData>
<ExtentSize>2000000</ExtentSize>
<IdLocation>^Concurso.VentasD</IdLocation>
<IndexLocation>^Concurso.VentasI</IndexLocation>
<SQLMap name="AñoIdx">
<BlockCount>-10634</BlockCount>
</SQLMap>
<StreamLocation>^Concurso.VentasS</StreamLocation>
<Type>%Storage.Persistent</Type>
}As you can see, it has a series of default values that won't help improve our queries. Let's run the TUNE TABLE command for both tables and see what consequences it has on both our class and our query.
Let's take a look at our Storage section:
Storage Default
{
...
<ExtentSize>6468826</ExtentSize>
<Property name="Año">
<AverageFieldSize>4</AverageFieldSize>
<Histogram>...</Histogram>
<Selectivity>2.9412%</Selectivity>
</Property>
<Property name="CodigoCiudad">
<AverageFieldSize>6.7</AverageFieldSize>
<Histogram>...</Histogram>
<Selectivity>3.3333%</Selectivity>
</Property>
<Property name="Mes">
<AverageFieldSize>3</AverageFieldSize>
<Histogram>...</Histogram>
<Selectivity>8.3333%</Selectivity>
</Property>
<Property name="SumaVentasMensuales">
<AverageFieldSize>5</AverageFieldSize>
<Histogram>...</Histogram>
<OutlierSelectivity>.000393:794574</OutlierSelectivity>
<Selectivity>0.0002%</Selectivity>
</Property>
<SQLMap name="$Ventas">
<BlockCount>-48</BlockCount>
</SQLMap>
...
}

Now that both Concurso.Localidad and Concurso.Ventas are correctly configured, let's see the impact on the query execution plan:
The improvement is 31% compared to the previous query, and its cost is less than half that of the original query. Let's see how long our query takes now:
That's right: we've gone from 8 seconds initially, to 4.6 seconds after adding the appropriate indexes to our tables, to just under 1 second by applying TUNE TABLE.
You can find more possible optimizations for your queries on this page, such as %FIRSTTABLE, which forces the optimizer to read the selected table first and can be useful for limiting reads on large tables filtered by values from the selected one.
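As a rough illustration of how the hint is used (check the linked documentation for the exact syntax on your IRIS version), %FIRSTTABLE goes in the FROM clause before the table you want read first:

```sql
-- Sketch: ask the optimizer to start with the small location table
SELECT AVG(V.SumaVentasMensuales)
FROM %FIRSTTABLE Concurso.Localidad L, Concurso.Ventas V
WHERE V.CodigoCiudad = L.CodigoCiudad
  AND L.Ciudad = 'Valladolid'
  AND V.Anyo = 1996
```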
Columnar Storage
InterSystems IRIS not only provides the tools above to get the most out of our queries; it also offers columnar storage, intended for cases like the example in this article: large volumes of data (over a million records) on which we want to run aggregation operations such as SUM or AVG.
This columnar storage can be defined at the table level or at the column level, whichever is more convenient. To better understand what columnar storage is, I'll use an image from the ISC documentation.
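As a sketch of the two options in DDL (the table and column definitions here are illustrative, based on this article's schema):

```sql
-- Option 1: the entire table uses columnar storage
CREATE TABLE Concurso.VentasColumnar (
    CodigoCiudad INTEGER,
    Anyo INTEGER,
    Mes INTEGER,
    SumaVentasMensuales INTEGER
) WITH STORAGETYPE = COLUMNAR

-- Option 2: row storage for the table, columnar for a single column
CREATE TABLE Concurso.VentasMixta (
    CodigoCiudad INTEGER,
    Anyo INTEGER,
    Mes INTEGER,
    SumaVentasMensuales INTEGER WITH STORAGETYPE = COLUMNAR
)
```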
Given that explanation, you can see why, for our example, it may be worthwhile to define the monthly sales column with columnar storage. To see the differences, I have recreated the tables from scratch, without indexes and without running TUNE TABLE.
The only change needed in Concurso.Ventas for our SumaVentasMensuales column to use columnar storage will be to define the following property:
Property SumaVentasMensuales As %Integer(STORAGEDEFAULT = "columnar");

Let's now check the query plan.
The cost of our query with columnar storage is 65,487,882, while with row-based storage it was 100,529,682—a substantial improvement simply by changing the storage type. Let's see how long the query takes to execute:
A little over 1 second, compared to the more than 8 seconds it took with traditional row-based storage.
Closing remarks and farewell
This concludes our article on SQL query optimization with InterSystems IRIS, which I hope you found useful.