When to use Columnar Storage

Article

Benjamin De Boe · Feb 13, 2023 4m read

#Analytics #Columnar Storage #SQL #Vector Search #InterSystems IRIS

With InterSystems IRIS 2022.2, we introduced Columnar Storage as a new option for persisting your IRIS SQL tables that can boost your analytical queries by an order of magnitude. The capability is marked as experimental in 2022.2 and 2022.3, but will "graduate" to a fully supported production capability in the upcoming 2023.1 release.

The product documentation and this introductory video, already describe the differences between row storage, still the default on IRIS and used throughout our customer base, and columnar table storage and provide high-level guidance on choosing the appropriate storage layout for your use case. In this article, we'll elaborate on this subject and share some recommendations based on industry-practice modelling principles, internal testing, and feedback from Early Access Program participants.

Generally, our guidance on choosing an appropriate table layout for your IRIS SQL schema is as follows:

If you’re deploying an application that leverages IRIS SQL or Objects, such as an EHR, ERP or transaction processing application, there is no need to change its current row storage layout to a columnar one. Most SQL queries issued for end user applications or programmatic transactions only retrieve or update a limited number of rows, and result rows usually correspond to table rows, with very limited use of aggregate functions. In such cases, the benefits offered by columnar storage and vectorized query processing don’t apply.
If such an application also embeds operational analytics, consider adding columnar indices if the corresponding analytical queries’ current performance is not satisfactory. This includes, for example, dashboards showing the current inventory or basic financial reporting on live data. Look for numeric fields used in aggregations (e.g. quantities, currencies) or high-cardinality fields used in range conditions (e.g. timestamps). A good indicator for such opportunities is current use of bitmap indices to speed up the filtering of large numbers of rows, usually on low-cardinality fields (e.g. categorical or ordinal fields). There is no need to replace these bitmap indices; the additional columnar indices work well in conjunction with them and are meant to avoid excessive reads from the master map or regular index maps (single gref per row).
If your IRIS SQL tables contain less than a million rows, there is no need to consider columnar storage. We prefer not to pin ourselves to specific numbers, but the benefits of vectorized query processing are unlikely to make a difference in these low ranges.
If you’re deploying an IRIS SQL schema for Data Warehouse, Business Intelligence, or similar analytical use cases, consider changing it to default to columnar storage. Star schemas, snowflake schemas or other denormalized table structures as well as broad use of bitmap indices and batch ingestion are good indicators for these use cases. Analytical queries that will benefit most from columnar storage are those that scan large numbers of rows and aggregate values across them. When defining a “columnar table”, IRIS will transparently resort to a row layout for columns in that table that aren’t a good fit for columnar storage, such as streams, long strings or serial fields. IRIS SQL fully supports such mixed table layouts and will use vectorized query processing for eligible parts of the query plan. The added value of bitmap indices on columnar tables is limited, so they can be left out.

Mileage will vary based on both environmental and data-related parameters. Therefore, we highly recommend customers test the different layouts in a representative setup. Columnar indices are easy to add to a regular row-organized table and will quickly yield a realistic perspective on query performance benefits. This, along with the flexibility of mixed table layouts, is a key differentiator of InterSystems IRIS that helps customers achieve an order-of-magnitude performance improvement.

We intend to make these recommendations more concrete as we get more real-world experience on the full production release. Obviously, we can provide more concrete advice based on customers’ actual schema and workload through the Early Access Program and POC engagements, and look forward to feedback from customers and community members. Columnar Storage is part of the InterSystems IRIS Advanced Server license and also enabled in the Community Edition of InterSystems IRIS and IRIS for Health. For a fully scripted demo environment, please refer to this GitHub repository.

Benjamin De Boe · Feb 17, 2023

Hi @Pravin Barton , that's actually a great question and I was wondering whether or not to add a note about that in the article, but then thought it might distract from the SQL focus.

IRIS BI indeed pre-computes aggregates and builds an array of bitmap indices for lightning-fast filtering and partitioning of MDX query results. MDX offers a syntax that lends itself well for analytics, with the structure of a query natively thinking in terms of aggregates rather than rows, so it's targeting the same kinds of analytical questions as Columnar Storage. IRIS BI comes with a UI surface as well, with Analyzer for ad-hoc querying (you could say it's a visual MDX query builder) and a dashboard component that is meant to be easily embeddable in your applications. For SQL, there are plenty of third-party BI tools offering eye candy of a sweetness your dentist couldn't possibly approve.

I'd say the main difference between the two, ignoring that UI aspect for a second, is that Columnar Storage applies more upstream, acting as the storage model for your master copy of the data, where IRIS BI is by definition an analytics-optimized copy of that data. If your database exists for the purpose of analytics (e.g. a data warehouse), you could as well have the master copy use columnar storage outright and enjoy the flexibility of SQL on the query / UI side. If you're doing analytics on operational data, the application schema may be a little more normalized and not a great fit for Columnar Storage or direct querying, which is where IRIS BI offers an easy way to create an analytical model with measures and dimensions, and populate that automatically using cube build / sync (those are just buttons, not ETL flows you have to code!).

Two more points that'll hopefully not add too much confusion:

We'll be looking into ways how IRIS BI can take advantage of some of the foundational elements of Columnar Storage, namely the new $vector internal datatype we introduced. We expect it'll be a great fit for your cube measures as it'll work well with the existing bitmaps we build for dimensions. The R&D for that is planned later this year, after we've released production readiness for the SQL side in 2023.1, and after that we'll happily brag about the gains (and document tradeoffs) here :-).

Columnar Indices are secondary storage for table data where the primary storage uses row format. That makes them very similar in purpose to IRIS BI in the way I described it earlier, so when to use what?

First there is the general schema layout aspect: if your application schema is in full third normal form, the columnar index' benefits may not outweigh the cost of a five-way JOIN to get the relevant data together. The IRIS BI cube infrastructure is better suited to deal with that.
A table's indices are maintained with every INSERT, UPDATE or DELETE statement, which means the overhead to build your analytical-query-accelerator now needs to be paid when you're filing rows in the table, rather than when you press the build / sync button on your cube (which could be in a trigger too, of course).
But probably the most important question is how you'd like to query it. MDX can look a bit daunting at first, but the IRIS BI infrastructure and embeddability has been used successfully by many of our customers. SQL, on the other hand, is more widely understood by data analysts and their client BI tools

In any case, the proof of the pudding is in the eating. Adding a columnar index and giving your queries a quick try is really simple, and your InterSystems account team (in your case we're one another's account team ;-) ) is happy to help with that.

3 0