Great article Chad!

FWIW, we're working on a faster version of ^%GSIZE (with an API more fitting the current century ;-) ) that uses stochastic sampling similar to the faster table stats gathering introduced in 2021.2? I'll also take the opportunity for a shameless plug of my SQL utilities package, which incorporates much of what's described in this article and will take advantage of that faster global size estimator as soon as it's released.

Hi @Lorenzo Scalese , you're spot on with those caveats about stats currently being part of the code. I wrote this article on that subject a while ago, and you'll see at the end that we're planning to move these stats into the extent to live where they belong: with the data rather than the code. 

Just one note: while AutoTune tries to make sure you have at least some stats before planning a first query to a new table (when fast block sampling is available), nothing prevents you from gathering table stats again after your table is fully loaded, after a big update, in a post-install script, ...

Hi @Pravin Barton , that's actually a great question and I was wondering whether or not to add a note about that in the article, but then thought it might distract from the SQL focus.

IRIS BI indeed pre-computes aggregates and builds an array of bitmap indices for lightning-fast filtering and partitioning of MDX query results. MDX offers a syntax that lends itself well for analytics, with the structure of a query natively thinking in terms of aggregates rather than rows, so it's targeting the same kinds of analytical questions as Columnar Storage. IRIS BI comes with a UI surface as well, with Analyzer for ad-hoc querying (you could say it's a visual MDX query builder) and a dashboard component that is meant to be easily embeddable in your applications. For SQL, there are plenty of third-party BI tools offering eye candy of a sweetness your dentist couldn't possibly approve.

I'd say the main difference between the two, ignoring that UI aspect for a second, is that Columnar Storage applies more upstream, acting as the storage model for your master copy of the data, where IRIS BI is by definition an analytics-optimized copy of that data. If your database exists for the purpose of analytics (e.g. a data warehouse), you could as well have the master copy use columnar storage outright and enjoy the flexibility of SQL on the query / UI side. If you're doing analytics on operational data, the application schema may be a little more normalized and not a great fit for Columnar Storage or direct querying, which is where IRIS BI offers an easy way to create an analytical model with measures and dimensions, and populate that automatically using cube build / sync (those are just buttons, not ETL flows you have to code!).

Two more points that'll hopefully not add too much confusion:

We'll be looking into ways how IRIS BI can take advantage of some of the foundational elements of Columnar Storage, namely the new $vector internal datatype we introduced. We expect it'll be a great fit for your cube measures as it'll work well with the existing bitmaps we build for dimensions. The R&D for that is planned later this year, after we've released production readiness for the SQL side in 2023.1, and after that we'll happily brag about the gains (and document tradeoffs) here :-).

Columnar Indices are secondary storage for table data where the primary storage uses row format. That makes them very similar in purpose to IRIS BI in the way I described it earlier, so when to use what?

  • First there is the general schema layout aspect: if your application schema is in full third normal form, the columnar index' benefits may not outweigh the cost of a five-way JOIN to get the relevant data together. The IRIS BI cube infrastructure is better suited to deal with that.
  • A table's indices are maintained with every INSERT, UPDATE or DELETE statement, which means the overhead to build your analytical-query-accelerator now needs to be paid when you're filing rows in the table, rather than when you press the build / sync button on your cube (which could be in a trigger too, of course). 
  • But probably the most important question is how you'd like to query it. MDX can look a bit daunting at first, but the IRIS BI infrastructure and embeddability has been used successfully by many of our customers. SQL, on the other hand, is more widely understood by data analysts and their client BI tools

In any case, the proof of the pudding is in the eating. Adding a columnar index and giving your queries a quick try is really simple, and your InterSystems account team (in your case we're one another's account team ;-) ) is happy to help with that.

%SYS.Audit already projects to each namespace, but the data is in a privileged database and requires %Admin_Secure:USE privileges, so better not try to circumvent / confuse that through mappings.

I would not create additional temp tables or mappings to try and cling to SQL as that is meant to run in a single namespace as a query language. This sounds more like a procedural thing for ObjectScript, so a stored procedure that loops through the Production namespaces, retrieves all config items in a local variable and then uses that for querying the audit table after switching to %SYS sounds more appropriate. Rather than a PPG, I'd just collect it in the qHandle local variable passed between Execute() and Fetch() methods for your custom query, unless it risks getting too big

Can you provide a little more context on what your subqueries are trying to accomplish? Conceptually, it feels a little unnatural to force PID awareness into SQL (sub)queries, so maybe this is more an application scenario asking for more direct interaction with the Work Queue Manager. 

In 2022.2, we introduced an Application Metadata Stack that SQL (and in the future other subsystems or application code) uses to stash a little context in the process metadata in order to build things like the SQL Process View. Amongst other things, that metadata helps us tie subquery processes to their parent. Right now we have no public API to interpret this rather technical metadata directly from the stack, but maybe you can convince us there's a use case :-)

Nice visionary article!

Luckily we didn't rush this 13 years ago ;-) as the model described in there would offer some benefits, but not get to the order of magnitude performance improvement we're bringing with this new IRIS feature. Organizing data in separate nodes as proposed there may help somewhat to limit the size of each read (read the column value rather than the full row's $list), but you're still doing NxM small grefs to read M columns for N rows. When you're talking billions of rows, that sheer volume of grefs is the bottleneck we saw preventing us from doing really well on analytics queries. 

With Columnar Storage, the new $vector datatype we introduced that encodes 64k values into a single global node offers much better economies on today's hardware, and also enable SIMD instructions on those, which you couldn't get by keeping values in separate variables. Of course introducing all-new datatypes (and especially internal ones like $vector) isn't something you could do using ObjectScript in user code, but I promise we'll offer Key-Value before IRIS 2036.1! :-)

For the record: it's not the Abstract-ness that leads to the index being ignored, it's the fact that the index is not defined in the primary superclass of Test.NewClass1. From secondary superclasses, only parameters, properties and methods are inherited, but not indices. So you can keep Test.NewClass Abstract, but make it inherit from %Persistent itself and then have Test.NewClass1 point to it as its primary superclass. I'll ask the doc team to make this a little clearer, as it's not obvious from this section of the doc nor the one on indices right now.

This said, @Robert Cemper 's suggested use of NoExtent is likely what you wanted to achieve.

Yes, it always behaved like this. Currently, table stats are stored as part of the class definition, based on the idea that they are often kind of static and belong with the application. That holds for certain types of applications, but in many cases it is a little too static and this is especially true for the Message Bank example here, as TuneTable cannot update class definitions for a read-only system class.

We have a project coming up to move the table stats to live with the data rather than keep them in the class definition and hope to bring that to an IRIS release in the course of 2023. In the meantime, the workaround suggested above, marking ENSLIB as writable, running TT and then marking it read-only again is reasonable (though quite a step from an official recommendation). Note that after upgrades, your ENSLIB will be overwritten and your stats will have been lost.

FYI, the last part of this article also touches on this subject

In the case of JDBC, there's also the getGeneratedKeys() method that you may find useful.

We're looking into support for a RETURNING clause for the INSERT command as a possible enhancement (Postgres' flavour of this nonstandard clause is very nice), but don't hold your breath as we have a long list of such candidate enhancements and a much shorter list of people who can work on the surprising amount of places in our codebase that would need to be updated to accommodate such a syntax extension.

There is apparently an internal-design-level distinction between UDL as an editor format and XML as an export format. We appear to have closed the case without clarifying the documentation on this point (apologies for this oversight), but I've reopened the case so we might review and reconsider the current behaviour more broadly. There's a number of additional flags like these, so it's possibly a larger change.

@Vitaliy Serdtsev is right in pointing to the resultset's metadata. That's where IRIS SQL registers column datatypes and he already pointed out that this is obviously not impacted by the presence of an ORDER BY. 

The resultset object returns the SQL-side value to ObjectScript, but on that ObjectScript side, the datatype no longer matters as it isn't strongly typed and still 1="1". Therefore, I don't think this constitutes an error.

FWIW, the reason you're seeing this is that due to the ORDER BY clause we're picking up that id value from the index' subscript rather than from the master map.