Article
· Mar 18 10m read

InterSystems IRIS data platform: Architecture

The InterSystems IRIS data platform underlies all InterSystems applications, as well as thousands of customer and partner applications across Healthcare, Financial Services, Supply Chain, and other ecosystems. It is a converged platform, providing transactional-analytical data management, integrated interoperability, and data integration, as well as integrated analytics and AI. It supports the InterSystems Smart Data Fabric approach to managing diverse and distributed data.


At the core of our architecture are facilities for high-performance, multi-model, multi-lingual data processing in our core data engine, also known as the Common Data Plane. Around that lives a remarkable facility for scaling out extremely high volumes of data and high transaction rates that can reach over a billion database operations per second.

Next are two major subsystems: one that focuses on analytics and artificial intelligence (AI) and another that focuses on interoperability and data integration. These subsystems follow our fundamental philosophy of running everything close to the data to provide high performance with a minimal footprint.

Finally, around the subsystems, we have built a smart data fabric that enables customers to solve complex problems in a single stack. The following sections explore these layers and how they interact to give a better sense of what makes InterSystems IRIS technology so special.

Famous for its performance, the core of InterSystems technology is a highly efficient mechanism for data storage, indexing, and access. Unlike other database providers, we do not provide a natively relational or document database. We use an underlying storage format called globals, modeled as a highly optimized, multi-dimensional, array-style format built on a B+ tree that is automatically indexed with every operation. Built at a layer below data models such as relational, object, or document, this single storage format is projected into different data formats and models. This is referred to as the Common Data Plane.

The underlying global format is highly efficient and translatable to many different data models:

Globals (denoted with a caret “^” prefix) can have many subscripts, each of which can be numeric, alphanumeric, or symbolic. Globals are powerful and represent data in a general way that simultaneously supports many data paradigms with a single copy of the data. Cases like associative and sparse arrays are easy to handle in this approach. Data is also encoded compactly within the storage format itself, using encodings (denoted with a dollar-sign “$” prefix) that provide a small footprint and low latency thanks to disk and I/O optimizations. The format of these encodings is the same in memory, on disk, and on the wire. This minimizes the transformations involved in ingesting data and achieves the speeds expected of an in-memory database, but with the persistence typical of a disk-based database.
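To make the subscripting model concrete, here is a minimal Python sketch that models a global as a sparse, multi-dimensional array with ordered subscripts. Real globals are persistent B+ trees managed by the IRIS kernel; this class (and the `^Person` data in it) is purely a hypothetical stand-in to illustrate the access pattern:

```python
# Sketch only: an IRIS-style global modeled as a sparse array keyed
# by subscript tuples. The real structure is a persistent B+ tree
# that keeps subscripts in sorted order automatically.

class Global:
    def __init__(self, name):
        self.name = name          # e.g. "^Person"
        self.nodes = {}           # subscript tuple -> value

    def set(self, subscripts, value):
        self.nodes[tuple(subscripts)] = value

    def get(self, subscripts):
        return self.nodes.get(tuple(subscripts))

    def walk(self):
        # B+ tree storage keeps subscripts ordered, so a range scan
        # is effectively an in-order traversal.
        return sorted(self.nodes.items())

g = Global("^Person")
g.set([1, "Name"], "Alice")
g.set([1, "City"], "Boston")
g.set([2, "Name"], "Bob")
print(g.get([1, "Name"]))   # Alice
```

Note how subscripts of mixed types (a numeric ID, then a property name) coexist naturally, which is what lets the same structure serve associative arrays, sparse arrays, and record-like data at once.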

An example of how a single global can support multiple data models: if you are using SQL or BI tools, you can access the data in relational form, as tables with rows and columns. If you are doing object-oriented development, we automatically project your objects into globals and can subsequently project that data into relational form. Similarly, we can project JSON or other document formats into relational form.

This capability means that rather than having multiple data stores (one relational, another object, another document) and stitching them together, we have one copy of the data projected into all these different forms, without duplication, moving, or mapping. From this also comes a convenient combination of schema-on-write and schema-on-read: as with a data lakehouse, you can insert the data first and settle on the best schema for it afterwards, based on how it is actually used. This global structure works well for structured data, as well as for documents and semi-structured or unstructured data.
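The "one copy, many projections" idea can be sketched in a few lines of Python. The node layout and field names below are hypothetical; the point is that the relational rows and the JSON documents are both computed views over the same stored nodes, not copies:

```python
import json

# One copy of the data, stored as subscripted nodes
# (hypothetical layout: (row id, column name) -> value).
person = {
    (1, "Name"): "Alice",
    (1, "City"): "Boston",
    (2, "Name"): "Bob",
    (2, "City"): "Chicago",
}

def as_rows(nodes):
    # Relational projection: first subscript -> row, second -> column.
    rows = {}
    for (row_id, col), value in nodes.items():
        rows.setdefault(row_id, {"ID": row_id})[col] = value
    return list(rows.values())

def as_documents(nodes):
    # Document projection of the very same nodes.
    return [json.dumps(r) for r in as_rows(nodes)]

print(as_rows(person)[0])   # {'ID': 1, 'Name': 'Alice', 'City': 'Boston'}
```

Because both projections read the same nodes, an update to the stored data is immediately visible in every model, which is the property the article attributes to the Common Data Plane.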

A few encodings, engineered very tightly, are used to store data and indices efficiently.

While lists are the default storage encoding, InterSystems IRIS may represent data and indices in one or more of these encodings based on the data characteristics and/or the developer’s specifications. Vectors store large numbers of values of the same datatype efficiently and are used for columnar storage in analytics, for vector search, for time series, and for more specialized cases. Packed-value arrays (known as $pva) are ideal for document-oriented storage. Bitmaps are used for Boolean data and for highly efficient bitmap indices.
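The bitmap encoding is worth a small illustration. The idea is that set membership becomes one bit per row, so combining predicates collapses into bitwise AND/OR over machine words. This sketch uses Python integers as arbitrary-length bitmaps; it illustrates the technique, not the IRIS implementation:

```python
# Sketch: a bitmap index keeps one bit per row per indexed value,
# so combining predicates is a bitwise AND/OR over whole words.

def build_bitmap(column, value):
    bm = 0
    for row_id, v in enumerate(column):
        if v == value:
            bm |= 1 << row_id
    return bm

state  = ["MA", "NY", "MA", "CA", "MA"]
active = [True, True, False, True, True]

ma      = build_bitmap(state, "MA")     # rows 0, 2, 4
is_live = build_bitmap(active, True)    # rows 0, 1, 3, 4

hits = ma & is_live                     # MA AND active, one machine op
rows = [i for i in range(len(state)) if hits >> i & 1]
print(rows)                             # [0, 4]
```

This is why bitmap indices excel at the Boolean and low-cardinality filters common in analytics: the per-row work disappears into word-wide bit operations.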

All these data structures are automatically indexed, in a highly optimized update, upon every operation. Many successful customers have used this built-in indexing to carry out low-latency, fully transactional workloads, like the “billion database operations per second” mentioned earlier. Such consistent, near-instant indexing gives us low-latency access to all data in any format. Multi-model facilities made possible by the underlying global format are virtually instantaneous because there is only one copy of the data to change, and thus no time or space is needed for data replication. This also grants major advantages in ingestion speed, reliability, and scale-out.

The system can also combine encodings. The multi-lingual capability that globals provide means that you can work in the programming language of your choice, with effortless access to all needed formats. This is clearly true for relational access through standards like JDBC and ODBC, and it is equally true of the automatic mapping of objects in .NET or Java to the underlying format. From a development perspective, you do not need to worry about object-relational mapping; you just work with an object, and we take care of the storage format.

Around the core data engine is layered a distributed cache with built-in consistency guarantees. This cache uses our Enterprise Cache Protocol, or ECP, which satisfies textbook consistency guarantees for distributed data, even under failure. ECP builds these consistency rules in directly, maintaining data integrity across a distributed system in the presence of failures.

In other words, the performance of the distributed data stays high, even at scale. You can spread these ECP nodes for horizontal scaling, managing higher throughput. You can also spread them for data distribution, meaning that you can have in-memory performance without having to live within the memory available for any node. 

ECP works especially well in the cloud because of its scale-out. We’ve built it into our InterSystems Kubernetes Operator (IKO) to provide auto-scaling, and we can transparently add and remove ECP nodes without changes to the application. Scaling out like this is essentially linear, and you can independently scale out ingestion versus data processing versus data storage to optimize for your workload. Because ECP is robust to changes in topology, a node can die without affecting transaction processing. You can add nodes on the fly, and they pick up the load. That provides seamless elasticity, meaning you can size things dynamically and enjoy a net lower cost. ECP is transparent to the application; no changes are needed to scale out any application. Customers also have the flexibility to associate specific workloads with specific sets of nodes in an InterSystems IRIS cluster. For example, reporting or analytics workloads might be assigned to one pod, and transaction-heavy workloads to another.

The next layer of the InterSystems IRIS architecture is a built-in interoperability subsystem. It integrates data across messages, devices, and different APIs. It also integrates bulk data, in either the ETL or the ELT (extract-transform-load or extract-load-transform) pattern. InterSystems IRIS Interoperability uses the Common Data Plane as a built-in repository for all elements of message handling and data integration. It thus benefits from the performance and reliability of the first two layers, as well as from the multi-model capabilities. For example, bulk structured data tends to be relationally oriented, while many messaging protocols tend to be document-oriented.

By default, interoperability is persistent, meaning that messages and transformations are stored within the system for auditing, replay, and analytics. Unlike many other interoperability middleware offerings, delivery can be guaranteed, traced, and audited across the board. You can confirm that a message was delivered or see who sent what to whom, the type of information important for both analytics and forensics. The general paradigm for InterSystems IRIS Interoperability is object-oriented. This aids in the creation and maintenance of adapters: object inheritance minimizes the effort required to build any needed custom adapters, including testing. It also helps with the creation and maintenance of data transformations. As shown in Figure 8, use of a common object can dramatically reduce the number of transformations needed between different data formats or protocols. Rather than building and maintaining a data transformation for each pair of formats, a single transformation from each format into a common object provides a simpler approach that is easier to test and maintain.
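The common-object pattern can be sketched in a few lines: with N formats, you maintain N inbound and N outbound transformations instead of N×(N-1) pairwise ones. The format names and field mappings below are hypothetical, chosen only to show the hub-and-spoke shape:

```python
# Sketch: a common intermediate object turns pairwise format
# conversions into two hops: source -> common -> destination.

inbound = {
    "hl7":  lambda raw: {"id": raw["pid"], "name": raw["patient"]},
    "json": lambda raw: {"id": raw["id"],  "name": raw["name"]},
}
outbound = {
    "csv":  lambda obj: f'{obj["id"]},{obj["name"]}',
    "json": lambda obj: dict(obj),
}

def transform(raw, src, dst):
    common = inbound[src](raw)      # one hop into the common object
    return outbound[dst](common)    # one hop out to the target format

print(transform({"pid": 7, "patient": "Alice"}, "hl7", "csv"))  # 7,Alice
```

Adding a new format means writing two small functions rather than one per existing format, which is exactly the maintenance advantage the article describes.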

Within the InterSystems IRIS Interoperability subsystem, there is a wide range of integration scenarios across messages, devices, and APIs.

This interoperability includes built-in full lifecycle API management, streaming facilities, IoT integration, compatibility with cloud services, and more. We also provide dynamic gateways in multiple languages, enabling high performance integration of existing applications into these data flows in the language of your choice.

InterSystems IRIS Interoperability sits alongside a set of built-in analytics and AI facilities.

Each of these capabilities runs “close to the data,” meaning that in general we bring processing to the data rather than, at considerable cost and delay, move data to the processing.

Several analytics facilities are built into InterSystems IRIS. One is InterSystems IRIS BI, a MOLAP-type, cube-based business intelligence (BI) architecture optimized for latency. Because this set of subsystems is built into InterSystems IRIS, cube updates can be triggered by SQL events, taking only 10-20 milliseconds from data to dashboard. Having a single copy of data across transactions and analytics helps keep this latency low. And because ECP allows one set of nodes to run analytics in isolation from the transactional workload, analytics poses no risk to transactional responsiveness, while there is still never a need for more than a single copy of the data.

Another facility is Adaptive Analytics, which, unlike InterSystems IRIS BI, does not use prebuilt cubes. It dynamically optimizes and builds virtual cubes as it goes, making these available to BI tools. Adaptive Analytics is a ROLAP-type, headless analytics facility that integrates seamlessly with all leading BI tools, such as Tableau, Power BI, Qlik, Excel, and others.

Alongside the analytics facilities are several ML and AI facilities.

IntegratedML allows you to build AutoML-style machine learning (ML) models using SQL. You simply write SQL commands to create, train, and validate a model, and then to predict with it. The results can be used directly in SQL, so developers familiar with SQL can use ML predictions in their applications.
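As a sketch, the flow looks like the following sequence of statements, assembled here as plain strings (the table and model names are hypothetical, and the exact IntegratedML syntax should be checked against the current documentation). Over a live IRIS connection, each string would be handed to the SQL engine:

```python
# Hypothetical IntegratedML workflow: create, train, validate, predict.
# Table/model names are invented; verify syntax against the IRIS docs.
statements = [
    "CREATE MODEL RiskModel PREDICTING (Readmitted) FROM PatientHistory",
    "TRAIN MODEL RiskModel",
    "VALIDATE MODEL RiskModel FROM PatientHoldout",
    "SELECT PatientID, PREDICT(RiskModel) FROM NewPatients",
]
for stmt in statements:
    print(stmt)   # with a live connection: cursor.execute(stmt)
```

The notable design choice is that model lifecycle and inference both live in SQL, so predictions compose with ordinary queries (joins, filters, views) rather than requiring a separate ML serving layer.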

Python sits directly within the kernel of the data platform, so it runs directly against the data with maximum performance. You do not need to port models from a development or lab environment into a separate production environment. You can build and run in the same cluster, with the assurance that what you have built and what you run use the same data in the same format and are therefore consistent. Data science projects become simple and fast.

InterSystems IRIS embedded vector search capabilities let you search unstructured and semi-structured data. Data is converted to vectors (or embeddings), then stored and indexed in InterSystems IRIS for semantic search, retrieval-augmented generation (RAG), text analysis, recommendation engines, and other use cases.
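The core retrieval step in such a semantic search can be sketched without any IRIS-specific machinery: embed documents as vectors, then rank them by cosine similarity to the query vector. The tiny hand-made "embeddings" below stand in for real model output:

```python
import math

# Sketch of the ranking step behind vector search: score every
# stored embedding against the query embedding by cosine similarity.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hand-made 3-d stand-ins for real embedding vectors.
docs = {
    "billing FAQ":   [0.9, 0.1, 0.0],
    "clinical note": [0.1, 0.8, 0.3],
    "invoice guide": [0.7, 0.3, 0.2],
}
query = [0.85, 0.15, 0.05]   # embedding of e.g. "how do I pay my bill?"

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])   # the document closest to the query in embedding space
```

In production the vectors are high-dimensional and stored in an indexed, columnar encoding (the vector encoding described earlier) so this scoring does not require a brute-force scan in Python; the sketch only shows the similarity logic itself.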

These layers (the core data engine, the ECP scale-out layer, the interoperability subsystem, and the analytics facilities) are part of our unique ability to power a Smart Data Fabric architecture. Data fabric is an architectural pattern that provides common governance over a wide variety of data and data sources. A common pattern for a data fabric is to bring in data from multiple sources; normalize, deduplicate, cross-correlate, and improve the data; and then make it available to a variety of different applications:

Within most data fabrics, there are multiple capabilities including ingestion, pipelining, metadata, and more. What makes the InterSystems approach smart is the inclusion of analytics and AI within the data fabric:

One of the key tenets of InterSystems technology is “connect or collect.” Some facilities within InterSystems IRIS, like foreign tables or federated tables, let you work or “connect” with data where it lies. Or you can choose to collect that data. 

InterSystems IRIS is agnostic with respect to cloud provider and runs on premises, in the cloud of your choice, in heterogeneous and hybrid scenarios, or in multicloud environments. The fastest growing part of our business is our cloud services, which are available across multiple clouds. The flexibility to run wherever you want to deploy is key. That distinguishes InterSystems IRIS from, for example, the facilities provided by the cloud vendors themselves or many of the current options for data warehouses. You can run InterSystems IRIS and applications built with it wherever you want. Of course, InterSystems IRIS itself is available as a cloud managed service.

Source: InterSystems IRIS Data Platform: Architecture Guide
