
InterSystems Data Fabric Studio at your service!

You've probably encountered the terms Data Lake, Data Warehouse, and Data Fabric everywhere over the last 10 to 15 years. Everything, it seems, can be solved with one of these three things, or a combination of them (here and here are a couple of articles from our official website, in case you have any doubts about what each term means). If we had to summarize the purpose behind all of them visually, we could say that they all try to solve situations like this:

[Image: "Why is my room always messy? Tips & Tricks", Junk Brothers]

Our organizations are like that room, a multitude of drawers filled with data everywhere, in which we are unable to find anything we need, and we are completely unaware of what we have.

Well, at InterSystems we couldn't be left behind, so, taking advantage of the capabilities of InterSystems IRIS, we have created a Data Fabric solution called InterSystems Data Fabric Studio (original, aren't we?).

Data Fabric

First of all, let's take a closer look at the features that characterize a Data Fabric, and what better way to do so than by asking our beloved ChatGPT directly:

Data Fabric is a modern architecture that seeks to simplify and optimize data access, management, and use across multiple environments, facilitating a unified and consistent view of data. Its most distinctive features include:

  1. Unified and transparent access
    • Seamless integration of structured, semi-structured, and unstructured data.
    • Seamless access regardless of physical or technological location.
  2. Centralized metadata management
    • Advanced data catalogs that provide information on origin, quality, and use.
    • Automatic data search and discovery capabilities.
  3. Virtualization and data abstraction
    • Eliminating the need to constantly move or replicate data.
    • Dynamic creation of virtual views that enable real-time distributed queries.
  4. Integrated governance and security
    • Consistent application of security, privacy, and compliance policies across all environments.
    • Integrated protection of sensitive data through encryption, masking, and granular controls.
  5. AI-powered automation
    • Automating data discovery, preparation, integration, and optimization using artificial intelligence.
    • Automatic application of advanced techniques to improve quality and performance.
  6. Advanced analytical capabilities
    • Integrated support for predictive analytics, machine learning, and real-time data processing.

InterSystems Data Fabric Studio

InterSystems Data Fabric Studio, or IDFS from now on, is a cloud-based SaaS solution (for now) designed to deliver the functionality expected of a Data Fabric.

Those of you with more experience developing on InterSystems IRIS will have noticed that many Data Fabric features are straightforward to implement on IRIS, which is exactly what we thought at InterSystems: why not leverage our technology to offer our customers a ready-made solution?

Modern and user-friendly interface.

This is a true first for InterSystems: a simple, modern, and functional web interface based on the latest versions of technologies like Angular.

With transparent access to your source data.

The first step to efficiently exploiting your data begins with connecting to it. Different data sources require different types of connections, such as JDBC, REST APIs, or CSV files.

IDFS provides connectors for a wide variety of data sources, including connections to different databases via JDBC using pre-installed connection libraries.
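To make the idea concrete, here is a minimal sketch, assuming a hypothetical Oracle source and the Python jaydebeapi library, of the kind of JDBC connection and discovery query an IDFS connector handles for you. In IDFS itself, all of this is configured through the UI, with the driver libraries already pre-installed:

```python
# Sketch of the JDBC connection a Data Fabric connector manages for you.
# Host, service name, and credentials are hypothetical.
import jaydebeapi

conn = jaydebeapi.connect(
    "oracle.jdbc.OracleDriver",                 # JDBC driver class
    "jdbc:oracle:thin:@//dbhost:1521/SALESDB",  # hypothetical Oracle source
    ["catalog_user", "secret"],                 # credentials
    "ojdbc8.jar",                               # pre-installed driver library
)

cur = conn.cursor()
cur.execute("SELECT owner, table_name FROM all_tables")  # discover schemas and tables
for owner, table in cur.fetchall():
    print(owner, table)
cur.close()
conn.close()
```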

Analyze your data sources and define your own catalog.

Every Data Fabric must let users analyze the information available in their data sources, displaying all the associated metadata they need to decide whether or not it is relevant for further exploitation.

With IDFS, once you've defined the connections to your different databases, you can begin the discovery and cataloging tasks using features such as importing schemas defined in the database.

In the following image, you can see an example of this discovery phase. From an established connection to an Oracle database, we can access all the schemas present in it, as well as all the tables defined within each schema.

This functionality is not limited to the rigid structures defined by external databases: using SQL queries across multiple tables in the data source, IDFS lets you generate catalogs containing only the information most relevant to the user.

Below you can see an example of a query against multiple tables in the same database and a visualization of the retrieved data.
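As an illustration, this is the kind of multi-table query a catalog entry can be built from; the tables and columns are hypothetical, and in IDFS the query is entered through the catalog editor rather than in code:

```python
# Hypothetical multi-table query of the kind used to define a catalog entry.
CATALOG_QUERY = """
SELECT c.customer_id,
       c.customer_name,
       o.order_id,
       o.order_date,
       o.total_amount
FROM   customers c
JOIN   orders o ON o.customer_id = c.customer_id
WHERE  o.order_date >= DATE '2024-01-01'
"""

cur = conn.cursor()        # connection from the earlier JDBC sketch
cur.execute(CATALOG_QUERY)
preview = cur.fetchmany(10)  # preview the retrieved data
```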

Once our catalog is defined, IDFS stores only the configuration metadata; there is no need to import the actual data at any point, which is what provides data virtualization.

Consult and manage your data catalog.

The volume of data in any organization can be considerable, so managing the catalogs we build on top of it must be agile and simple.

IDFS allows us to consult our entire data catalog at any time, allowing us to recognize at a glance what data we have access to.

As you can see, the functionalities explained so far fully cover the first two points that ChatGPT listed as necessary for a Data Fabric tool. Let's now see how IDFS covers the remaining points.

One of the advantages of IDFS is that, being built on InterSystems IRIS, it leverages IRIS's vector search capabilities, enabling semantic searches across the data catalog so you can retrieve all catalogs related to a given search.
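As a hint of what this looks like under the hood, here is a minimal sketch of a semantic search built on IRIS's SQL vector functions (TO_VECTOR and VECTOR_COSINE). This illustrates the underlying IRIS capability, not IDFS's internal implementation; the catalog table, column names, embedding model, and connection details are all assumptions:

```python
# Sketch: semantic search over catalog descriptions with IRIS vector search.
# Table, columns, and model are hypothetical illustrations.
from sentence_transformers import SentenceTransformer
import iris  # intersystems-irispython DB-API driver

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("customer revenue by region")

conn = iris.connect("localhost", 1972, "USER", "_SYSTEM", "SYS")
cur = conn.cursor()
cur.execute(
    """
    SELECT TOP 5 catalog_name,
           VECTOR_COSINE(description_vector, TO_VECTOR(?, DOUBLE)) AS similarity
    FROM Catalog_Entries
    ORDER BY similarity DESC
    """,
    [",".join(str(x) for x in query_vec)],
)
for name, similarity in cur.fetchall():
    print(f"{similarity:.3f}  {name}")
```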

Prepare your data for later use.

It's pointless to identify and catalog our data if we can't make it available to third parties in the way they need it. This step is key, as providing data in the required formats will facilitate its use, simplifying the analysis and development processes of new solutions.

IDFS makes this process easier by creating "Recipes," a name that fits perfectly since what we're going to do is "cook" our data.

As with any good recipe, our ingredients (the data) will go through several steps that will allow us to finally prepare the dish to our liking.

Prepare your data (Staging)

The first step in any recipe is to gather all the necessary ingredients; this is the preparation, or staging, step. It allows you to choose, from your entire catalog, the entries that contain the required information.

Transform your data (Transformation)

Any Data Fabric worth its salt must be able to transform data sources and must have the capacity to do so quickly and effectively.

IDFS lets you condition the data through whatever transformations are needed so that the client system can understand it.

These transformations can be of various types: string replacement, rounding of values, SQL expressions that transform data, etc. All of these data transformations will be persisted directly to the IRIS database without affecting the data source at any time.
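To ground the idea, here is a small sketch of those transformation types expressed with pandas. It mimics what the transformation step does conceptually, with hypothetical column names; it is not IDFS code:

```python
import pandas as pd

# Hypothetical staged data.
df = pd.DataFrame({
    "country": ["ES", "FR", "ES"],
    "revenue": [1234.567, 89.123, 45.999],
})

df["country"] = df["country"].replace({"ES": "Spain", "FR": "France"})  # string replacement
df["revenue"] = df["revenue"].round(2)                                  # rounding of values
df["tier"] = ["high" if r > 100 else "low" for r in df["revenue"]]      # expression-style rule
print(df)
```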

After this step, our data would be adapted to the requirements of the client system that will use it.

Data Validation

In a Data Fabric, it's not enough to simply transform the data; it's necessary to ensure that the data being provided to third parties is accurate.

IDFS includes a data validation step that allows us to filter the data we deliver to our clients. Data that doesn't pass validation generates warnings or alerts to be handled by the person responsible.

An important point of this validation phase in IDFS is that it can also be applied to the fields we transformed in the previous step.
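Conceptually, the validation step behaves like the following sketch: rules filter the rows and tag failures for review rather than silently dropping them. The rule names and thresholds are hypothetical, and note that the second rule checks "tier", a field created in the transformation sketch above:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that fail validation, tagged with a warning message."""
    failures = []
    negative = df[df["revenue"] < 0]
    if not negative.empty:
        failures.append(negative.assign(warning="negative revenue"))
    unknown_tier = df[~df["tier"].isin(["high", "low"])]  # validates a transformed field
    if not unknown_tier.empty:
        failures.append(unknown_tier.assign(warning="unknown tier"))
    return pd.concat(failures) if failures else df.iloc[0:0]
```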

Data Reconciliation

It is very common to need to check our data against an external source, ensuring that the data in our Data Fabric is consistent with the information available in other tables of our data source.

IDFS has a reconciliation process that allows us to compare our validated data with this external data source, thereby ensuring its validity.
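In sketch form, and with hypothetical tables, reconciliation amounts to joining our validated data against the external reference on a shared key and surfacing any disagreement:

```python
import pandas as pd

fabric = pd.DataFrame({"order_id": [1, 2, 3], "total": [100.0, 250.0, 75.0]})
reference = pd.DataFrame({"order_id": [1, 2, 3], "total": [100.0, 249.5, 75.0]})

merged = fabric.merge(reference, on="order_id", suffixes=("_fabric", "_source"))
mismatches = merged[merged["total_fabric"] != merged["total_source"]]
print(mismatches)  # order 2 disagrees with the external source and needs review
```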

Data Promotion

Every Data Fabric must be able to forward all the information that has passed through it to third-party systems. To do this, it must have processes that export this transformed and validated data.

IDFS allows you to promote data that has gone through all the previous steps to a previously defined data source. The promotion is a simple process in which we define the following (sketched in code after the list):

  1. The data source to which we will send the information.
  2. The target schema (related to a table in the data source).
  3. The mapping between our transformed and validated data and the destination table.
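Expressed as a configuration object, with hypothetical names (in IDFS this mapping is defined through the UI, and this is not the actual IDFS schema), the three settings look roughly like this:

```python
# Hypothetical promotion configuration.
promotion = {
    "target_connection": "warehouse-jdbc",      # 1. destination data source
    "target_schema": "ANALYTICS.SALES_CLEAN",   # 2. target schema/table
    "field_mapping": {                          # 3. transformed field -> destination column
        "customer_name": "CUSTOMER",
        "revenue": "NET_REVENUE",
        "tier": "CUSTOMER_TIER",
    },
}
```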

Once the previous configuration is complete, our recipe is ready to go into action whenever we want. To do so, we only need to take one last step: schedule the execution of our recipe.

Business scheduler

Before continuing, let's quickly review what we've done:

  1. Define our data sources.
  2. Import the relevant catalogs.
  3. Create a recipe to cook our data.
  4. Configure the import, transformation, validation, and promotion of our data to an external database.

As you can see, all that's left is to define when we want our recipe to run. Let's get to it!

Very simply, we can indicate when the steps defined in our recipe should run: on a schedule, at the end of a previous execution, manually, etc.

These scheduling capabilities let us seamlessly chain recipe runs, streamlining execution and giving us finer control over what's happening with our data.
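As a sketch, with hypothetical field names rather than the real IDFS schema, the trigger styles and a chained run might be described like this:

```python
# Hypothetical scheduling configuration illustrating the trigger styles described above.
schedules = [
    {"recipe": "sales_clean",   "trigger": {"type": "cron", "expr": "0 2 * * *"}},       # nightly
    {"recipe": "sales_publish", "trigger": {"type": "after", "recipe": "sales_clean"}},  # chained
    {"recipe": "ad_hoc_fix",    "trigger": {"type": "manual"}},                          # on demand
]
```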

Each run of our recipes leaves a record that we can consult later to check its status:

The executions will generate a series of reports that are easily searchable and downloadable. Each report will show the results of each of the steps defined in our recipe:

Conclusions

We've reached the end of the article. I hope it helped you better understand the concept of Data Fabric and that you found our new InterSystems Data Fabric Studio solution interesting.

Thank you for your time!
