Using IRIS and Presto for high-performance and scalable SQL queries |

Article

Yuri Marx · Nov 27, 2024 8m read

Open Exchange

#Analytics #Performance #SQL #InterSystems IRIS #InterSystems IRIS for Health

The rise of Big Data projects, real-time self-service analytics, online query services, and social networks, among others, have enabled scenarios for massive and high-performance data queries. In response to this challenge, MPP (massively parallel processing database) technology was created, and it quickly established itself. Among the open-source MPP options, Presto (https://prestodb.io/) is the best-known option. It originated in Facebook and was utilized for data analytics, but later became open-sourced. However, since Teradata has joined the Presto community, it offers support now.

Presto connects to transactional data sources (Oracle, DB2, MySQL, PostgreSQL, MongoDB, and other SQL and NoSQL databases) and delivers distributed and in-memory SQL processing, combined with automatic optimizations of execution plans. Its objective, above all, is to execute fast queries regardless of whether you handle gigabytes or terabytes of data, scaling and parallelizing workloads.

Presto did not originally have a native connector for the IRIS Database, but fortunately, this issue was resolved with an InterSystems community project presto-iris (https://openexchange.intersystems.com/package/presto-iris). That is why we can now expose an MPP layer in front of InterSystems IRIS repositories to enable high-performance queries, reports, and dashboards from transactional data in IRIS.

In this article, we will follow a step-by-step guide on configuring Presto, connecting it to IRIS, and establishing an MPP layer for your customers to employ. We will also demonstrate the main features of Presto, its main commands and tools, always with IRIS as the source database.

Presto Features

Presto contains the following features:

Simple but extensible architecture.
Pluggable connectors (Presto supports pluggable connectors to provide metadata and data for queries).
Pipelined executions (it avoids unnecessary I/O latency overhead).
User-defined functions (analysts can create custom user-defined functions to migrate easily).
Vectorized columnar processing.

Benefits of Presto

Below, you can see a list of benefits that Apache Presto offers:

Specialized SQL operations;
Easy installation and debugging;
Simple storage abstraction;
Quickly scalable petabytes of data with low latency.

Presto Architecture

Clients: They are PrestoDB consumers. Clients use the JDBC/ODBC/REST protocol to talk to coordinators.
Coordinators: They are responsible for managing associated worker nodes, parsing, handling queries, and generating execution plans. They are also in charge of data delivery for processing between workers, which creates logical plans composed of stages, where every stage is executed in a distributed way using tasks across workers.
Workers: They are compute nodes for executing tasks and processing data, allowing scale data processing and consumption.
Communication: Each Presto worker advertises the coordinator using a discovery server to prepare for work.
Connectors: Each possible data source type has a connector employed by Presto to consume data. The project https://openexchange.intersystems.com/package/presto-iris allows Presto to utilize InterSystems IRIS.
Catalog: It contains information about the location of data including schemas and the data source. When users run an SQL statement in Presto, they run it against one or more catalogs.

Presto use cases

The InterSystems IRIS and Presto together allow you the following use cases:

Ad-hoc queries: You can run ad-hoc queries with high performance against terabytes of data.
Reporting and dashboarding: There is an engine to deliver high-performance data queries for reports, self-service BI, and analytics tools, e.g., Apache Superset (check out the sample in this article).
Open lakehouse: Presto has the connectors and catalogs to unify required data sources and deliver scalable queries and data using SQL between workers.

InterSystems IRIS is a perfect partner for Presto. Since it is a high-performance data repository that supports distributed processing with the help of shards and associated with Presto workers, any volume of data can be queried in just a few milliseconds.

Installing and running the PrestoDB

There are various options (Docker and Java JAR) of how to install Presto. You can find more details related to it on https://prestodb.io/docs/current/installation/deployment.html. In this article we will utilize Docker. To facilitate understanding and enable a quick start, we have made a sample application available on Open Exchange (it was derived from another package https://openexchange.intersystems.com/package/presto-iris). Take the following steps to see for yourself:

Go to https://openexchange.intersystems.com/package/iris-presto-sample and download the sample used in this tutorial.
Start the demo environment with docker-compose:
```
docker-compose up -d --build
```
Note: For the demo purpose, it operates Apache Superset with superset-iris and examples that come with it. For that reason, it takes a while for it to load.
Presto UI will become available via this link: http://localhost:8080/ui/#.
Wait 15 to 20 minutes (there is a lot of sample data to load). When SuperSet finishes loading examples after 10-15 minutes, it should become available at the link http://localhost:8088/databaseview/list (type admin/admin as username/password on the login page).
Now go to Dashboards:
If we visit http://localhost:8080/ui, we can notice that Presto has executed queries and is displaying some statistics:

Above you can see the Presto web interface for monitoring and managing queries. It can be accessed from the port number specified in the coordinator Config Properties (for this article it is 8080).

Details about the sample code

Dockerfile

The Dockerfile is employed to create a PrestoDB Docker image with the presto-iris plugin and the InterSystems IRIS JDBC file included:

# Official PrestoDB image on Docker Hub
FROM prestodb/presto

# From https://github.com/caretdev/presto-iris/releases
# Adding presto-iris plugin into the Docker image
ADD https://github.com/caretdev/presto-iris/releases/download/0.1/presto-iris-0.1-plugin.tar.gz /tmp/presto-iris/presto-iris-0.1-plugin.tar.gz

# From https://github.com/intersystems-community/iris-driver-distribution
# Adding IRIS JDBC driver into the Docker image
ADD https://raw.githubusercontent.com/intersystems-community/iris-driver-distribution/refs/heads/main/JDBC/JDK18/com/intersystems/intersystems-jdbc/3.8.4/intersystems-jdbc-3.8.4.jar /opt/presto-server/plugin/iris/intersystems-jdbc-3.8.4.jar

RUN --mount=type=bind,src=.,dst=/tmp/presto-iris \
    tar -zxvf /tmp/presto-iris/presto-iris-0.1-plugin.tar.gz -C /opt/presto-server/plugin/iris/ --strip-components=1

Docker-compose.yml file

This file creates 3 container instances: one for InterSystems IRIS (IRIS service), one for PrestoDB (Presto service), and one for the Superset (Superset service). The Superset is an Analytics visualization tool utilized for seeing data in Dashboards.

# from the project https://github.com/caretdev/presto-iris
services:
  # create an InterSystems IRIS container instance
  iris:
    image: intersystemsdc/iris-community
    ports:
      - 1972
      - 52773
    environment:
      IRIS_USERNAME: _SYSTEM
      IRIS_PASSWORD: SYS
  # create a PrestoDB container instance consuming the IRIS database
  presto:
    build: .
    volumes:
      # PrestoDB will use iris.properties to get connection information
       - ./iris.properties:/opt/presto-server/etc/catalog/iris.properties
    ports:
      - 8080:8080
  # create a Superset (Dashboard analytics tool) container instance
  superset:
    image: apache/superset:3.0.2
    platform: linux/amd64
    environment:
      SUPERSET_SECRET_KEY: supersecret
      # create an InterSystems IRIS connection to load sample data
      SUPERSET_SQLALCHEMY_EXAMPLES_URI: iris://_SYSTEM:SYS@iris:1972/USER
    volumes:
      - ./superset_entrypoint.sh:/superset_entrypoint.sh
      - ./superset_config.py:/app/pythonpath/superset_config.py
    ports:
      - 8088:8088
    entrypoint: /superset_entrypoint.sh

The iris.properties file

This file has the information required to connect PrestoDB to InterSystems IRIS DB and create an MPP layer for high-performance and scalable queries from Superset dashboards.

# from the project https://github.com/caretdev/presto-iris
connector.name=iris
connection-url=jdbc:IRIS://iris:1972/USER
connection-user=_SYSTEM
connection-password=SYS

The superset_entrypoint.sh file

This script installs the superset-iris library (for Superset support of IRIS), inits the Superset instance, and loads sample data into InterSystems IRIS DB. At runtime the data consumed by Superset will come from PrestoDB, which will be an MPP layer for IRIS DB.

#!/bin/bash

# Install the InterSystems IRIS Superset extension
pip install superset-iris

superset db upgrade

superset fab create-admin \
            --username admin \
            --firstname Superset \
            --lastname Admin \
            --email admin@superset.com \
            --password ${ADMIN_PASSWORD:-admin}
superset init

# Load examples to IRIS
superset load-examples

# Change examples database URI to Presto
superset set-database-uri -d examples -u presto://presto:8080/iris

/usr/bin/run-server.sh

About Superset

This article used Superset for dashboards.It is a modern data exploration and visualization platform that can replace or augment proprietary business intelligence tools for many teams.Superset integrates well with a variety of data sources.
Superset provides the following:

A no-code interface for building charts quickly
A powerful web-based SQL Editor for advanced querying
A lightweight semantic layer for defining custom dimensions and metrics fast
Out-of-the-box support for nearly any SQL database or data engine
A wide array of beautiful visualizations to showcase your data, ranging from simple bar charts to geospatial visualizations
Lightweight configurable caching layer to help ease database load
Highly extensible security roles and authentication options
An API for programmatic customization
A cloud-native architecture designed from the ground up for scale

Sources and additional learning materials

Extensive tutorial about PrestoDB: https://www.tutorialspoint.com/apache_presto/apache_presto_quick_guide.htm
PrestoDB documentation: https://prestodb.io/docs/current/overview.html
Presto-iris plugin: https://openexchange.intersystems.com/package/presto-iris
Iris-presto sample: https://openexchange.intersystems.com/package/iris-presto-sample
About Superset: https://github.com/apache/superset
Superset with InterSystems IRIS: https://openexchange.intersystems.com/package/superset-iris