Read a Parquet file to a JSON file and load in your IRIS repository

Yuri Marx · Nov 27, 2023 3m read

#Big Data #HealthShare #InterSystems IRIS #InterSystems IRIS for Health

According to Databricks, Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC (source: https://www.databricks.com/glossary/what-is-parquet). Below are the characteristics and benefits of Parquet according to Databricks:

Characteristics of Parquet

Free and open source file format.
Language agnostic.
Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.
Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
Highly efficient data compression and decompression.
Supports complex data types and advanced nested data structures.
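
To make the column-based point concrete, here is a minimal Python sketch of the same three records held row by row and column by column. It is only a toy: the sample data and the helper name to_columns are made up for illustration, and a real Parquet file adds row groups, pages, encodings and metadata on top of this idea.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# The records and the helper name are hypothetical, not part of iris-parquet.
rows = [
    {"id": 1, "name": "Ana", "age": 34},
    {"id": 2, "name": "Bruno", "age": 41},
    {"id": 3, "name": "Carla", "age": 29},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented access walks every full record to average a single field.
avg_from_rows = sum(record["age"] for record in rows) / len(rows)

# Column-oriented access reads one list and never touches "id" or "name".
avg_from_columns = sum(columns["age"]) / len(columns["age"])

print(columns["age"])
print(avg_from_rows == avg_from_columns)
```

Both averages are the same; the difference is how much data has to be touched to compute them, which is where analytics queries gain from a columnar file.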

Benefits of Parquet

Good for storing big data of any kind (structured data tables, images, videos, documents).
Saves on cloud storage space by using highly efficient column-wise compression, and flexible encoding schemes for columns with different data types.
Increased data throughput and performance using techniques like data skipping, whereby queries that fetch specific column values need not read the entire row of data.
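
The two ideas behind these benefits can be sketched in a few lines of Python. The first part run-length encodes a repetitive column, one of the simple encodings that works well on column data. The second part skips whole chunks of a column using stored minimum and maximum values. The function names, the chunk layout and the sample data are assumptions made for this sketch; Parquet's actual encodings and statistics are more elaborate.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a list as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    decoded = []
    for value, count in pairs:
        decoded.extend([value] * count)
    return decoded


# A column with long runs of repeated values shrinks to a few pairs.
country = ["BR"] * 4 + ["US"] * 3 + ["BR"] * 2
encoded = rle_encode(country)

# Hypothetical column chunks, each carrying min/max statistics.
chunks = [
    {"min": 1, "max": 100, "values": list(range(1, 101))},
    {"min": 101, "max": 200, "values": list(range(101, 201))},
    {"min": 201, "max": 300, "values": list(range(201, 301))},
]


def scan_greater_than(column_chunks, threshold):
    """Return the matching values and how many chunks were actually read."""
    matches, chunks_read = [], 0
    for chunk in column_chunks:
        if chunk["max"] <= threshold:
            continue  # the statistics prove nothing here can match
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, chunks_read


matches, chunks_read = scan_greater_than(chunks, 250)
print(encoded)
print(len(matches), chunks_read)
```

In this sample, nine column values collapse into three pairs, and the filter reads only one of the three chunks because the statistics of the other two rule them out.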

A standard as important as this could not be left out of InterSystems IRIS, the best Data Fabric on the market. It is therefore now possible to use the iris-parquet application (https://openexchange.intersystems.com/package/iris-parquet) to read and write Parquet data.

Installation procedures

Installation with Docker

1. Clone/git pull the repo into any local directory:

$ git clone https://github.com/yurimarx/iris-parquet.git

2. Open a terminal in this directory and run the following commands to build and start InterSystems IRIS in a container:

$ docker-compose build
$ docker-compose up -d

Installation with ZPM

1. Execute in the IRIS terminal:

USER> zpm install iris-parquet

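2. Install the Hadoop files and set the HADOOP_HOME environment variable:

# Download and unpack Hadoop 3.3.6, then export HADOOP_HOME for the current shell.
# Adjust the path if you unpacked the archive elsewhere.
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && \
    tar -xzf hadoop-3.3.6.tar.gz && \
    export HADOOP_HOME=//hadoop-3.3.6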
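
Before running a conversion, it can help to check that HADOOP_HOME is set and points to a real directory. The following Python sketch does only that; the function name hadoop_home_problem is made up for this example, and it assumes the check runs in the same environment as the process that performs the conversion.

```python
import os


def hadoop_home_problem(env=os.environ):
    """Describe what is wrong with HADOOP_HOME, or return None if it looks usable."""
    home = env.get("HADOOP_HOME")
    if not home:
        return "HADOOP_HOME is not set"
    if not os.path.isdir(home):
        return f"HADOOP_HOME points to a missing directory: {home}"
    return None


# Report on the current environment.
print(hadoop_home_problem() or "HADOOP_HOME looks usable")
```

The check says nothing about whether the Hadoop files inside the directory are complete; it only catches the two most common setup mistakes.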

Write Parquet from SQL

There are two options, ObjectScript or the REST API:

1. From ObjectScript (sample; replace with your own values):

Set result = ##class(dc.irisparquet.IrisParquet).SQLToParquet(
        "personSchema",
        "persons",
        "jdbc:IRIS://localhost:1972/IRISAPP",
        "SELECT * FROM dc_irisparquet.SamplePerson",
        "/tmp/sample.parquet"
    )

2. From REST API:

Read Parquet to JSON

There are two options, ObjectScript or the REST API:

1. From ObjectScript (sample; replace with your own values):

Set result = ##class(dc.irisparquet.IrisParquet).ParquetToJSON(
        "/tmp/"_source.FileName,
        "/tmp/content.json"
        )
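
Once the JSON file exists, a quick way to inspect it outside IRIS is a few lines of Python. This is a hedged sketch: the article does not show the layout of the generated file, so the helper below (a hypothetical name, load_records) accepts either a single JSON array of objects or one JSON object per line, and the sample strings are invented for illustration.

```python
import json


def load_records(text):
    """Parse a JSON array of objects, or newline-delimited JSON objects."""
    text = text.strip()
    if not text:
        return []
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: fall back to one document per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return parsed if isinstance(parsed, list) else [parsed]


# Invented samples showing the two layouts the helper accepts.
sample_array = '[{"name": "Ana", "age": 34}, {"name": "Bruno", "age": 41}]'
sample_lines = '{"name": "Ana", "age": 34}\n{"name": "Bruno", "age": 41}'

print(load_records(sample_array) == load_records(sample_lines))
```

To read the real output, pass the file content instead of a sample string, for example load_records(open("/tmp/content.json", encoding="utf-8").read()).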

2. From REST API: