Article
Nov 20, 2023

Parquet files and InterSystems IRIS

In the world of Big Data, selecting the right file format is crucial for efficient data storage, processing, and analysis. With the massive amount of data generated every day, the choice of format can greatly affect the speed, cost, and accuracy of data processing tasks. Several file formats are available, each with its own advantages and disadvantages, which makes the decision complex. Popular Big Data file formats include CSV, JSON, Avro, ORC, and Parquet. The last one, Parquet, is a columnar storage format that is highly compressed and splittable, making it ideal for Big Data problems. It is optimized for the Write Once Read Many (WORM) paradigm and performs well under heavy workloads that read only portions of the data. The format supports predicate pushdown and projection pushdown, which reduce disk I/O and overall query time. Parquet is self-describing, has built-in support in Spark, and can be read and written using the Avro API and Avro Schema (source: https://www.linkedin.com/pulse/big-data-file-formats-which-one-right-fit...).

Have you ever used pd.read_csv() in pandas? That command could have run roughly 50x faster if you had used Parquet instead of CSV (source: https://towardsdatascience.com/demystifying-the-parquet-file-format-13ad...); a minimal comparison sketch follows the list below. The reasons for and benefits of using Parquet are (source: https://www.linkedin.com/pulse/big-data-file-formats-which-one-right-fit...):

  • Parquet is optimized for Big Data, but human readability and write speed are comparatively poor.
  • Parquet is the best choice for performance when choosing a data storage format in Hadoop, considering factors like integration with third-party applications, schema evolution, and support for specific data types.
  • Compression algorithms play a significant role in reducing the amount of data and improving performance.
  • CSV is typically the fastest to write, JSON the easiest to understand for humans, and Parquet the fastest to read a subset of columns.
  • Columnar formats like Parquet are suitable for fast data ingestion, fast random data retrieval, and scalable data analytics.
  • Parquet is used for further analytics after preprocessing, where not all fields are required.
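
To make the pandas comparison above concrete, here is a minimal sketch that writes the same DataFrame to CSV and to Parquet and times reading each back. The file names and the synthetic data are illustrative only, the actual speedup depends on your data and on how many columns you read, and pandas needs pyarrow or fastparquet installed for Parquet support.

# Minimal sketch of the CSV vs. Parquet read comparison; file names and data are illustrative only.
import time
import pandas as pd

df = pd.DataFrame({"id": range(1_000_000), "name": ["person"] * 1_000_000, "age": [30] * 1_000_000})
df.to_csv("persons.csv", index=False)
df.to_parquet("persons.parquet")  # requires pyarrow or fastparquet

start = time.time()
pd.read_csv("persons.csv")
print(f"CSV read:     {time.time() - start:.3f}s")

start = time.time()
pd.read_parquet("persons.parquet", columns=["id"])  # projection pushdown: only the needed column is read
print(f"Parquet read: {time.time() - start:.3f}s")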

Currently, IRIS does not support the Parquet format, even though it is an important standard for Data Fabric, Big Data, Data Ingestion, and Data Interoperability projects. However, with the iris-parquet application (https://openexchange.intersystems.com/package/iris-parquet) it is now possible to write IRIS data to Parquet files and read Parquet files back into IRIS. Just follow the steps below:

1. Install iris-parquet using Docker or ZPM.

  • If Docker:
docker-compose build
docker-compose up -d
  • If ZPM:
USER> zpm install iris-parquet
  • Install the Hadoop files and set the HADOOP_HOME environment variable (for ZPM installation only):
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && \
    tar -xzf hadoop-3.3.6.tar.gz && \
    export HADOOP_HOME=/path/to/hadoop-3.3.6  # replace /path/to with the directory where hadoop-3.3.6 was extracted

2. Open http://localhost:<port>/swagger-ui/index.html and explore the parquet-api _spec endpoint.

3. Run the /generate-persons method one or more times to generate fake sample person data.
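
If you prefer to script this step instead of using the Swagger UI, a sketch like the one below could work. The port, the /parquet-api base path, and the GET verb are assumptions on my part, so confirm the exact contract in the _spec endpoint first.

# Hypothetical sketch: calling /generate-persons outside the Swagger UI.
# BASE_URL (port and /parquet-api base path) and the GET verb are assumptions; check the _spec endpoint.
import requests

BASE_URL = "http://localhost:52773/parquet-api"  # adjust to your instance's web server port and base path

response = requests.get(f"{BASE_URL}/generate-persons")
response.raise_for_status()
print(response.status_code, response.text[:200])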

4. Run the /sql2parquet method with this query in the request body: select * from dc_irisparquet.SamplePerson
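
This call can also be scripted. Again, the base path, port, verb, and plain-text body below are assumptions, and the response may carry the Parquet bytes or only a reference to the generated file (step 5 downloads it through the UI), so inspect it before parsing.

# Hypothetical sketch: sending the SQL query to /sql2parquet.
# Base path, port, POST verb, and plain-text body are assumptions; the Swagger spec is the source of truth.
import requests

BASE_URL = "http://localhost:52773/parquet-api"
query = "select * from dc_irisparquet.SamplePerson"

response = requests.post(f"{BASE_URL}/sql2parquet", data=query)
response.raise_for_status()
# Depending on the implementation, the body may be the Parquet file itself or a link to it (see step 5).
print(response.headers.get("Content-Type"), len(response.content))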

5. Download the Parquet file using the Download file link.

6. Open the Parquet file in VS Code (install the parquet-viewer extension to view Parquet content from VS Code: https://marketplace.visualstudio.com/items?itemName=dvirtz.parquet-viewer).
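
If you would rather verify the file programmatically, in addition to (or instead of) the VS Code extension, a quick pandas check works; "sample_person.parquet" is a placeholder for whatever name you saved the download under.

# Inspect the downloaded Parquet file with pandas (requires pyarrow or fastparquet).
# "sample_person.parquet" is a placeholder for the downloaded file name.
import pandas as pd

df = pd.read_parquet("sample_person.parquet")
print(df.head())
print(df.dtypes)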
