Save Pandas DataFrame into IRIS - a quick note

Article

Zhong Li · Jul 18, 2020 7m read

Open Exchange

#JDBC #Machine Learning #Python #InterSystems IRIS

Keyword: Pandas DataFrame, IRIS, Python, JDBC

Purpose

Pandas DataFrame is popular tool for EDA (Exploratory Data Analysis). In ML tasks, the first thing we usually perform is to understand the data a bit more. Last week I was trying this Covid19 dataset in Kaggle. Basically the data is a spreadsheet of 1925 encounter rows with 231 columns, and the task is simply to predict whether a patient (linked to 1 or more encounter records) would be admitted to ICU. So it's a normal classification task, and we would as usual use padas.DataFrame to take a quick look first.

We have IRIS IntegrtedML now, which has a sleek SQL wrapper around a powerful options of "AutoML(s)" , so frequently I'd like to save various stage of dataframes into IRIS database tables, then run IntegratedML upon them with comparisons to tradition ML algorithms. However, since the dataframe.to_sql() is not not working with IRIS yet, it turned out actually I spent most of time fiddling with other data saving methods. It felt like you wanted to cook a nice omelette for an enjoyable Saturday morning breakfast but you turned out doing the gas and sink plumbings under the hob all the day. It's not perfect but I think I'd better record a brief note here in case it's gone after few weeks again.

Scope

No, I didn't make dataframe.to_sql() work with IRIS, sorry we are not not there yet, however I did spent some time making a quick Python function as simple and raw as possible to dynamically save a dataframe into IRIS directly via JDBC (JayDeBeApi). It might not be working with PyODBC yet due to a known issue ("MemoryError").

Environment

I am testing with this IntegratedML template via a simple docker-compose as a base for the following scripts. The GitHub repo includes an environment topology. I am using this JDBC Jupyter Notebook to connect into the IRIS container.

Test

1. Define a Python Function to emulate dataframe.to_sql()

I ran this in a Notebook cell:

def to_sql_iris(cursor, dataFrame, tableName, schemaName='SQLUser', drop_table=False ):
        """"
        Dynamically insert dataframe into an IRIS table via SQL by "excutemany" 
        
        Inputs:
            cursor:      Python JDBC or PyODBC cursor from a valid and establised DB connection
            dataFrame:   Pandas dataframe
            tablename:   IRIS SQL table to be created, inserted or apended
            schemaName:  IRIS schemaName, default to "SQLUser"
            drop_table:  If the table already exsits, drop it and re-create it if True; othrewise keep it and appen 
        Output:
            True is successful; False if there is any exception.
        """
        if drop_table:   
            try:                 
                curs.execute("DROP TABLE %s.%s" %(schemaName, tableName))
            except Exception:
                pass
        
        try: 
            dataFrame.columns = dataFrame.columns.str.replace("[() -]", "_")
            curs.execute(pd.io.sql.get_schema(dataFrame, tableName))
        except Exception:
            pass
        
        curs.fast_executemany = True
        cols = ", ".join([str(i) for i in dataFrame.columns.tolist()])
        wildc =''.join('?, ' * len(dataFrame.columns))
        wildc = '(' + wildc[:-2] + ')'
        sql = "INSERT INTO " + tableName + " ( " + cols.replace('-', '_') + " ) VALUES" + wildc
        #print(sql)
        curs.executemany(sql, list(dataFrame.itertuples(index=False, name=None)) )
        return True

Basically it is trying to dynamically insert a dataframe into an IRIS table. If the table already exists, then the full dataframe will be appended at its end, otherwise a new table will be created based on the dimensions(column names and data types) of the dataframe and its full content will be inserted. It is simply using "executemany" method.

2. Test - Read the raw datafile into dataframe

Run the following in Notebook, to load the raw data from local drive into a dataframe. The raw data can be simply downloaded from this Kaggle site.

import numpy as np 
import pandas as pd 
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import seaborn as sns
sns.set(style="whitegrid")

import os
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
df = pd.read_excel("./input/raw_data_kaggle_covid_icu.xlsx")

df

./input/datasets_605991_1272346_Kaggle_Sirio_Libanes_ICU_Prediction.xlsx

Out[2]:

	PATIENT_VISIT_IDENTIFIER	AGE_ABOVE65	AGE_PERCENTIL	GENDER	DISEASE GROUPING 1	DISEASE GROUPING 2	DISEASE GROUPING 3	DISEASE GROUPING 4	DISEASE GROUPING 5	DISEASE GROUPING 6	...	TEMPERATURE_DIFF	OXYGEN_SATURATION_DIFF	BLOODPRESSURE_DIASTOLIC_DIFF_REL	BLOODPRESSURE_SISTOLIC_DIFF_REL	HEART_RATE_DIFF_REL	RESPIRATORY_RATE_DIFF_REL	TEMPERATURE_DIFF_REL	OXYGEN_SATURATION_DIFF_REL	WINDOW	ICU
0	0	1	60th	0	0.0	0.0	0.0	0.0	1.0	1.0	...	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	0-2	0
1	0	1	60th	0	0.0	0.0	0.0	0.0	1.0	1.0	...	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	2-4	0
2	0	1	60th	0	0.0	0.0	0.0	0.0	1.0	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4-6	0
3	0	1	60th	0	0.0	0.0	0.0	0.0	1.0	1.0	...	-1.000000	-1.000000	NaN	NaN	NaN	NaN	-1.000000	-1.000000	6-12	0
4	0	1	60th	0	0.0	0.0	0.0	0.0	1.0	1.0	...	-0.238095	-0.818182	-0.389967	0.407558	-0.230462	0.096774	-0.242282	-0.814433	ABOVE_12	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1920	384	0	50th	1	0.0	0.0	0.0	0.0	0.0	0.0	...	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	0-2	0
1921	384	0	50th	1	0.0	0.0	0.0	0.0	0.0	0.0	...	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	2-4	0
1922	384	0	50th	1	0.0	0.0	0.0	0.0	0.0	0.0	...	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	4-6	0
1923	384	0	50th	1	0.0	0.0	0.0	0.0	0.0	0.0	...	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	6-12	0
1924	384	0	50th	1	0.0	0.0	1.0	0.0	0.0	0.0	...	-0.547619	-0.838384	-0.701863	-0.585967	-0.763868	-0.612903	-0.551337	-0.835052	ABOVE_12	0

1925 rows × 231 columns

3. Test - Connect into IRIS DB via Python over JDBC:

import jaydebeapi
url = "jdbc:IRIS://irisimlsvr:51773/USER" 
driver = 'com.intersystems.jdbc.IRISDriver'
user = "SUPERUSER"
password = "SYS"
jarfile = "./intersystems-jdbc-3.1.0.jar"

conn = jaydebeapi.connect(driver, url, [user, password], jarfile)
curs = conn.cursor()

4. Test - Save the dataframe into an IRIS table

iris_schema = 'SQLUser'
iris_table = 'Covid19RawTableFromCSV'

to_sql_iris(curs, df, iris_table, iris_schema, drop_table=True)   # save it into a new IRIS table of specified name
#to_sql_iris(curs, df, iris_table)  # append dataframe to an exsiting IRIS table

Out[4]: True

import pandas as pd
from IPython.display import display

df2 = pd.read_sql("SELECT COUNT(*) from %s.%s" %(iris_schema, iris_table),conn)
display(df2)

	Aggregate_1
0	1925

So the full data is inserted into an IRIS table called "Covid19RawTableFromCSV". If logging into IRIS Management Portal, we should see the table too with records.

5. Test - run a bit benchmark:

Let's insert this dataframe e.g. 10x times, to see how long it took in this basic CE docker with single JDBC session:

from tqdm import tqdm
import pandas as pd
import time
from IPython.display import display

start = time.clock() 
for i in tqdm(range(0, 10)):
    to_sql_iris(curs, df, iris_table)
    
print("Total time elasped: ", time.clock()-start, " for importing total records:")
df2 = pd.read_sql("SELECT COUNT(*) from %s.%s" %(iris_schema, iris_table),conn)
display(df2)

100%|██████████| 10/10 [00:14<00:00,  1.42s/it]

Total time elasped:  12.612431999999998  for importing total records:

	Aggregate_1
0	19250

That's it, although very basic, at least I now can save whatever manipulated dataframes along the data analysis pipline, and able to invoke IntegratedML via its SQL interface on them now, for a bit real ML attempts.

Some caveats to be mentioned here: The string in DataFrame is sometimes interpreted as "object", which needs to be converted to string before being inserted into IRIS table, something like df['column'].astype(str) for example. "DROP TABLE" is used for overwriting previous tables. "DROP VIEW" can be used to delete previous views.