Article
· Jul 18, 2020 7m read

Save Pandas DataFrame into IRIS - a quick note

Keyword: Pandas DataFrame, IRIS, Python, JDBC

Purpose

Pandas DataFrame is popular tool for EDA (Exploratory Data Analysis). In ML tasks, the first thing we usually perform is to understand the data a bit more. Last week I was trying this Covid19 dataset in Kaggle. Basically the data is a spreadsheet of 1925 encounter rows with 231 columns, and the task is simply to predict whether a patient (linked to  1 or more encounter records) would be admitted to ICU. So it's a normal classification task, and we would as usual use padas.DataFrame to take a quick look first.  

We have IRIS IntegrtedML now, which has a sleek SQL wrapper around a powerful options of "AutoML(s)" , so frequently I'd like to save various stage of dataframes into IRIS database tables, then run IntegratedML upon them with comparisons to tradition ML algorithms.  However, since the dataframe.to_sql() is not not working with IRIS yet, it turned out actually I spent most of time fiddling with other data saving methods. It felt like you wanted to cook a nice omelette for an enjoyable Saturday morning breakfast but you turned out doing the gas and sink plumbings under the hob all the day. It's not perfect but I think I'd better record a brief note here in case it's gone after few weeks again.

 

Scope

 No, I didn't make dataframe.to_sql() work with IRIS, sorry we are not not there yet, however I did spent some time making a quick Python function as simple and raw as possible to dynamically save a dataframe into IRIS directly via JDBC (JayDeBeApi).  It might not be working with PyODBC yet due to a known issue ("MemoryError").

 

Environment

I am testing with this IntegratedML template via a simple docker-compose as a base for the following scripts. The GitHub repo includes an environment topology. I am using this JDBC Jupyter Notebook to connect into the IRIS container. 

 

Test

 1. Define a Python Function to emulate dataframe.to_sql()

I ran this in a Notebook cell:

def to_sql_iris(cursor, dataFrame, tableName, schemaName='SQLUser', drop_table=False ):
        """"
        Dynamically insert dataframe into an IRIS table via SQL by "excutemany" 
        
        Inputs:
            cursor:      Python JDBC or PyODBC cursor from a valid and establised DB connection
            dataFrame:   Pandas dataframe
            tablename:   IRIS SQL table to be created, inserted or apended
            schemaName:  IRIS schemaName, default to "SQLUser"
            drop_table:  If the table already exsits, drop it and re-create it if True; othrewise keep it and appen 
        Output:
            True is successful; False if there is any exception.
        """
        if drop_table:   
            try:                 
                curs.execute("DROP TABLE %s.%s" %(schemaName, tableName))
            except Exception:
                pass
        
        try: 
            dataFrame.columns = dataFrame.columns.str.replace("[() -]", "_")
            curs.execute(pd.io.sql.get_schema(dataFrame, tableName))
        except Exception:
            pass
        
        curs.fast_executemany = True
        cols = ", ".join([str(i) for i in dataFrame.columns.tolist()])
        wildc =''.join('?, ' * len(dataFrame.columns))
        wildc = '(' + wildc[:-2] + ')'
        sql = "INSERT INTO " + tableName + " ( " + cols.replace('-', '_') + " ) VALUES" + wildc
        #print(sql)
        curs.executemany(sql, list(dataFrame.itertuples(index=False, name=None)) )
        return True

Basically it is trying to dynamically insert a dataframe into an IRIS table. If the table already exists, then the full dataframe will be appended at its end, otherwise a new table will be created based on the dimensions(column names and data types) of the dataframe and its full content will be inserted.  It is simply using  "executemany" method.

 

2. Test - Read the raw datafile into dataframe

Run the following in Notebook, to load the raw data from local drive into a dataframe. The raw data can be simply downloaded from this Kaggle site.

import numpy as np 
import pandas as pd 
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import seaborn as sns
sns.set(style="whitegrid")

import os
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
df = pd.read_excel("./input/raw_data_kaggle_covid_icu.xlsx")
df
./input/datasets_605991_1272346_Kaggle_Sirio_Libanes_ICU_Prediction.xlsx
Out[2]:
 
PATIENT_VISIT_IDENTIFIER
AGE_ABOVE65
AGE_PERCENTIL
GENDER
DISEASE GROUPING 1
DISEASE GROUPING 2
DISEASE GROUPING 3
DISEASE GROUPING 4
DISEASE GROUPING 5
DISEASE GROUPING 6
...
TEMPERATURE_DIFF
OXYGEN_SATURATION_DIFF
BLOODPRESSURE_DIASTOLIC_DIFF_REL
BLOODPRESSURE_SISTOLIC_DIFF_REL
HEART_RATE_DIFF_REL
RESPIRATORY_RATE_DIFF_REL
TEMPERATURE_DIFF_REL
OXYGEN_SATURATION_DIFF_REL
WINDOW
ICU
0
0
1
60th
0
0.0
0.0
0.0
0.0
1.0
1.0
...
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
0-2
0
1
0
1
60th
0
0.0
0.0
0.0
0.0
1.0
1.0
...
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
2-4
0
2
0
1
60th
0
0.0
0.0
0.0
0.0
1.0
1.0
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
4-6
0
3
0
1
60th
0
0.0
0.0
0.0
0.0
1.0
1.0
...
-1.000000
-1.000000
NaN
NaN
NaN
NaN
-1.000000
-1.000000
6-12
0
4
0
1
60th
0
0.0
0.0
0.0
0.0
1.0
1.0
...
-0.238095
-0.818182
-0.389967
0.407558
-0.230462
0.096774
-0.242282
-0.814433
ABOVE_12
1
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1920
384
0
50th
1
0.0
0.0
0.0
0.0
0.0
0.0
...
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
0-2
0
1921
384
0
50th
1
0.0
0.0
0.0
0.0
0.0
0.0
...
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
2-4
0
1922
384
0
50th
1
0.0
0.0
0.0
0.0
0.0
0.0
...
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
4-6
0
1923
384
0
50th
1
0.0
0.0
0.0
0.0
0.0
0.0
...
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
-1.000000
6-12
0
1924
384
0
50th
1
0.0
0.0
1.0
0.0
0.0
0.0
...
-0.547619
-0.838384
-0.701863
-0.585967
-0.763868
-0.612903
-0.551337
-0.835052
ABOVE_12
0
1925 rows × 231 columns

 

3. Test - Connect into IRIS DB via Python over JDBC:

import jaydebeapi
url = "jdbc:IRIS://irisimlsvr:51773/USER" 
driver = 'com.intersystems.jdbc.IRISDriver'
user = "SUPERUSER"
password = "SYS"
jarfile = "./intersystems-jdbc-3.1.0.jar"

conn = jaydebeapi.connect(driver, url, [user, password], jarfile)
curs = conn.cursor()

4. Test - Save the dataframe into an IRIS table

iris_schema = 'SQLUser'
iris_table = 'Covid19RawTableFromCSV'

to_sql_iris(curs, df, iris_table, iris_schema, drop_table=True)   # save it into a new IRIS table of specified name
#to_sql_iris(curs, df, iris_table)  # append dataframe to an exsiting IRIS table

Out[4]: True

import pandas as pd
from IPython.display import display

df2 = pd.read_sql("SELECT COUNT(*) from %s.%s" %(iris_schema, iris_table),conn)
display(df2)
   

 

Aggregate_1
0 1925

So the full data is inserted into an IRIS table called "Covid19RawTableFromCSV". If logging into IRIS Management Portal, we should see the table too with records.

 

5. Test - run a bit benchmark:

Let's insert this dataframe e.g. 10x times, to see how long it took in this basic CE docker  with single JDBC session:

from tqdm import tqdm
import pandas as pd
import time
from IPython.display import display

start = time.clock() 
for i in tqdm(range(0, 10)):
    to_sql_iris(curs, df, iris_table)
    
print("Total time elasped: ", time.clock()-start, " for importing total records:")
df2 = pd.read_sql("SELECT COUNT(*) from %s.%s" %(iris_schema, iris_table),conn)
display(df2)
100%|██████████| 10/10 [00:14<00:00,  1.42s/it]
Total time elasped:  12.612431999999998  for importing total records:

 

Aggregate_1

0

19250

 

That's it, although very basic, at least I now can save whatever manipulated dataframes along the data analysis pipline, and able to invoke IntegratedML via its SQL interface  on them now, for a bit real ML attempts.  

Some caveats to be mentioned here:  The string in DataFrame is sometimes interpreted as "object", which needs to be converted to string before being inserted into IRIS table, something like  df['column'].astype(str) for example.  "DROP TABLE" is used for overwriting previous tables. "DROP VIEW" can be used to delete previous views. 

Discussion (3)0
Log in or sign up to continue

Iterating through large pandas dataFrame objects is generally slow. Pandas iteration beats the whole purpose of using DataFrame. It is an anti-pattern and is something you should only do when you have exhausted every other option. It is better look for a List Comprehensions , vectorized solution or DataFrame.apply() method.

Pandas DataFrame loop using list comprehension example

result = [(x, y,z) for x, y,z in zip(df['column_1'], df['column_2'],df['column_3'])]