Save Pandas DataFrame into IRIS - a quick note
Keywords: Pandas DataFrame, IRIS, Python, JDBC
Purpose
Pandas DataFrame is a popular tool for EDA (Exploratory Data Analysis). In ML tasks, the first thing we usually do is try to understand the data a bit better. Last week I was trying this Covid19 dataset on Kaggle. Basically the data is a spreadsheet of 1925 encounter rows with 231 columns, and the task is simply to predict whether a patient (linked to 1 or more encounter records) will be admitted to ICU. So it's a normal classification task, and as usual we would use pandas.DataFrame to take a quick look first.
We have IRIS IntegratedML now, which offers a sleek SQL wrapper around a powerful set of "AutoML" options, so I frequently want to save various stages of dataframes into IRIS database tables, then run IntegratedML on them and compare against traditional ML algorithms. However, since dataframe.to_sql() is not working with IRIS yet, it turned out I actually spent most of my time fiddling with other data-saving methods. It felt like wanting to cook a nice omelette for an enjoyable Saturday morning breakfast, but ending up doing the gas and sink plumbing under the hob all day. It's not perfect, but I think I'd better record a brief note here in case it's gone from my head again after a few weeks.
Scope
No, I didn't make dataframe.to_sql() work with IRIS - sorry, we are not there yet. However, I did spend some time making a quick Python function, as simple and raw as possible, to dynamically save a dataframe into IRIS directly via JDBC (JayDeBeApi). It might not work with PyODBC yet due to a known issue ("MemoryError").
Environment
I am testing with this IntegratedML template via a simple docker-compose as the base for the following scripts. The GitHub repo includes an environment topology. I am using this JDBC Jupyter Notebook to connect to the IRIS container.
Test
1. Define a Python Function to emulate dataframe.to_sql()
I ran this in a Notebook cell:
import pandas as pd

def to_sql_iris(cursor, dataFrame, tableName, schemaName='SQLUser', drop_table=False):
    """
    Dynamically insert a dataframe into an IRIS table via SQL, using "executemany"
    Inputs:
        cursor: Python JDBC or PyODBC cursor from a valid and established DB connection
        dataFrame: Pandas dataframe
        tableName: IRIS SQL table to be created, inserted into or appended to
        schemaName: IRIS schema name, defaults to "SQLUser"
        drop_table: if the table already exists, drop and re-create it if True; otherwise keep it and append
    Output:
        True if successful; raises an exception if the insert fails.
    """
    if drop_table:
        try:
            cursor.execute("DROP TABLE %s.%s" % (schemaName, tableName))
        except Exception:
            pass  # table did not exist yet
    try:
        # Sanitise column names: "()", spaces and "-" are not valid in IRIS identifiers
        dataFrame.columns = dataFrame.columns.str.replace("[() -]", "_", regex=True)
        cursor.execute(pd.io.sql.get_schema(dataFrame, tableName))
    except Exception:
        pass  # table already exists - just append
    try:
        cursor.fast_executemany = True  # PyODBC-only speed hint; harmless elsewhere
    except AttributeError:
        pass
    cols = ", ".join([str(i) for i in dataFrame.columns.tolist()])
    wildc = ''.join('?, ' * len(dataFrame.columns))
    wildc = '(' + wildc[:-2] + ')'
    sql = "INSERT INTO " + tableName + " ( " + cols + " ) VALUES " + wildc
    # print(sql)
    cursor.executemany(sql, list(dataFrame.itertuples(index=False, name=None)))
    return True
Basically it dynamically inserts a dataframe into an IRIS table. If the table already exists, the full dataframe is appended at its end; otherwise a new table is created based on the dimensions (column names and data types) of the dataframe, and its full content is inserted. It simply uses the "executemany" method.
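As a quick sanity check of the two dynamic pieces the function relies on - the DDL generated by pd.io.sql.get_schema() and the "?" placeholder string - you can preview both on a tiny stand-in dataframe (a sketch with hypothetical column names; note that get_schema() without a connection emits generic DDL, so the exact type names may differ from what IRIS would choose):

```python
import pandas as pd

# A tiny stand-in dataframe (hypothetical columns mirroring the dataset)
df_demo = pd.DataFrame({"AGE_ABOVE65": [1, 0], "WINDOW": ["0-2", "2-4"]})

# The same DDL-generation call used inside to_sql_iris
ddl = pd.io.sql.get_schema(df_demo, "DemoTable")
print(ddl)  # a generic CREATE TABLE statement naming both columns

# The same "?" placeholder construction used inside to_sql_iris
wildc = ''.join('?, ' * len(df_demo.columns))
wildc = '(' + wildc[:-2] + ')'
print(wildc)  # (?, ?)
```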
2. Test - Read the raw datafile into dataframe
Run the following in a Notebook to load the raw data from the local drive into a dataframe. The raw data can be downloaded from this Kaggle site.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import seaborn as sns
sns.set(style="whitegrid")

import os
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_excel("./input/raw_data_kaggle_covid_icu.xlsx")
df
|   | PATIENT_VISIT_IDENTIFIER | AGE_ABOVE65 | AGE_PERCENTIL | GENDER | DISEASE GROUPING 1 | DISEASE GROUPING 2 | DISEASE GROUPING 3 | DISEASE GROUPING 4 | DISEASE GROUPING 5 | DISEASE GROUPING 6 | ... | TEMPERATURE_DIFF | OXYGEN_SATURATION_DIFF | BLOODPRESSURE_DIASTOLIC_DIFF_REL | BLOODPRESSURE_SISTOLIC_DIFF_REL | HEART_RATE_DIFF_REL | RESPIRATORY_RATE_DIFF_REL | TEMPERATURE_DIFF_REL | OXYGEN_SATURATION_DIFF_REL | WINDOW | ICU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 0-2 | 0 |
| 1 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 2-4 | 0 |
| 2 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4-6 | 0 |
| 3 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -1.000000 | -1.000000 | NaN | NaN | NaN | NaN | -1.000000 | -1.000000 | 6-12 | 0 |
| 4 | 0 | 1 | 60th | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | -0.238095 | -0.818182 | -0.389967 | 0.407558 | -0.230462 | 0.096774 | -0.242282 | -0.814433 | ABOVE_12 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1920 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 0-2 | 0 |
| 1921 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 2-4 | 0 |
| 1922 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 4-6 | 0 |
| 1923 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 6-12 | 0 |
| 1924 | 384 | 0 | 50th | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | -0.547619 | -0.838384 | -0.701863 | -0.585967 | -0.763868 | -0.612903 | -0.551337 | -0.835052 | ABOVE_12 | 0 |

1925 rows × 231 columns
3. Test - Connect to the IRIS DB via Python over JDBC:
import jaydebeapi

url = "jdbc:IRIS://irisimlsvr:51773/USER"
driver = 'com.intersystems.jdbc.IRISDriver'
user = "SUPERUSER"
password = "SYS"
jarfile = "./intersystems-jdbc-3.1.0.jar"

conn = jaydebeapi.connect(driver, url, [user, password], jarfile)
curs = conn.cursor()
4. Test - Save the dataframe into an IRIS table
iris_schema = 'SQLUser'
iris_table = 'Covid19RawTableFromCSV'

to_sql_iris(curs, df, iris_table, iris_schema, drop_table=True)  # save into a new IRIS table of the specified name
# to_sql_iris(curs, df, iris_table)  # or append the dataframe to an existing IRIS table
Out[4]: True
import pandas as pd
from IPython.display import display

df2 = pd.read_sql("SELECT COUNT(*) from %s.%s" % (iris_schema, iris_table), conn)
display(df2)
|   | Aggregate_1 |
|---|---|
| 0 | 1925 |
So the full dataset is inserted into an IRIS table called "Covid19RawTableFromCSV".
If we log into the IRIS Management Portal, we should see the table there, with all its records.
5. Test - Run a quick benchmark:
Let's insert this dataframe, say, 10 times, to see how long it takes in this basic Community Edition docker over a single JDBC session:
from tqdm import tqdm
import pandas as pd
import time
from IPython.display import display

start = time.time()  # note: time.clock() was deprecated and removed in Python 3.8
for i in tqdm(range(0, 10)):
    to_sql_iris(curs, df, iris_table)
print("Total time elapsed: ", time.time() - start, " for importing total records:")
df2 = pd.read_sql("SELECT COUNT(*) from %s.%s" % (iris_schema, iris_table), conn)
display(df2)
100%|██████████| 10/10 [00:14<00:00, 1.42s/it]
Total time elapsed: 12.612431999999998 for importing total records:
|   | Aggregate_1 |
|---|---|
| 0 | 19250 |
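Using the figures printed above, a back-of-envelope throughput estimate for this single-session run would be:

```python
# Rough throughput from the benchmark above:
# 10 repeated inserts of the 1925-row dataframe in ~12.6 seconds
rows = 10 * 1925
seconds = 12.612432
print(round(rows / seconds), "rows/second")  # roughly 1500+ rows/second
```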
That's it. Although very basic, at least I can now save whatever manipulated dataframes come out of the data analysis pipeline into IRIS, and invoke IntegratedML on them via its SQL interface for some real ML attempts.
Some caveats to mention here: strings in a DataFrame are sometimes interpreted as "object" dtype, and need to be converted to str before being inserted into an IRIS table, e.g. with something like df['column'].astype(str). "DROP TABLE" is used to overwrite a previous table; similarly, "DROP VIEW" can be used to delete a previous view.
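For instance, a minimal sketch of that object-to-string conversion (hypothetical sample columns mirroring the dataset):

```python
import pandas as pd

# Hypothetical two-column sample: one string column, one numeric column
df_demo = pd.DataFrame({"AGE_PERCENTIL": ["60th", "50th"], "ICU": [0, 1]})
print(df_demo.dtypes)  # the string column shows up with "object" dtype

# Cast to plain str before handing the frame to to_sql_iris
df_demo["AGE_PERCENTIL"] = df_demo["AGE_PERCENTIL"].astype(str)
```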