Keywords: Pandas DataFrame, IRIS, Python, JDBC
Pandas DataFrame is a popular tool for EDA (Exploratory Data Analysis). In ML tasks, the first thing we usually do is get to know the data a bit better. Last week I was trying this Covid-19 dataset on Kaggle. The data is basically a spreadsheet of 1,925 encounter rows with 231 columns, and the task is simply to predict whether a patient (linked to one or more encounter records) will be admitted to the ICU. So it's a normal classification task, and as usual we would use pandas.DataFrame to take a quick look first.
We now have IRIS IntegratedML, which wraps a sleek SQL interface around a powerful set of AutoML options, so I frequently want to save various stages of my dataframes into IRIS database tables, then run IntegratedML on them and compare against traditional ML algorithms. However, since dataframe.to_sql() does not work with IRIS yet, it turned out I actually spent most of my time fiddling with other ways to save the data. It felt like wanting to cook a nice omelette for an enjoyable Saturday morning breakfast, but ending up doing the gas and sink plumbing under the hob all day. It's not perfect, but I think I'd better record a brief note here before it all fades again after a few weeks.
No, I didn't make dataframe.to_sql() work with IRIS; sorry, we are not there yet. However, I did spend some time making a quick Python function, as simple and raw as possible, to dynamically save a dataframe into IRIS directly via JDBC (JayDeBeApi). It may not work with PyODBC yet due to a known issue ("MemoryError").
I am using this IntegratedML template, brought up via a simple docker-compose, as the base for the following scripts. The GitHub repo includes an environment topology. I am using this JDBC Jupyter Notebook to connect to the IRIS container.
1. Define a Python Function to emulate dataframe.to_sql()
I ran this in a Notebook cell:
```python
import pandas as pd

def to_sql_iris(cursor, dataFrame, tableName, schemaName='SQLUser', drop_table=False):
    """Dynamically insert a dataframe into an IRIS table via SQL "executemany".
    Inputs:
        cursor: Python JDBC or PyODBC cursor from a valid and established DB connection
        dataFrame: Pandas dataframe
        tableName: IRIS SQL table to be created, inserted into or appended to
        schemaName: IRIS schema name, defaults to "SQLUser"
        drop_table: if True and the table already exists, drop and re-create it;
                    otherwise keep it and append
    Output:
        True if successful; False if there is any exception.
    """
    if drop_table:
        try:
            cursor.execute("DROP TABLE %s.%s" % (schemaName, tableName))
        except Exception:
            pass
    try:
        # Sanitise column names: spaces, parentheses and dashes become "_"
        dataFrame.columns = dataFrame.columns.str.replace(r"[() -]", "_", regex=True)
        # Create the table from the dataframe's column names and dtypes
        cursor.execute(pd.io.sql.get_schema(dataFrame, tableName))
    except Exception:
        pass
    cursor.fast_executemany = True  # effective for PyODBC; a no-op for JayDeBeApi
    cols = ", ".join([str(i) for i in dataFrame.columns.tolist()])
    wildc = ''.join('?, ' * len(dataFrame.columns))
    wildc = '(' + wildc[:-2] + ')'
    sql = "INSERT INTO " + tableName + " ( " + cols.replace('-', '_') + " ) VALUES " + wildc
    # print(sql)
    try:
        cursor.executemany(sql, list(dataFrame.itertuples(index=False, name=None)))
    except Exception:
        return False
    return True
```
Basically, it dynamically inserts a dataframe into an IRIS table. If the table already exists, the full dataframe is appended to its end; otherwise a new table is created based on the dimensions (column names and data types) of the dataframe, and its full content is inserted. It simply uses "executemany" to batch the row inserts through a single parameterised INSERT statement.
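To make the dynamic part concrete, here is a minimal stand-alone sketch (no IRIS connection needed) of the two statements the function builds: the CREATE TABLE from pd.io.sql.get_schema after the same column-name sanitisation, and the parameterised INSERT. The column names below are made up for illustration, not taken from the actual dataset:

```python
import pandas as pd

# A tiny made-up dataframe with "awkward" column names, for illustration only
demo_df = pd.DataFrame({
    "AGE_ABOVE65": [1, 0],
    "DISEASE GROUPING 1": [0, 1],
    "BLOODPRESSURE_DIASTOLIC (MEAN)": [80.5, 91.0],
})

# Same sanitisation as in to_sql_iris(): spaces, parentheses and dashes -> "_"
demo_df.columns = demo_df.columns.str.replace(r"[() -]", "_", regex=True)

# The CREATE TABLE statement that would be sent to IRIS
print(pd.io.sql.get_schema(demo_df, "Covid19RawTableFromCSV"))

# The parameterised INSERT statement, built the same way as in the function
cols = ", ".join(demo_df.columns)
wildc = "(" + ", ".join("?" * len(demo_df.columns)) + ")"
sql = "INSERT INTO Covid19RawTableFromCSV ( " + cols + " ) VALUES " + wildc
print(sql)
```

Note that the sanitisation only touches spaces, parentheses and dashes; other special characters in column names would pass through untouched.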
2. Test - Read the raw data file into a dataframe
Run the following in a Notebook cell to load the raw data from the local drive into a dataframe. The raw data can simply be downloaded from this Kaggle site.
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import seaborn as sns

sns.set(style="whitegrid")
```
```python
import os

for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_excel("./input/raw_data_kaggle_covid_icu.xlsx")
```
3. Test - Connect to the IRIS DB via Python over JDBC:
```python
import jaydebeapi

url = "jdbc:IRIS://irisimlsvr:51773/USER"
driver = 'com.intersystems.jdbc.IRISDriver'
user = "SUPERUSER"
password = "SYS"
jarfile = "./intersystems-jdbc-3.1.0.jar"

conn = jaydebeapi.connect(driver, url, [user, password], jarfile)
curs = conn.cursor()
```
4. Test - Save the dataframe into an IRIS table
```python
iris_schema = 'SQLUser'
iris_table = 'Covid19RawTableFromCSV'

to_sql_iris(curs, df, iris_table, iris_schema, drop_table=True)  # save into a new IRIS table of the specified name
# to_sql_iris(curs, df, iris_table)  # append the dataframe to an existing IRIS table
```
```python
import pandas as pd
from IPython.display import display

df2 = pd.read_sql("SELECT COUNT(*) FROM %s.%s" % (iris_schema, iris_table), conn)
display(df2)
```
So the full data is inserted into an IRIS table called "Covid19RawTableFromCSV". If we log into the IRIS Management Portal, we should see the table there with its records too.
5. Test - Run a quick benchmark:
Let's insert this dataframe, say, 10 times over, to see how long it takes in this basic Community Edition docker container with a single JDBC session:
```python
from tqdm import tqdm
import pandas as pd
import time
from IPython.display import display

start = time.time()  # time.clock() was removed in Python 3.8
for i in tqdm(range(0, 10)):
    to_sql_iris(curs, df, iris_table)
print("Total time elapsed: ", time.time() - start, " for importing total records:")

df2 = pd.read_sql("SELECT COUNT(*) FROM %s.%s" % (iris_schema, iris_table), conn)
display(df2)
```
100%|██████████| 10/10 [00:14<00:00, 1.42s/it]
Total time elapsed: 12.612431999999998 for importing total records:
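As a side note: if a single executemany over the whole frame ever becomes a memory concern (e.g. the PyODBC "MemoryError" mentioned earlier), one workaround is to feed the rows in fixed-size chunks. A minimal sketch of such a helper, assuming the same INSERT statement and cursor as above (the chunk size of 500 is an arbitrary choice):

```python
import pandas as pd

def iter_row_chunks(dataFrame, chunk_size=500):
    """Yield the dataframe's rows as lists of tuples, chunk_size rows at a time."""
    rows = list(dataFrame.itertuples(index=False, name=None))
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

# Against a live cursor, usage would look like:
#   for chunk in iter_row_chunks(df, 500):
#       curs.executemany(sql, chunk)

# Quick self-check on a made-up frame of 1925 rows (same row count as the dataset)
demo = pd.DataFrame({"a": range(1925), "b": range(1925)})
chunks = list(iter_row_chunks(demo, 500))
print(len(chunks), [len(c) for c in chunks])
```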
That's it. Although very basic, at least I can now save whatever manipulated dataframes come out of the data analysis pipeline into IRIS, and invoke IntegratedML via its SQL interface on them for some real ML attempts.
Some caveats worth mentioning:
- A string column in a DataFrame is sometimes typed as "object", which needs to be converted to string before being inserted into an IRIS table, e.g. with df['column'].astype(str).
- "DROP TABLE" is used to overwrite a previous table; likewise, "DROP VIEW" can be used to delete a previous view.
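On the first caveat, a minimal sketch of the kind of conversion meant, on a made-up frame (the column names are illustrative, not from the actual dataset):

```python
import pandas as pd

# Made-up frame: one numeric column, one "object" column with mixed values
demo = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [1, 2, 3],
    "AGE_PERCENTIL": ["10th", 20, "Above 90th"],  # dtype "object", mixed str/int
})

# Convert object-typed columns so every value is a plain Python str
for col in demo.columns[demo.dtypes == "object"]:
    demo[col] = demo[col].astype(str)

print(demo["AGE_PERCENTIL"].tolist())  # ['10th', '20', 'Above 90th']
```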