Python JDBC connection into IRIS database - a quick note
Keywords: Python, JDBC, SQL, IRIS, Jupyter Notebook, Pandas, NumPy, Machine Learning
This is another quick 5-minute note on invoking the IRIS JDBC driver via Python 3 (for example, within a Jupyter Notebook) to read data from and write data into an IRIS database instance using SQL syntax, for demo purposes.
Last year I wrote a brief note on Python binding into a Caché database instance (section 4.7). Now it might be time to recap some options for using Python to hook into an IRIS database: to read its data into a Pandas DataFrame and a NumPy array for routine analysis, then to write some pre-processed or normalised data back into IRIS, ready for further ML/DL pipelines.
A few quick options immediately spring to mind:
- ODBC: How about PyODBC for Python 3 and native SQL?
- JDBC: How about JayDeBeApi for Python 3 and native SQL?
- Spark: How about the PySpark and SQL?
- Python Native API for IRIS: beyond the previous Python Binding for Cache?
- IPython SQL magic (%%sql): could it work with IRIS yet?
Are there any other options I am missing here? I am interested in trying them too.
Shall we just start with the normal JDBC approach? We will recap ODBC, Spark and the Python Native API in the next brief note.
The following common components are used in this quick demo:
- Jupyter Notebook
- Python 3
- An IRIS 2019.x instance
Out of Scope:
The following will NOT be touched on in this quick note - they are important, and can be addressed separately with site-specific solutions, deployments and services:
- End-to-end security.
- Non-functional aspects such as performance.
- Troubleshooting and support.
3.1 Run an IRIS instance:
I simply ran an IRIS 2019.4 container as a "remote" database server. You can use any IRIS instance to which you have appropriately authorised access.
zhongli@UKM5530ZHONGLI MINGW64 /c/Program Files/Docker Toolbox
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d86be69a03ab quickml-demo "/iris-main" 3 days ago Up 3 days (healthy) 0.0.0.0:9091->51773/tcp, 0.0.0.0:9092->52773/tcp quickml
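The container above could have been started with something like the following (the image name and port mappings are taken from the listing above; adjust them for your own setup):

```shell
# Run the demo image as a detached "remote" IRIS server, mapping the
# superserver port (51773) and web port (52773) to host ports 9091/9092
docker run -d --name quickml \
    -p 9091:51773 -p 9092:52773 \
    quickml-demo
```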
3.2 Anaconda and Jupyter Notebook:
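If Anaconda is already installed, the Jupyter package can be added and a notebook server launched from the Anaconda Prompt, for example:

```shell
# Install (if needed) and launch Jupyter from an Anaconda environment
conda install -y jupyter
jupyter notebook
```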
3.3 Install JayDeBeApi and JPyPe:
At the time of writing (Jan 2020), JayDeBeApi pulls in JPype 0.7, which does not work due to a known bug, so JPype had to be downgraded to 0.6.3.
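For example, the install and downgrade can be done with pip (note that JPype's package name on PyPI is JPype1):

```shell
# Install JayDeBeApi, then pin JPype to 0.6.3 to work around the
# JPype 0.7 bug mentioned above
pip install jaydebeapi
pip install --force-reinstall JPype1==0.6.3
```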
3.4 Connect into IRIS database via JDBC
There is official documentation on JDBC for IRIS here.
For Python SQL execution over JDBC, I used the following code as an example. It connects to a data table called "DataMining.IrisDataset" within the "USER" namespace of this IRIS instance.
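A sketch of that connection code is below. The JDBC jar path, the mapped superserver port (9091 maps to 51773 in the container above) and the credentials are all assumptions for this demo setup and must be adjusted for your own IRIS instance:

```python
import jaydebeapi

# All connection details below are assumptions for this demo setup
jdbc_driver = "com.intersystems.jdbc.IRISDriver"
jdbc_jar = "./intersystems-jdbc-3.0.0.jar"   # hypothetical local path to the IRIS JDBC jar
url = "jdbc:IRIS://localhost:9091/USER"      # host:port/namespace
user, password = "SuperUser", "SYS"          # demo credentials - use your own

def read_iris_table(sql="SELECT * FROM DataMining.IrisDataset"):
    """Open a JDBC connection, run the query and return all rows."""
    conn = jaydebeapi.connect(jdbc_driver, url, [user, password], jdbc_jar)
    try:
        curs = conn.cursor()
        curs.execute(sql)
        return curs.fetchall()
    finally:
        conn.close()

# Uncomment to run against a live instance; this produced the output below:
# result = read_iris_table()
# print("Total records:", len(result))
# for row in result:
#     print(row)
```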
Total records: 150
(1, 1.4, 0.2, 5.1, 3.5, 'Iris-setosa')
(2, 1.4, 0.2, 4.9, 3.0, 'Iris-setosa')
(3, 1.3, 0.2, 4.7, 3.2, 'Iris-setosa')
... ...
(49, 1.5, 0.2, 5.3, 3.7, 'Iris-setosa')
(50, 1.4, 0.2, 5.0, 3.3, 'Iris-setosa')
(51, 4.7, 1.4, 7.0, 3.2, 'Iris-versicolor')
... ...
(145, 5.7, 2.5, 6.7, 3.3, 'Iris-virginica')
... ...
(148, 5.2, 2.0, 6.5, 3.0, 'Iris-virginica')
(149, 5.4, 2.3, 6.2, 3.4, 'Iris-virginica')
(150, 5.1, 1.8, 5.9, 3.0, 'Iris-virginica')
Now we have verified that Python over JDBC is working. What follows is just a bit of routine data analysis and preprocessing for typical ML pipelines, which we may touch on again and again in later demos and comparisons, so it is attached here for convenience.
3.5 Convert SQL results to Pandas DataFrame then NumPy Array
Install the Pandas and NumPy packages via Conda if they are not yet installed, similar to section 3.3 above.
Then I ran the following as an example:
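A sketch of the conversion is below. In practice `result` would be the 150 rows fetched over JDBC in section 3.4; a few sample rows are inlined here so the snippet stands alone, and the column names match the query output above:

```python
import pandas as pd
import numpy as np

# "result" would normally come from curs.fetchall() in section 3.4;
# a few sample rows are inlined here so the snippet is self-contained
result = [
    (1, 1.4, 0.2, 5.1, 3.5, 'Iris-setosa'),
    (2, 1.4, 0.2, 4.9, 3.0, 'Iris-setosa'),
    (51, 4.7, 1.4, 7.0, 3.2, 'Iris-versicolor'),
    (150, 5.1, 1.8, 5.9, 3.0, 'Iris-virginica'),
]
columns = ['ID', 'PetalLength', 'PetalWidth', 'SepalLength', 'SepalWidth', 'Species']
df = pd.DataFrame(result, columns=columns)

# Min-max normalise the four numeric feature columns into [0, 1],
# then take the result as a NumPy array
features = df[['PetalLength', 'PetalWidth', 'SepalLength', 'SepalWidth']]
arrayN = ((features - features.min()) / (features.max() - features.min())).to_numpy()
print(arrayN)
```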
Let's have a routine peek into the current data:
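For example, something like the following (a small inlined sample stands in for the DataFrame built from the SQL results above):

```python
import pandas as pd

# A small sample frame; in practice df comes from the SQL results above
result = [
    (1, 1.4, 0.2, 5.1, 3.5, 'Iris-setosa'),
    (51, 4.7, 1.4, 7.0, 3.2, 'Iris-versicolor'),
    (150, 5.1, 1.8, 5.9, 3.0, 'Iris-virginica'),
]
df = pd.DataFrame(result, columns=['ID', 'PetalLength', 'PetalWidth',
                                  'SepalLength', 'SepalWidth', 'Species'])

print(df.head())                      # first rows
print(df.describe())                  # summary statistics of numeric columns
print(df['Species'].value_counts())   # class distribution
```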
Now we have a DataFrame, and a normalised NumPy array from a source data table, at our disposal.
Certainly, we can now try the various routine analyses that an ML person would start with, using Python here in place of R.
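A couple of such starting-point analyses are sketched below (again with a small inlined sample standing in for the full 150-row DataFrame):

```python
import pandas as pd

# A small sample frame; in practice df comes from section 3.5
result = [
    (1, 1.4, 0.2, 5.1, 3.5, 'Iris-setosa'),
    (2, 1.4, 0.2, 4.9, 3.0, 'Iris-setosa'),
    (51, 4.7, 1.4, 7.0, 3.2, 'Iris-versicolor'),
    (150, 5.1, 1.8, 5.9, 3.0, 'Iris-virginica'),
]
num_cols = ['PetalLength', 'PetalWidth', 'SepalLength', 'SepalWidth']
df = pd.DataFrame(result, columns=['ID'] + num_cols + ['Species'])

# Per-class feature means
print(df.groupby('Species')[num_cols].mean())

# Pairwise correlations between the numeric features
print(df[num_cols].corr())
```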
3.6 Split data and write back into IRIS database via SQL:
Certainly we can split the data into a training set and a validation or test set, as usual, then write them back into temporary database tables, for some exciting upcoming ML features of IRIS:
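A sketch of the split and write-back is below. The 80/20 split yields the 120/30 row counts mentioned next on the full table; the table names TRAIN02/TEST02 and column types are demo choices, and the write-back reuses the JDBC cursor from section 3.4 (the calls are commented out so the snippet stands alone):

```python
import pandas as pd

# A small sample frame; in practice df is the full 150-row DataFrame
result = [
    (1, 1.4, 0.2, 5.1, 3.5, 'Iris-setosa'),
    (2, 1.4, 0.2, 4.9, 3.0, 'Iris-setosa'),
    (51, 4.7, 1.4, 7.0, 3.2, 'Iris-versicolor'),
    (52, 4.5, 1.5, 6.4, 3.2, 'Iris-versicolor'),
    (150, 5.1, 1.8, 5.9, 3.0, 'Iris-virginica'),
]
df = pd.DataFrame(result, columns=['ID', 'PetalLength', 'PetalWidth',
                                  'SepalLength', 'SepalWidth', 'Species'])

# An 80/20 random split (on the full table: 120 training rows, 30 test rows)
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

def write_back(curs, table_name, frame):
    """Create a demo table and bulk-insert the frame's rows over JDBC."""
    curs.execute(
        "CREATE TABLE {} (ID INT, PetalLength DOUBLE, PetalWidth DOUBLE, "
        "SepalLength DOUBLE, SepalWidth DOUBLE, Species VARCHAR(32))".format(table_name)
    )
    curs.executemany(
        "INSERT INTO {} VALUES (?, ?, ?, ?, ?, ?)".format(table_name),
        [tuple(row) for row in frame.itertuples(index=False)]
    )

# Using the JDBC cursor from section 3.4:
# write_back(curs, "TRAIN02", train)
# write_back(curs, "TEST02", test)
```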
Now if we switch to the IRIS Management Portal, or a Terminal SQL console, we should see two temporary tables created: TRAIN02 with 120 rows and TEST02 with 30 rows.
I will have to stop here, as this is supposed to be a very brief quick note.
- The content above may be changed or refined.
We can simply replace sections 3.3 and 3.4 above with PyODBC, PySpark or the Python Native API for IRIS - unless anyone would like to help contribute a quick note on those too, which I would also appreciate.