Article · Dec 29, 2023 · 5 min read

Performance tests IRIS - PostgreSQL - MySQL using Python

It seems like only yesterday that we did a small project in Java to test the performance of IRIS, PostgreSQL and MySQL (you can find the article we wrote back in June linked at the end of this one). If you remember, IRIS was superior to PostgreSQL and clearly superior to MySQL in insertions, with no big difference in queries.

Well, shortly afterwards @Dmitry Maslennikov asked me, "Why don't you test it from a Python project?" So here is the Python version of the tests we previously performed using JDBC connections.

First of all, I should let you know that I am not an expert in Python, so if you see anything that could be improved, do not hesitate to contact me.

For this example I have used Jupyter Notebook, which greatly simplifies Python development and lets us see step by step what we are doing. The application is associated with this article so that you can run your own tests.

Warning for Windows users

If you clone the GitHub project in Visual Studio Code, you may have to change the default end-of-line setting from CRLF to LF to be able to deploy the containers correctly.

If you are going to reproduce the tests on your own computer, take the following into consideration: Docker Desktop will request permission to access the folders it needs to deploy the project. If you have not granted access to these folders before launching the Docker containers, the initial creation of the test table in PostgreSQL will fail. So, before launching the project, configure shared access to the project folders in your Docker Desktop.

To do this, go to Settings -> Resources -> File sharing and add the folder where you cloned the project to the list.

You are warned!

 

Performance tests

For these tests we will use a fairly simple table with the most basic information possible about a patient. Here you can see the command to create the table in SQL:

CREATE TABLE Test.Patient (
    Name VARCHAR(225),
    Lastname VARCHAR(225),
    Photo VARCHAR(5000),
    Phone VARCHAR(14),
    Address VARCHAR(225)    
)

As you can see, we have defined the patient's photo as a VARCHAR(5000); the reason is that this column is where we are going to include (theoretically) the vectorized information of the photo. A few months ago I published an article explaining how, using Embedded Python, we could implement a facial recognition system in IRIS (here), where you can see how images are transformed into vectors for later comparison. The point of the vectorization is that this vector format is the norm in many machine learning models, so it never hurts to test with something at least resembling reality (if only slightly).
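To give a rough idea of what ends up in that column: the generation code further down stringifies a NumPy vector, with a 50-component random vector standing in for a real face embedding. A minimal sketch of that idea:

import numpy as np

rng = np.random.default_rng()
embedding = rng.standard_normal(50)   # stand-in for a real face embedding
photo_value = str(embedding)          # this string goes into the VARCHAR(5000) Photo column
print(len(photo_value), photo_value[:60])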

 

Jupyter Notebook Setup

To simplify the development of the project in Python as much as possible, I have used the magnificent Jupyter Notebook tool that allows us to develop each of the functionalities that we will need step by step.

Here's a look at our Jupyter Notebook. Let's go through its most interesting points:

Importing libraries:

import iris
import names
import numpy as np
from datetime import datetime
import psycopg2
import mysql.connector
import matplotlib.pyplot as plt
import random_address
from phone_gen import PhoneNumber
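If you want to run the notebook outside the provided containers, these imports come from PyPI. The distribution names below are my best guess rather than something taken from the project, so check its requirements file:

# Likely PyPI distributions behind the imports above (an assumption,
# verify against the project's requirements):
#   pip install intersystems-irispython names numpy psycopg2-binary
#   pip install mysql-connector-python matplotlib random-address phone-gen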

Connecting to the databases:

IRIS:

# DB-API connection string in host:port/namespace format
connection_string = "iris:1972/TEST"
username = "superuser"
password = "SYS"
connectionIRIS = iris.connect(connection_string, username, password)
cursorIRIS = connectionIRIS.cursor()
print("Connected")

PostgreSQL:

connectionPostgres = psycopg2.connect(database="testuser",
                        host="postgres",
                        user="testuser",
                        password="testpassword",
                        port="5432")
cursorPostgres = connectionPostgres.cursor()
print("Connected")

MySQL:

connectionMySQL = mysql.connector.connect(
  host="mysql",
  user="testuser",
  password="testpassword"
)
cursorMySQL = connectionMySQL.cursor()
print("Connected")

Generation of the values to be inserted

phone_number = PhoneNumber("USA")
resultsIRIS = []
resultsPostgres = []
resultsMySQL = []
parameters =  []
for x in range(1000):
    rng = np.random.default_rng()
    parameter = []
    parameter.append(names.get_first_name())
    parameter.append(names.get_last_name())
    parameter.append(str(rng.standard_normal(50)))
    parameter.append(phone_number.get_number())
    parameter.append(random_address.real_random_address_by_state('CA')['address1'])
    parameters.append(parameter)

print("Parameters built")

Insertion into IRIS

date_before = datetime.now()

cursorIRIS.executemany("INSERT INTO Test.Patient (Name, Lastname, Photo, Phone, Address) VALUES (?, ?, ?, ?, ?)", parameters)
connectionIRIS.commit()
difference = datetime.now() - date_before
print(difference.total_seconds())
resultsIRIS.append(difference.total_seconds())

Insertion into PostgreSQL

date_before = datetime.now()
    
cursorPostgres.executemany("INSERT INTO test.patient (name, lastname, photo, phone, address) VALUES (%s,%s,%s,%s,%s)", parameters)
connectionPostgres.commit()
difference = datetime.now() - date_before
print(difference.total_seconds())
resultsPostgres.append(difference.total_seconds())

Insertion into MySQL

date_before = datetime.now()
    
cursorMySQL.executemany("INSERT INTO test.patient (name, lastname, photo, phone, address) VALUES (%s,%s,%s,%s,%s)", parameters)
connectionMySQL.commit()
difference = datetime.now() - date_before
print(difference.total_seconds())
resultsMySQL.append(difference.total_seconds())

For our test I have decided to insert the following set of values for each database:

  • 1 insertion with 1,000 patients.
  • 1 insertion with 5,000 patients.
  • 1 insertion with 20,000 patients.
  • 1 insertion with 50,000 patients.

Keep in mind when running the tests that most of the time is spent by Python generating the values to be inserted. To bring the scenario closer to reality, I ran several tests in advance so that the databases already held a significant set of records (around 200,000).
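To avoid repeating the notebook cells by hand for each batch size, the runs can be wrapped in a small loop. A minimal sketch for the IRIS case (the build_parameters helper is mine; it just repackages the generation cell above, and the same loop applies to PostgreSQL and MySQL):

def build_parameters(count):
    # Generate `count` random patient rows, exactly as in the cell above
    rng = np.random.default_rng()
    return [[names.get_first_name(),
             names.get_last_name(),
             str(rng.standard_normal(50)),  # stand-in for a photo embedding
             phone_number.get_number(),
             random_address.real_random_address_by_state('CA')['address1']]
            for _ in range(count)]

for batch_size in (1000, 5000, 20000, 50000):
    parameters = build_parameters(batch_size)
    date_before = datetime.now()
    cursorIRIS.executemany(
        "INSERT INTO Test.Patient (Name, Lastname, Photo, Phone, Address) VALUES (?, ?, ?, ?, ?)",
        parameters)
    connectionIRIS.commit()
    resultsIRIS.append((datetime.now() - date_before).total_seconds())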

Test results

Insertion of 1,000 patients:

  • InterSystems IRIS: 0.037949 seconds.
  • PostgreSQL: 0.106508 seconds.
  • MySQL: 0.053338 seconds.

Insertion of 5,000 patients:

  • InterSystems IRIS: 0.162791 seconds.
  • PostgreSQL: 0.432642 seconds.
  • MySQL: 0.18925 seconds.

Insertion of 20,000 patients:

  • InterSystems IRIS: 0.601944 seconds.
  • PostgreSQL: 1.803113 seconds.
  • MySQL: 0.594396 seconds.

Insertion of 50,000 patients:

  • InterSystems IRIS: 1.482824 seconds.
  • PostgreSQL: 4.581251 seconds.
  • MySQL: 2.162996 seconds.

Although this is a fairly simple test, it is very significant since it allows us to see the trend of each database regarding insertion performance.
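With the timings collected in resultsIRIS, resultsPostgres and resultsMySQL, the comparison chart can be drawn with the matplotlib import from the beginning. A minimal sketch, assuming one timing per batch size in each list:

batch_sizes = [1000, 5000, 20000, 50000]

plt.plot(batch_sizes, resultsIRIS, marker='o', label='InterSystems IRIS')
plt.plot(batch_sizes, resultsPostgres, marker='o', label='PostgreSQL')
plt.plot(batch_sizes, resultsMySQL, marker='o', label='MySQL')
plt.xlabel('Rows inserted')
plt.ylabel('Insertion time (seconds)')
plt.legend()
plt.show()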

Conclusions

If we compare the performance of the tests carried out with the Java project and the current ones in Python, we will see that on this occasion the behavior of PostgreSQL is clearly worse, being around three times slower than InterSystems IRIS in the largest test, while MySQL has improved compared to the Java version.

Unquestionably InterSystems IRIS remains the best of the three, with more linear behavior and better insertion performance, regardless of the technology used.

Technical characteristics of the laptop used for the tests:

  • Operating System: Microsoft Windows 11 Pro.
  • Processor: 13th Gen Intel(R) Core(TM) i9-13900H, 2600 MHz.
  • RAM: 64 GB.
Discussion

Great stuff!
This matches seamlessly with the "real" customer benchmark I did in 2009 (with Caché) against PostgreSQL and MySQL.
Years went by, but the principles haven't changed.

  • keep memory consumption low
  • data transfer between memory and "disk" (or whatever mimics a disk) is the slowest component of your system (except the human user), so keep it as flat as possible

Glad to see that this is as valid as it has ever been.
After so many years, all the new technology tricks haven't changed that.

I've created a new branch in the GitHub project to include Oracle tests. I'm not going to merge it into the main branch because the Oracle Docker image consumes a lot of memory and requires modifying the Docker Desktop configuration, but if anyone is interested, here is the published branch:

https://github.com/intersystems-ib/workshop-py-performance/tree/oracle
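For anyone who wants a peek before cloning the branch, connecting to Oracle from Python typically goes through the python-oracledb driver. A minimal sketch; the service name and credentials here are assumptions, not values taken from the branch:

import oracledb

# Assumed host/service and credentials -- check the branch's docker-compose
# setup for the actual values
connectionOracle = oracledb.connect(user="testuser",
                                    password="testpassword",
                                    dsn="oracle:1521/FREEPDB1")
cursorOracle = connectionOracle.cursor()
cursorOracle.executemany(
    "INSERT INTO test.patient (name, lastname, photo, phone, address) "
    "VALUES (:1, :2, :3, :4, :5)",  # Oracle uses positional bind variables
    parameters)
connectionOracle.commit()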

Here is the comparative graph with Oracle included:

As you can see, IRIS and Oracle are equivalent in terms of performance.