Article
· Feb 14, 2024 4m read

Data Tagging in IRIS Using Embedded Python and the OpenAI API

The invention and popularization of Large Language Models (such as OpenAI's GPT-4) have launched a wave of innovative solutions that can leverage large volumes of unstructured data that, until recently, were impractical or even impossible to process manually. Such applications include data retrieval (see Don Woodlock's ML301 course for a great intro to Retrieval Augmented Generation), sentiment analysis, and even fully autonomous AI agents, just to name a few!

In this article, I want to demonstrate how the Embedded Python feature of IRIS can be used to directly interface with the Python OpenAI library, by building a simple data tagging application that will automatically assign keywords to the records we insert into an IRIS table. These keywords can then be used to search and categorize the data, as well as for data analytics purposes. I will use customer reviews of products as an example use case.

Prerequisites

  • A running instance of IRIS
  • An OpenAI API key (which you can create here)
  • A configured development environment (I will be using VS Code for this article)

The Review Class

Let us start by creating an ObjectScript class that will define the data model for our customer reviews. To keep things simple, we will define only four %String fields: the customer's name, the product name, the body of the review, and the keywords we will generate. The class should extend %Persistent so that we can save its objects to disk.

Class DataTagging.Review Extends %Persistent
{
Property Name As %String(MAXLEN = 50) [ Required ];
Property Product As %String(MAXLEN = 50) [ Required ];
Property ReviewBody As %String(MAXLEN = 300) [ Required ];
Property Keywords As %String(MAXLEN = 300) [ SqlComputed, SqlComputeOnChange = ReviewBody ];
}

Since we want the Keywords property to be automatically computed on insert or update to the ReviewBody property, I am marking it as SqlComputed. You can learn more about computed values here.

The KeywordsComputation Method

We now want to define a method that will be used to compute the keywords based on the review body. We can use Embedded Python to interact directly with the official openai Python package. But first, we need to install it. To do so, run the following shell command:

<your-IRIS-installation-path>/bin/irispip install --target <your-IRIS-installation-path>/Mgr/python openai

We can now use OpenAI's chat completion API to generate the keywords:

ClassMethod KeywordsComputation(cols As %Library.PropertyHelper) As %String [ Language = python ]
{
    '''
    This method is used to compute the value of the Keywords property
    by calling the OpenAI API to generate a list of keywords based on the review body.
    '''
    from openai import OpenAI

    client = OpenAI(
        # Defaults to os.environ.get("OPENAI_API_KEY")
        api_key="<your-api-key>",
    )

    # Set the prompt; use few-shot learning to give examples of the desired output
    user_prompt = "Generate a list of keywords that summarize the content of a customer review of a product. " \
                + "Output a JSON array of strings.\n\n" \
                + "Excellent watch. I got the blue version and love the color. The battery life could've been better though.\n\nKeywords:\n" \
                + "[\"Color\", \"Battery\"]\n\n" \
                + "Ordered the shoes. The delivery was quick and the quality of the material is terrific!.\n\nKeywords:\n" \
                + "[\"Delivery\", \"Quality\", \"Material\"]\n\n" \
                + cols.getfield("ReviewBody") + "\n\nKeywords:"
    # Call the OpenAI API to generate the keywords
    chat_completion = client.chat.completions.create(
        model="gpt-4",  # Change this to use a different model
        messages=[
            {
                "role": "user",
                "content": user_prompt
            }
        ],
        temperature=0.5,  # Controls how "creative" the model is
        max_tokens=1024,  # Controls the maximum number of tokens to generate
    )

    # Return the array of keywords as a JSON string
    return chat_completion.choices[0].message.content
}
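The method above hardcodes the API key for brevity. As the comment in the code notes, the OpenAI client falls back to the OPENAI_API_KEY environment variable when no key is passed. Below is a minimal sketch of reading the key explicitly and failing early when it is missing. The helper name get_api_key is hypothetical, and the sketch assumes the variable is set in the environment of the IRIS process that runs the Embedded Python code.

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is unset or empty.

    Hypothetical helper: not part of the openai package or of IRIS.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key
```

Inside the class method, the result could then be passed as OpenAI(api_key=get_api_key()), which keeps the secret out of the class source.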

Notice how, in the prompt, I first give the general instruction that I want GPT-4 to "generate a list of keywords that summarize the content of a customer review of a product," and then provide two example inputs along with the desired outputs. I then insert cols.getfield("ReviewBody") and end the prompt with the word "Keywords:", nudging the model to complete the text by providing the keywords in the same format as the examples I gave it. This is a simple example of the Few-Shot Prompting technique.
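The same prompt layout can also be assembled from a list of examples, which makes it easier to add or swap few-shot examples later. This is only a sketch: the function name build_few_shot_prompt and its parameters are hypothetical, and it simply reproduces the instruction / example / "Keywords:" structure used in the method above.

```python
import json


def build_few_shot_prompt(instruction, examples, review_body):
    """Build a few-shot prompt: instruction, worked examples, then the new review.

    examples is a list of (review_text, keywords) pairs, where keywords
    is a list of strings that is rendered as a JSON array.
    """
    prompt = instruction + "\n\n"
    for review_text, keywords in examples:
        prompt += review_text + "\n\nKeywords:\n" + json.dumps(keywords) + "\n\n"
    # End with an open "Keywords:" so the model completes it in the same format
    prompt += review_body + "\n\nKeywords:"
    return prompt
```

Within the class method, the review_body argument would be cols.getfield("ReviewBody").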

I chose to store the keywords as a JSON string for the sake of simplicity of presentation; a better way to store them in production could be a DynamicArray, but I will leave this as an exercise to the reader.
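Because the model's reply is free text, it is not guaranteed to be a valid JSON array every time. A minimal sketch of a defensive parser follows; the name parse_keywords is hypothetical, and returning an empty list on malformed output is just one possible policy.

```python
import json


def parse_keywords(raw):
    """Return the keywords in raw as a list of strings.

    Returns [] when raw is not a string containing a JSON array;
    non-string and blank array items are skipped.
    """
    try:
        value = json.loads(raw.strip())
    except (json.JSONDecodeError, AttributeError):
        return []
    if not isinstance(value, list):
        return []
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]
```

Code that later reads the Keywords column could run the stored string through such a function before searching or aggregating on individual keywords.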

Generating Keywords

We can now test our data tagging application by inserting a row into our table using the following SQL query through the Management Portal:

INSERT INTO DataTagging.Review (Name, Product, ReviewBody)
VALUES ('Ivan', 'BMW 330i', 'Solid car overall. Had some engine problems but got everything fixed under the warranty.')

As you can see below, it automatically generated four keywords for us. Well done!

Conclusions

To summarize, the ability of InterSystems IRIS to embed Python code opens up a wide range of possibilities when dealing with unstructured data. Leveraging the power of OpenAI for automated data tagging is just one example of what one can achieve with this powerful feature. Automating this kind of tagging can reduce human error and improve overall efficiency compared with assigning keywords by hand.
