
IRIS Vector Search for Matching Companies and Climate Action

Hey, community! 👋

We are a team of Stanford students applying technology to make sense of climate action. AI excites us because it lets us quickly analyze huge amounts of text.

As more sustainability reporting becomes required, such as corporate responsibility reports and financial statements, it can be challenging to cut through the noise of aspirations and get to the real action: what are companies actually doing?

That’s why we built a tool to match companies with climate actions scraped from company sustainability reports.

In this post, we’ll show you how to implement this tool as a chatbot interfacing with an InterSystems IRIS vector database.

  • Step 1: Setup
  • Step 2: Create database table
  • Step 3: Embed content and populate database
  • Step 4: Use vector search with user input
  • Step 5: Find climate actions

Step 1: Setup

Make sure to install Docker, and then follow the InterSystems community hackathon start guide. The guide walks you through creating a new database container and connecting to the locally hosted web database portal. Navigate to http://localhost:52773/csp/sys/UtilHome.csp and log in with username demo and password demo.
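Before moving on, you can verify that Python can reach your instance. Here is a minimal smoke test, assuming you have installed the intersystems-irispython driver and kept the default demo credentials:

import iris

# Connect with the demo credentials from the start guide
conn = iris.connect("localhost:1972/USER", "demo", "demo")
print("Connected!")
conn.close()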

Step 2: Create database table

Now, for the fun part!

First, we use Python to create the database table that stores our text chunks. We use the following script:

import os

import iris

### Vector Database Setup
username = "demo"
password = "demo"
hostname = os.getenv("IRIS_HOSTNAME", "localhost")
port = "1972"
namespace = "USER"
CONNECTION_STRING = f"{hostname}:{port}/{namespace}"
print(CONNECTION_STRING)

### Connect to IRIS
conn = iris.connect(CONNECTION_STRING, username, password)
cursor = conn.cursor()

### Create new table. Replace BASLABS with your own schema prefix.
embedTableName = "BASLABS.ClimateReportsEmbed"
tableDefinition = """
( 
source_url VARCHAR(1000),
page INT,
total_pages INT,
content VARCHAR(5000), 
content_embedding VECTOR(DOUBLE, 384)
)
"""

### Re-create the table if it already exists
CREATE_TABLE = True  # set to False to keep an existing table
if CREATE_TABLE:
    try:
        cursor.execute(f"DROP TABLE {embedTableName}")
    except Exception:
        pass  # the table did not exist yet
    cursor.execute(f"CREATE TABLE {embedTableName} {tableDefinition}")

  • We use iris.connect to create a SQL conn instance in Python and connect to the database. We can then use the cursor object to perform create/read/update/delete operations on the database.
  • We define the table structure as a string in the variable tableDefinition. Notably, we include a content_embedding field of type VECTOR(DOUBLE, 384). Make sure this number matches the dimension of your text embeddings; in this project, we use 384 to match the “all-MiniLM-L6-v2” embedding model from the SentenceTransformers package (you can confirm the dimension with the snippet below).
  • Finally, we use cursor.execute to create the table.
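If you are unsure which dimension your embedding model produces, you can ask the model directly. A minimal sanity check, assuming the sentence-transformers package is installed:

from sentence_transformers import SentenceTransformer

# The model reports its own output dimension; it must match VECTOR(DOUBLE, 384)
model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # prints 384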

Step 3: Embed content and populate database

Next, we will populate the database.

To do so, create a file with a list of PDF links you want to scrape. You can also use our list of links that we used during the Stanford TreeHacks hackathon. Your file should look something like this:

http://website1.xyz/pdf-content-upload1.pdf
http://website2.xyz/sustainability-report-2024.pdf
...

Then, we use LangChain's PyPDFLoader(url) to load the PDF content and a text splitter to break it into chunks.

import pandas as pd
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Database setup as above...

# Embedding model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

# Setup Text Splitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", "? ", "! "],
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

# Utility functions
def get_df(docs, pdf_url):
    data = [
        {
            "page": doc.metadata["page"],
            "total_pages": doc.metadata["total_pages"],
            "content": doc.page_content,
        }
        for doc in docs
    ]
    doc_df = pd.DataFrame(data)
    doc_df["source_url"] = pdf_url
    doc_embeddings = model.encode(doc_df["content"].tolist(), normalize_embeddings=True).tolist()
    doc_df["content_embedding"] = doc_embeddings
    return doc_df
    
def insert_df(doc_df):
    sql = f"""
    INSERT INTO {embedTableName}
    (source_url, page, total_pages, content, content_embedding)
    VALUES (?, ?, ?, ?, TO_VECTOR(?))
    """
    data = [
        (
            row["source_url"],
            row["page"],
            row["total_pages"],
            row["content"],
            str(row["content_embedding"]),
        )
        for index, row in doc_df.iterrows()
    ]
    cursor.executemany(sql, data)

# Main loop: load, split, embed, and insert each PDF
with open("pdf_links.txt") as f:
    links = f.read().splitlines()

for url in links:
    loader = PyPDFLoader(url)
    documents = loader.load()
    docs = text_splitter.split_documents(documents)
    print(f"Found {len(docs)} docs for url: {url}")
    df = get_df(docs, url)
    insert_df(df)

  • First, we set up an embedding model, which we import from sentence_transformers.
  • Next, we define how to split the text of the reports. We use LangChain's recursive text splitter, which tries to split on the provided separators in order until the chunks are small enough.
  • We then create two utility functions: get_df and insert_df. We use pandas dataframes to package the text content and its metadata for storage.
  • Finally, we loop over all links in the text file, loading each PDF using LangChain, splitting it into smaller chunks, embedding their contents, and inserting them into the IRIS Vector Database.
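Before moving on, it is worth confirming that the inserts actually landed. A quick sanity check, reusing the cursor and embedTableName from the script above:

# Count how many chunks were stored; should be > 0 after the loop
cursor.execute(f"SELECT COUNT(*) FROM {embedTableName}")
print(cursor.fetchone()[0], "chunks stored")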

Step 4: Use vector search with user input

Finally, we will create a chatbot wrapper that performs a vector search based on a description of a company and its climate goals. First, we package the database functionality into an IRIS_Database class:

import os

import iris
from sentence_transformers import SentenceTransformer

class IRIS_Database:
    def __init__(self):
        ### Vector Database Setup
        username = "demo"
        password = "demo"
        hostname = os.getenv("IRIS_HOSTNAME", "localhost")
        port = "1972"
        namespace = "USER"
        CONNECTION_STRING = f"{hostname}:{port}/{namespace}"

        self.conn = iris.connect(CONNECTION_STRING, username, password)
        self.cursor = self.conn.cursor()

        # Tables
        self.report_table = "BASLABS.ClimateReportsEmbed"

        # Embeddings
        self.embed = SentenceTransformer("all-MiniLM-L6-v2")

    def get_report_section(self, search_query, n=5):
        # Embed the query using the same embedding model as before
        search_vector = self.embed.encode(
            search_query, normalize_embeddings=True
        ).tolist()
        # SQL query to perform vector search
        sql = f"""
            SELECT TOP ? source_url, page, content
            FROM {self.report_table}
            ORDER BY VECTOR_DOT_PRODUCT(content_embedding, TO_VECTOR(?)) DESC
        """
        self.cursor.execute(sql, [n, str(search_vector)])
        # Format results
        results = self.cursor.fetchall()
        return [
            dict(
                source_url=row[0],
                page=str(row[1]),
                content=row[2],
            )
            for row in results
        ]
  • First, we set up the database connection as before in the __init__ method. Notice that we also load the embedding model so we can embed our search queries later on.
  • The get_report_section method does all of the heavy lifting. First, it embeds the query using the same model used when populating the database. Then, an SQL query performs the vector search. The magic lies in ORDER BY VECTOR_DOT_PRODUCT(content_embedding, TO_VECTOR(?)) DESC: because the embeddings are normalized, the dot product equals the cosine similarity, so the most similar chunks come first (demonstrated below).
  • Finally, we format the results as a list of Python dictionaries and return it to the user.
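Why does a plain dot product rank by cosine similarity here? Because normalize_embeddings=True scales every embedding to unit length, so the two measures coincide. A small illustrative check (the example sentences are our own):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["solar power", "renewable energy"], normalize_embeddings=True)

dot = float(np.dot(a, b))
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(dot, cosine)  # nearly identical, since |a| = |b| = 1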

Step 5: Find climate actions

As a final step, we can create a user-facing client application that calls the vector search. Below, we include a basic chat interface without a language model, but feel free to extend it to your needs! See our DevPost for more details on setting up an agentic tool-calling workflow using DAIN!

from IRIS_Database import IRIS_Database

def get_report_section(search_query: str) -> list[dict]:
    """
    Perform RAG search for report data.
    """
    db = IRIS_Database()
    return db.get_report_section(search_query)

# Ask the user for a company description, then show matching report chunks
while True:
    print("Please tell me a bit more about your company and your goals. (q to quit)")
    user_input = input("> ")
    if user_input.lower().startswith("q"):
        break

    results = get_report_section(user_input)
    print(f"Found {len(results)} matches!")
    for r in results:
        print(r)
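Printing raw dictionaries is fine for debugging, but a small helper makes the matches easier to read. A sketch (the 200-character preview length is an arbitrary choice):

def print_match(match: dict) -> None:
    # Show a short preview of the chunk plus its citation
    preview = match["content"][:200].replace("\n", " ")
    print(f"- {preview}...")
    print(f"  (source: {match['source_url']}, page {match['page']})")

Swap print(r) in the loop above for print_match(r) to use it.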

Wrapping up

And that’s it! We hope this inspires you to use embedding search to match individuals, companies, and even nations with more ambitious climate actions. Thank you to InterSystems for providing these services.

If you have any questions, please leave them in the comments below.

Happy InterSystems Hacking!

Authors: Alice, Suze, and Bubble
