Article
· Oct 14, 2024 6m read

LLM Models and RAG Applications Step-by-Step - Part II - Creating the Context

We continue this series of articles on LLMs and RAG applications. In this article we will discuss the part boxed in red in the following diagram:

When creating a RAG application, choosing an LLM that suits your needs (training domain, cost, speed, etc.) is as important as having a clear understanding of the context you want to provide. Let's start by defining the term, to be clear about what we mean by context.

What is context?

Context refers to additional information obtained from an external source, such as a database or search engine, to supplement or enhance the responses generated by a language model. The language model uses this relevant external information to generate more accurate and detailed responses rather than relying solely on what it has learned during its training. Context helps keep responses up-to-date and aligned with the specific topic of the query.

This context can be information stored in a database with tools similar to those shown by our dear community member @José Pereira in this article or unstructured information in the form of text files with which we will feed the LLM, which will be the case we are going to deal with here.

How to generate the context for our RAG application?

The first and most essential thing is, obviously, to have all the information that we consider may be relevant for the possible queries that are going to be made against our application. Once this information is arranged in such a way that it is accessible from our application, we must be able to identify which of all the documents available for our context refer to the specific question asked by the user. For our example, we have a series of PDF documents (medication leaflets) that we want to use as a possible context for the questions of the users of our application.

This point is key to the success of a RAG application: answering from a totally wrong context damages a user's confidence just as much as answering with the generalizations and vagueness typical of an LLM. This is where our beloved vector databases come in.

Vector databases

You have probably heard of "vector databases" before, as if they were a new type of database, such as relational or document databases. Nothing could be further from the truth. These vector databases are standard databases that support vector data types as well as operations related to them. Let's see how this type of data will be represented in the project associated with the article:

Now let's take a look at how a record would be displayed:

Vector databases...what for?

As we explained in the previous article with LLMs, the use of vectors is key in language models, as they can represent concepts and the relationships between them in a multidimensional space. In the case at hand, this multidimensional representation will be the key to identifying which of the documents in our context will be relevant to the question asked.
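To give a feel for how "closeness" in that multidimensional space can be measured, here is a minimal sketch using cosine similarity, one common measure for comparing embeddings. The three-dimensional vectors and their names are invented for illustration only; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not produced by any real model)
question = [0.9, 0.1, 0.0]
fragment_a = [0.8, 0.2, 0.1]   # points in nearly the same direction as the question
fragment_b = [0.0, 0.1, 0.9]   # points in a very different direction

scores = {
    "fragment_a": cosine_similarity(question, fragment_a),
    "fragment_b": cosine_similarity(question, fragment_b),
}
# The fragment with the highest score is the best candidate for the context
best = max(scores, key=scores.get)
```

When the vectors are normalized to unit length, the denominator is 1 and the cosine similarity reduces to a plain dot product.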

Perfect, we have our vector database and the documents that will provide the context, now we just need to record the content of these documents within our database, but... With what criteria?

Models for vectorization

What? Another model? Isn't the LLM enough for us? Well... there's no need to bother our LLM with vectorizing the information of our context. For this task we can use smaller language models that are better suited to our needs, such as models trained to detect similarities between sentences. You can find a myriad of them on Hugging Face, each one trained on a specific set of data that will allow us to improve the vectorization of our data.

And if this does not convince you to use one of these models for vectorization, just say that generally this type of models...

Let's see in our example how we invoke the chosen model for these vectorizations:

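import os
import sentence_transformers

# Download the model only if it has not been saved locally yet
if not os.path.isdir('/app/data/model/'):
    model = sentence_transformers.SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
    model.save('/app/data/model/')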

Here we are downloading the model chosen for our case and saving it locally. This MiniLM model is multilingual, so we can vectorize text in both Spanish and English without any problem.
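The snippet above only covers the first run, when the model directory does not exist yet. As a hedged sketch of how both cases could be handled, the helper below loads the saved copy when it is present and downloads it otherwise. The function names resolve_model_source and load_model are hypothetical and not part of the project; the sketch assumes the sentence_transformers package is installed.

```python
import os

MODEL_DIR = "/app/data/model/"
MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"


def resolve_model_source(local_dir, hub_name):
    """Return the local directory if a saved model exists there, else the hub name."""
    return local_dir if os.path.isdir(local_dir) else hub_name


def load_model(local_dir=MODEL_DIR, hub_name=MODEL_NAME):
    """Load the model from disk when available; otherwise download and save it."""
    # Imported here so that the helper above can be used without the library
    from sentence_transformers import SentenceTransformer

    source = resolve_model_source(local_dir, hub_name)
    model = SentenceTransformer(source)
    if source == hub_name:
        # First run: keep a local copy for the next start-up
        model.save(local_dir)
    return model
```

Calling load_model() once at start-up would then return a usable model in both situations.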

Chunking

If you have already tinkered with language models, you have probably faced the challenge of chunking. What is chunking? Very simply, it is the division of the text into smaller fragments, each of which can carry a relevant meaning. By chunking our context, we can run queries against our vector database that retrieve the fragments of our context that may be relevant to the question asked.

What are the criteria for this chunking? Well, there really isn't a magic criterion that tells us how long our chunks have to be in order to be as accurate as possible. In our example we use a Python library provided by LangChain to perform this chunking, although any other method or library could be used:

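from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Fragments of at most 700 characters, sharing up to 50 characters with their neighbor
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=50,
)
# Load every PDF found in the directory, then split the loaded pages into fragments
path = "/app/data"
loader = PyPDFDirectoryLoader(path)
docs_before_split = loader.load()
docs_after_split = text_splitter.split_documents(docs_before_split)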

As you can see, the chosen size is 700 characters, with an overlap of up to 50 characters between consecutive fragments, so that text near a cut keeps some of its surrounding context. These fragments extracted from our documents are the ones we will vectorize and insert into our database.
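To make the effect of these two parameters concrete, here is a minimal sketch of a plain sliding-window splitter. It is deliberately simpler than RecursiveCharacterTextSplitter, which also tries to cut on separators such as paragraph breaks and spaces; chunk_text is a hypothetical helper written for this illustration, not part of LangChain.

```python
def chunk_text(text, chunk_size=700, chunk_overlap=50):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A 1500-character sample text built from a repeating alphabet
sample = "".join(chr(97 + i % 26) for i in range(1500))
fragments = chunk_text(sample)
```

With the default values, each window starts 650 characters after the previous one, so the last 50 characters of a fragment are repeated at the start of the next.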

This chunking process can be optimized as much as you want by means of "lemmatization", which transforms each word into its corresponding lemma (without tenses, plurals, gender, etc.) and thus removes some noise before the vector is generated. We are not going to go into that here; on this page you can see a more detailed explanation.

Vectorization of fragments

OK, we have the fragments extracted from each of our documents, so it's time to vectorize them and insert them into our database. Let's take a look at the code to understand how we could do it.

import numpy as np

# cursorIRIS and connectionIRIS are assumed to be an already open database cursor and connection
for doc in docs_after_split:
    # Vectorize the fragment, normalizing the resulting embedding
    embeddings = model.encode(doc.page_content, normalize_embeddings=True)
    array = np.array(embeddings)
    # Format each value with 12 decimal places so the vector can be sent as a string
    formatted_array = np.vectorize('{:.12f}'.format)(array)
    parameters = [
        doc.metadata['source'],
        str(doc.page_content),
        str(','.join(formatted_array)),
    ]
    cursorIRIS.execute("INSERT INTO LLMRAG.DOCUMENTCHUNK (Document, Phrase, VectorizedPhrase) VALUES (?, ?, TO_VECTOR(?,DECIMAL))", parameters)
connectionIRIS.commit()

As you can see, we carry out the following steps:

  1. We go through the list of all the fragments obtained from all the documents that will form our context.
  2. For each fragment, we vectorize the text (using the sentence_transformers library).
  3. We create an array with the numpy library, format each of its values, and join them into a comma-separated string.
  4. We insert the document information with its associated vector into our database. Note that we call the TO_VECTOR function, which converts the vector string we passed into the appropriate vector format.
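Steps 3 and 4 can also be sketched without numpy, which may make the string that reaches TO_VECTOR easier to picture. The helpers vector_to_sql_literal and build_insert_parameters are hypothetical names introduced for this illustration, and the sample values are invented.

```python
def vector_to_sql_literal(embedding, decimals=12):
    """Join the vector values into a comma-separated string with fixed decimals."""
    return ",".join(f"{value:.{decimals}f}" for value in embedding)


def build_insert_parameters(source, content, embedding):
    """Return the three parameters in the order expected by the INSERT statement."""
    return [source, str(content), vector_to_sql_literal(embedding)]


# Invented example: a fragment with a tiny two-dimensional "embedding"
parameters = build_insert_parameters(
    "/app/data/example.pdf",        # hypothetical file name
    "Example fragment text",
    [0.5, -0.25],
)
# parameters[2] is the string that would be passed to TO_VECTOR(?, DECIMAL)
```

The resulting list could be passed as the second argument of cursorIRIS.execute, exactly like the parameters list in the loop above.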

Conclusion

In this article we have seen why a vector database is needed to create the context for our RAG application. We have also reviewed how to split the information of our context into fragments and vectorize it so that it can be stored in that database.

In the next article we will see how to query our vector database based on the question that the user sends to the LLM and how, by searching for similarities, we will build the context that we will pass to the model. Don't miss it!
