Building a low-cost serverless Retrieval-Augmented Generation (RAG) solution

The problem

Large language models (LLMs) can generate complex text and solve numerous tasks such as question answering, information extraction, and text summarization. However, they may suffer from issues such as information gaps or hallucinations. In this blog article, we will explore how to mitigate these issues using Retrieval-Augmented Generation (RAG) and build a low-cost solution in the process.

Lack of information

LLMs are trained on large datasets and can memorize parts of their training data. Assuming a question and its answer are not overly complex, there are two main reasons why an LLM might still be unable to answer it. One reason is that the relevant information was not available in the training data at all. The other is that the relevant information was available in the training data but was likely pruned because of the relatively small model size, as smaller models retain less of their training data than their larger counterparts.

Hallucination

Hallucination in LLMs occurs when large language models generate random or false information to fill knowledge gaps. Let’s look at the following interaction with an LLM. Note that no context was provided prior to this interaction.

Copilot hallucination example

As can be seen, the LLM hallucinated and concluded that the color of the hat is green, despite there being no information to support that conclusion.

Here comes Retrieval-Augmented Generation

Both hallucination and lack of information can be mitigated through the use of Retrieval-Augmented Generation (RAG). The LLM is provided with additional context for the user prompt it should answer, which lowers (but does not eliminate) the chances of the model hallucinating as well as the chances of the model having no relevant information at all to answer the given prompt. To achieve that goal, RAG solutions require the following main components:

  • An embedding model: The embedding model converts text data into vector embeddings that are used to populate the knowledge database. A vector embedding is a vector-based representation of data (e.g. text, images, etc.) with the characteristic that similar items have similar vector embeddings. Similar items can then be identified by querying for similar vectors using similarity metrics such as the cosine similarity.
  • A knowledge database: The data used to fill the LLM context is first imported into the knowledge database and later retrieved from it. Vector databases are usually used as the knowledge database.
  • A generative model: An LLM that, given the user prompt and the context retrieved from the knowledge database, provides an answer to the user prompt.

Using these components, there are three phases to consider:

  • Import / indexing: During the import and indexing phase, which can be a one-time process or done continuously, data is loaded into the knowledge database after being adequately transformed using the embedding model.
  • Retrieval: During the retrieval phase, the user prompt is transformed into an embedding using the embedding model, and that embedding is used to retrieve relevant documents from the knowledge database.
  • Generation: In this phase, a context is built using retrieved documents, and the generative model is used to generate an answer to the user prompt using that context.
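
To make the three phases concrete, here is a minimal, framework-free sketch in Python. The names embed, knowledge_db and generate stand in for the embedding model, the knowledge database and the generative model and are purely illustrative:

def import_documents(chunks, knowledge_db, embed):
    # Import / indexing: embed each text chunk and store it in the knowledge database.
    for chunk in chunks:
        knowledge_db.add(vector=embed(chunk), text=chunk)

def answer(prompt, knowledge_db, embed, generate, top_n=5):
    # Retrieval: embed the user prompt and fetch the most similar chunks.
    relevant_chunks = knowledge_db.search(vector=embed(prompt), limit=top_n)
    # Generation: answer the prompt using the retrieved chunks as context.
    context = "\n\n".join(relevant_chunks)
    return generate(f"Context:\n{context}\n\nQuestion: {prompt}")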

Our low-cost serverless RAG solution

As mentioned in the introduction of this blog post, we want to build a low-cost serverless RAG solution. The following diagram represents the solution we want to build:

Low cost RAG architecture

We will specifically use the following components to build our low-cost serverless RAG solution:

  • Knowledge database: According to a former colleague, Amazon Athena is the best database (read: SQL querying engine) in the world. There is no reason for us not to use the best database in the world. We will store our data in Amazon S3 and query it using Amazon Athena. Amazon Athena is a serverless service, which makes it a good fit for our solution. It currently does not support querying vectors; however, we will see how to circumvent that limitation for our use case.
  • Embedding and generative models: For the embeddings generation as well as the answer generation, we will use Amazon Bedrock.
  • AWS Lambda functions: We will use AWS Lambda functions for all the computation and handling that happens in the different RAG phases.

Locality Sensitive Hashing is almost all we need

Our knowledge database, Amazon Athena, does not support similarity-based querying of vectors as of now. However, we can resort to a practical alternative: locality-sensitive hashing (LSH).
LSH is a fuzzy hashing technique that produces similar hashes for similar items with high probability and can be used to implement approximate nearest-neighbor vector search.

Instead of querying vectors, we will query locality-sensitive hashes using commonly available string similarity functions, in our case the Hamming distance, which is supported by our knowledge database.

One of the simplest and cheapest LSH functions is bit sampling. Given two bit vectors of the same size, the Hamming similarity corresponds to the number of positions in which the two vectors agree (both 0 or both 1). Embeddings are generally not bit vectors, so what can we do about that? We need to incorporate a binarization step into our LSH function. This can be done by defining thresholds: for example, for each vector dimension, consider everything below the threshold to be 0 and everything above it to be 1. The following diagram illustrates how, given vectors, locality-sensitive hashes can be computed through binarization with a fixed zero threshold for all dimensions, and how the Hamming similarity can then be used as a proxy for the cosine similarity. The computed similarity values are then normalized to the range 0-1.

Low cost RAG: vector similarity with LSH
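
As a concrete numeric illustration of the zero-threshold variant, here is a small Python sketch (the example vectors are made up):

import numpy as np

def binarize(vector: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    # Bit sampling with a fixed threshold: dimensions above the threshold become 1, the rest 0.
    return (vector > threshold).astype(np.uint8)

def hamming_similarity(bits_a: np.ndarray, bits_b: np.ndarray) -> float:
    # Fraction of positions in which both bit vectors agree, normalized to the range 0-1.
    return float(np.mean(bits_a == bits_b))

a = np.array([0.3, -1.2, 0.7, 0.1])
b = np.array([0.4, -0.9, 0.2, -0.3])
print(hamming_similarity(binarize(a), binarize(b)))  # 0.75, used as a proxy for the cosine similarity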

There are more advanced LSH schemes, but those are beyond the scope of this blog article. More information about LSH can be found in A Simple Introduction to Locality Sensitive Hashing (LSH).

For our solution, instead of using a custom LSH implementation, we rely on the lshashpy3 package. We slightly modify the main class, LSHash, as follows:

  • A random seed can be passed as an initialization parameter, which deterministically sets all random parameters of the hash function. This allows us to recreate the exact same hash function without having to manage file imports and exports.
  • We use a single hash table, so that our lsh is a 1-dimensional vector.
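
The exact modifications live in the companion repository; to convey the idea, here is a standalone sketch of a seeded, single-table random-hyperplane hash. It is illustrative only, not the actual lshashpy3 code, and the hash size and embedding dimension below are assumptions:

import numpy as np

class SeededLSH:
    # Illustrative stand-in for the modified LSHash: a single hash table whose
    # random hyperplanes are derived deterministically from a seed.
    def __init__(self, hash_size: int, input_dim: int, seed: int = 42):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((hash_size, input_dim))

    def hash(self, vector: list[float]) -> str:
        # Project onto the random hyperplanes and keep one bit per hyperplane.
        bits = (self.planes @ np.asarray(vector) > 0).astype(int)
        return "".join(map(str, bits))

lsh = SeededLSH(hash_size=64, input_dim=1536)  # 1536 assumes a Titan-style embedding dimension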

For the import phase we use the following code snippet to generate embeddings:


import json
import boto3

# EMBEDDING_MODEL_ID, CONTENT_TYPE and ACCEPT are constants defined elsewhere in the project.
bedrock_runtime = boto3.client("bedrock-runtime")

def get_embedding(
    text: str,
    model_id: str = EMBEDDING_MODEL_ID,
    content_type: str = CONTENT_TYPE,
    accept: str = ACCEPT,
) -> list[float]:
    # Invoke the Bedrock embedding model and return the embedding vector for the given text.
    body = json.dumps({"inputText": text})
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id, accept=accept, contentType=content_type)
    response_body = json.loads(response["body"].read())
    embedding = response_body.get("embedding")
    return embedding
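
During the import phase, each chunk’s embedding is then hashed into the lsh value stored next to the chunk text. A minimal usage sketch, reusing the seeded hash sketched above (names are illustrative):

chunk = "Amazon Athena is a serverless SQL querying engine."
embedding = get_embedding(chunk)
row = {"text": chunk, "lsh": lsh.hash(embedding)}  # lsh: the seeded single-table hash from the sketch above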

The knowledge database structure

We use a single Amazon Athena table as the knowledge database. The table has the following structure to store text chunks for retrieval:

  • text: A text chunk from the document. Text chunks overlap each other to ensure that relevant information is not lost due to an improper split.
  • lsh: A locality-sensitive hash built from the embedding generated for the text.
  • Some other attributes such as document ID, chunk position, etc., that are not relevant for the basic functionality of our RAG solution.
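
How the rows reach Athena is flexible; one possible way (not necessarily what the companion repository does) is to write them to S3 as Parquet and register them in the Glue catalog with awswrangler. All names below are placeholders:

import awswrangler as wr
import pandas as pd

# Example rows following the table structure above.
rows = pd.DataFrame([
    {"text": "first chunk ...", "lsh": "01101...", "document_id": "doc-1", "chunk_position": 0},
    {"text": "second chunk ...", "lsh": "01111...", "document_id": "doc-1", "chunk_position": 1},
])

wr.s3.to_parquet(
    df=rows,
    path="s3://my-rag-bucket/knowledge/",  # placeholder S3 prefix
    dataset=True,
    database="my_rag_database",            # placeholder Athena/Glue database
    table="my_rag_table",                  # placeholder Athena table
)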

To query our knowledge database, we use the following SQL statement that queries text chunks with the highest Hamming similarity using a score range of 0 to 100:

WITH scored_documents AS (
    SELECT
        "lsh", "text",
        (length(lsh) - hamming_distance(lsh, '{query_lsh}')) * 100.0 / length(lsh) score
    FROM
        "awsdatacatalog"."{athena_database}"."{athena_table}"
)

SELECT * FROM scored_documents
WHERE
    score >= {score_threshold}
ORDER BY score DESC
LIMIT {top_n_documents}
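
To run this query from a Lambda function, one option is the boto3 Athena client. The sketch below starts the query, polls until it finishes and reads back the scored chunks (database and output location are placeholders):

import time
import boto3

athena = boto3.client("athena")

def query_knowledge_database(sql: str, database: str, output_location: str) -> list[list[str]]:
    # Start the query and remember its execution ID.
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    # The first returned row contains the column headers, so skip it.
    result_rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    return [[column.get("VarCharValue", "") for column in row["Data"]] for row in result_rows[1:]]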

The augmentation and generation phase

For the augmentation and generation step we incorporate the user query and the retrieved documents into the following prompt template and let the chat model generate an answer:

You are a friendly AI-Bot and answer queries about any topic within your knowledge and particularly within your context.
Your answers are as exact and brief as possible.
In case you are not able to answer a query, you clearly state that you do not know the answer.

Answer the following query by summarizing information within your context:
{{{query}}}

You can use the following information to answer the query:
{{{documents}}}
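
A sketch of this step could fill the template and call a Bedrock chat model through the Converse API. Here, build_prompt is a hypothetical helper that fills the template above, and the model ID is only an example, not necessarily what the companion repository uses:

prompt = build_prompt(query=user_query, documents=retrieved_chunks)  # hypothetical template-filling helper

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID, an assumption
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
answer = response["output"]["message"]["content"][0]["text"]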

The full code is available in the companion repository on GitHub.

When deployed, the stack outputs the S3 URI where documents can be uploaded to be added to the knowledge database.

Outputs:
LowCostServerlessRAGStack.DocumentImportFolder = s3://lowcostserverlessragstack-rag70e8-pr5zms/input/
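
Any document uploaded under that prefix is then picked up for import, for example with boto3 (file name, bucket and key below are placeholders; use the values from your own stack output):

import boto3

s3 = boto3.client("s3")
# Upload a document into the import folder reported by the stack output.
s3.upload_file("my-document.txt", "<bucket-from-stack-output>", "input/my-document.txt")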

Conclusion

In this blog article, we discussed how lack of information and hallucination can be issues when working with LLMs and how RAG solutions can help alleviate them. We also looked at how locality-sensitive hashing (LSH) can serve as a substitute for vector querying in any database that does not support vectors but provides at least some string similarity functions.

While the solution is fully functional, there are some limitations to consider. Each query to the knowledge database (Athena) runs a comparison against every document available in the database. Systems with native vector querying are generally better at this, as they can index vectors and only run comparisons on relevant candidates. Another limitation is the number of parallel queries one can run on Athena, so this solution only makes sense for a relatively low number of concurrent users. On the other hand, the pricing model of Athena, which is mainly based on the amount of data scanned by queries, gives this solution very low idle costs.

— Franck

