Embedded Embeddings Database: Building a low cost serverless RAG solution

The problem

Retrieval-Augmented Generation (RAG) solutions are an impressive way to talk to one's data. One of the challenges of RAG solutions is the associated cost, which is often driven by the vector database. In a previous blog article I presented how to tackle this issue by using Athena with Locality Sensitive Hashing (LSH) as a knowledge database. One of the main limitations of Athena is its query latency and the low number of concurrent queries it supports. In this new blog article, I present a new low-cost serverless solution that uses an embedded vector database, SQLite, to keep costs low while maintaining high concurrency.

Embedding databases

The following diagram presents the solution architecture.

Figure: SQLite RAG serverless solution architecture

The solution uses a SQLite database that resides in an S3 bucket at rest. For queries and imports, the SQLite database is loaded into the corresponding Lambda function. The import Lambda function ensures that all changes are uploaded back to the S3 bucket.

At the infrastructure level, we need to ensure that the Lambda-import function is never invoked concurrently, as this could otherwise result in data loss. Limiting a function to a single concurrent instance can be seen as an anti-pattern, and AWS currently does not allow it for a Lambda function that is triggered by an SQS queue. As a workaround, we use two queues:

  • S3 queue: a queue that collects all object-creation-related events, in our case all created-file events, so that the files can later be added to the RAG knowledge database.
  • Import queue: a FIFO queue with a single message group ID that triggers the Lambda-import function. Having a single group ID for all messages in the queue guarantees that the queue never triggers multiple concurrent instances of the Lambda function (see the sketch below).
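
How the S3 queue's messages end up on the import queue is handled by the infrastructure code in the companion repository. As a hedged illustration of the key idea, a forwarding step could send every message to the FIFO queue with the same, fixed message group ID (the queue URL, function name and group ID below are assumptions):

import hashlib
import json

import boto3

sqs = boto3.client("sqs")
# Hypothetical URL of the FIFO import queue
IMPORT_QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/rag-import-queue.fifo"

def forward_to_import_queue(message_body: dict) -> None:
    serialized = json.dumps(message_body)
    sqs.send_message(
        QueueUrl=IMPORT_QUEUE_URL,
        MessageBody=serialized,
        # A single, fixed group ID means messages are processed strictly one
        # after another, so Lambda-import is never invoked concurrently.
        MessageGroupId="rag-import",
        # FIFO queues require a deduplication ID unless content-based
        # deduplication is enabled on the queue.
        MessageDeduplicationId=hashlib.sha256(serialized.encode()).hexdigest(),
    )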

Delayed database updates: The Lambda-query function only downloads the SQLite database during cold starts. This means that even if the database is a few hundred MB, warm starts aren't affected by the download time of the database file at all, allowing the Lambda function to query the vector database very quickly. However, it also means the local copy of the database is only synced during cold starts. Under intensive usage, it can take up to a few hours for all active Lambda instances to be stopped or recycled, so the Lambda-query function might, in some edge cases, serve a stale version of the database for a few hours. This isn't much of an issue, as imports for RAG solutions generally don't need to be reflected in near real time.

Getting the right Lambda environment

The whole solution relies on the sqlite-vss extension, which adds vector search functionality to SQLite. To use it, one needs to install the corresponding Python package and load the extension into the SQLite connection.

The first challenge when trying to load the extension in a Lambda function is that the Python runtime provided by AWS Lambda is not built with SQLite extension loading enabled, which results in the following error when trying to load an extension into SQLite:

SQLite load extension error: AttributeError: 'sqlite3.Connection' object has no attribute 'enable_load_extension'
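
A quick way to check whether a given Python build supports SQLite extension loading, for example from a scratch Lambda function or a local container, is the following snippet:

import sqlite3

# On Python builds compiled without SQLite extension support, the connection
# object simply has no enable_load_extension method.
connection = sqlite3.connect(":memory:")
print(hasattr(connection, "enable_load_extension"))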

To solve this issue, we use a custom Docker image on which we install the runtime interface client for Python. The following snippet shows the part of the Dockerfile that is relevant to our solution.

...
# ### Customization start ###
# Install sqlite-vss dependencies
RUN apt update && apt install libgomp1 libatlas3-base -y

# Copy python requirements file
COPY requirements.txt ${LAMBDA_TASK_ROOT}

# Install python requirements
RUN python3 -m pip install -r requirements.txt --target ${FUNCTION_DIR}

# Copy python code to target dir
COPY src ${FUNCTION_DIR}/

# ### Customization end ###
...

Database design and interactions

We now have a Docker-based Lambda image in which we can load SQLite extensions.

Let's define our database table structure. We need two tables: a main table and an embeddings table. Why don't we simply use a single table? In short, it's an sqlite-vss restriction: the vss0 virtual table only holds the embeddings, so the document text and metadata need to live in a separate, regular table.

-- main table
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    text TEXT,
    timestamp DATETIME
);

-- embeddings table
CREATE VIRTUAL TABLE IF NOT EXISTS vss_documents USING vss0(
  text_embedding({embedding_size})
);
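
In the Python code below, these statements are referenced as module-level constants. The constant names are taken from the snippets in this article; roughly, they could be defined as follows:

# DDL for the main table holding the document text and metadata
SQL_CREATE_DOCUMENTS_TABLE = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    text TEXT,
    timestamp DATETIME
)
"""

# DDL template for the vss0 virtual table; {embedding_size} is filled in at runtime
SQL_CREATE_VSS_DOCUMENTS_TABLE_TEMPLATE = """
CREATE VIRTUAL TABLE IF NOT EXISTS vss_documents USING vss0(
    text_embedding({embedding_size})
)
"""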

Before doing anything with the database, we have to make sure that the sqlite-vss extension is loaded into the database. We use the following code snippet to initialize the database by loading the sqlite-vss extension and creating all needed tables.

import sqlite3
import sqlite_vss
...
def initialize_db(database: str | bytes = ON_DISK_DATABASE, embedding_size: int = EMBEDDING_SIZE):
    db: sqlite3.Connection = sqlite3.connect(database)
    # Enable extension loading only for the time it takes to load sqlite-vss
    db.enable_load_extension(True)
    sqlite_vss.load(db)
    db.enable_load_extension(False)
    # Create the documents table and the vss0 embeddings table if they don't exist yet
    db.execute(SQL_CREATE_DOCUMENTS_TABLE)
    db.execute(SQL_CREATE_VSS_DOCUMENTS_TABLE_TEMPLATE.format(embedding_size=embedding_size))
    return db
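
A typical call then looks like this; the path and the embedding size are assumptions that depend on where the database file is stored locally and on the embedding model in use:

# Open (or create) the local database file and make sure both tables exist
db = initialize_db("/tmp/rag.db", embedding_size=1536)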

To query similar embeddings from the database, we use the following SQL statement:

-- SQL_QUERY_TEMPLATE_STARTING
WITH matched_documents (rowid, distance) AS (
    SELECT
        rowid,
        distance
    FROM vss_documents
    WHERE
        vss_search(
            text_embedding,
            vss_search_params(
                ?,  -- embedding
                {top_n_documents}
            )
        )
    ORDER BY distance ASC
)

SELECT
    d.*,
    m.distance
FROM matched_documents m
LEFT JOIN documents d ON m.rowid=d.rowid
WHERE
    m.distance <= {distance_threshold}
ORDER BY m.distance ASC

The common table expression matched_documents contains the rowids and distances of the matching documents. We join it against the documents table to retrieve the text of each matching document.

We are limited in the way we can query the embeddings table in SQLite versions prior to 3.41.0: due to a bug in those versions, a LIMIT clause cannot be used on the embeddings table because it is a virtual table, so the number of results has to be limited via vss_search_params, as shown in the previous SQL snippet.

When trying to query the embeddings before having saved any documents into the embeddings table, you may encounter the following error from the sqlite-vss extension, depending on your SQLite version:

sqlite-vss query error: what(): Error in virtual void faiss::IndexFlat::search(faiss::idx_t, const float*, faiss::idx_t, float*, faiss::idx_t*, const faiss::SearchParameters*) const at … Error: 'k > 0' failed

The error seems to occur with SQLite versions prior to 3.41.0. To be on the safe side, we avoid querying the knowledge database when it is empty. We use the following Python function to query documents given an embedding:

def query_db_documents(
    embedding: list[float] | np.ndarray,
    connection: sqlite3.Connection,
    top_n_documents: int = TOP_N_DOCUMENTS,
    distance_threshold: float = MAX_DISTANCE_THRESHOLD,
) -> list[dict]:
    serialized_embedding = get_serialized_embedding(embedding)

    SQL_QUERY_VSS_DOCUMENTS_COUNT = """SELECT count(*) FROM vss_documents"""

    query = SQL_QUERY_DOCUMENTS_TEMPLATE.format(top_n_documents=top_n_documents, distance_threshold=distance_threshold)

    cursor: sqlite3.Cursor = connection.cursor()
    cursor.row_factory = sqlite3.Row

    # Guard against the faiss 'k > 0' error: only run the vector search
    # if the embeddings table already contains rows.
    vss_documents_count: int = cursor.execute(SQL_QUERY_VSS_DOCUMENTS_COUNT).fetchone()[0]

    result: list[dict] = []
    if vss_documents_count > 0:
        result_rows = cursor.execute(query, [serialized_embedding])
        result = [dict(item) for item in result_rows]
    return result
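
The helper get_serialized_embedding is not shown here. sqlite-vss can read vectors passed as JSON arrays, so a minimal version of the helper could simply serialize the embedding to JSON (the companion repository may serialize it differently, for example as a raw float32 blob):

import json

import numpy as np

def get_serialized_embedding(embedding: list[float] | np.ndarray) -> str:
    # Serialize the embedding as a JSON array, a format sqlite-vss understands
    # for both inserts and vss_search_params queries.
    if isinstance(embedding, np.ndarray):
        embedding = embedding.tolist()
    return json.dumps(embedding)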

We obviously need to be able to add documents to our knowledge database. The SQL snippets for inserts are quite straightforward:

-- SQL_INSERT_DOCUMENT
INSERT INTO documents(text, timestamp)
VALUES(:text, :timestamp)

-- SQL_INSERT_VSS_DOCUMENT
INSERT INTO vss_documents(rowid, text_embedding)
VALUES(:rowid, :text_embedding)

We use them as-is with the following Python function to save new documents into the knowledge database.

def save_document_into_db(
    document: dict,
    connection: sqlite3.Connection,
    sql_insert_document_query: str = SQL_INSERT_DOCUMENT,
    sql_insert_vss_document_query: str = SQL_INSERT_VSS_DOCUMENT,
):

    document_clone = document.copy()
    if "timestamp" not in document_clone:
        document_clone["timestamp"] = datetime.datetime.now()
    embedding: list[float] | np.ndarray = document_clone.pop("embedding")
    serialized_embedding = get_serialized_embedding(embedding)

    document_lastrowid = connection.execute(sql_insert_document_query, document_clone).lastrowid
    # Bind the parameters by name to match the named placeholders in the SQL statement
    vss_document_lastrowid = connection.execute(
        sql_insert_vss_document_query,
        {"rowid": document_lastrowid, "text_embedding": serialized_embedding},
    ).lastrowid

    return document_lastrowid == vss_document_lastrowid
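
Putting both inserts together, saving a single chunk of text with its embedding looks roughly like this (the text and embedding values are placeholders):

document = {
    "text": "SQLite is a small, fast, self-contained SQL database engine.",
    # Placeholder vector; in practice this comes from your embedding model
    "embedding": np.random.rand(1536).tolist(),
}
save_document_into_db(document, connection=db)
# Depending on how the connection is configured, an explicit commit may be
# needed before the database file is uploaded back to S3
db.commit()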

The documents table and the embeddings table share the same rowid values. This is another limitation of SQLite/sqlite-vss.

So far, we have assumed that the SQLite database was available locally. However, this is only the case during warm starts of our Lambda functions. During cold starts, the database has to be downloaded from the S3 bucket and saved locally. We do this with the following Python code:

def get_s3_file_locally(bucket: str, key: str, local_path: str = DEFAULT_LOCAL_DB_PATH):
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        with open(local_path, "wb") as file_out:
            s3_client.download_fileobj(bucket, key, file_out)
    except botocore.exceptions.ClientError as err:
        if err.response["Error"]["Code"] == "404":
            # The file does not exist yet, so we will just create a new one locally
            print("File not existing yet, but it is fine!")
        else:
            raise

    return local_path

It is totally fine if the database does not exist yet, as it will be created locally by the initialization function.
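
Tying the download and the initialization together, the connection can be kept at module level so that only cold starts pay for the S3 download. A simplified sketch, assuming the bucket and key are passed in as environment variables:

import os

# Assumed environment variable names; the actual stack may wire these differently
SQLITE_DB_S3_BUCKET = os.environ["SQLITE_DB_S3_BUCKET"]
SQLITE_DB_S3_KEY = os.environ["SQLITE_DB_S3_KEY"]

# Module-level code only runs during cold starts, so warm invocations reuse
# the already downloaded database file and the open connection.
local_db_path = get_s3_file_locally(SQLITE_DB_S3_BUCKET, SQLITE_DB_S3_KEY)
db = initialize_db(local_db_path)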

Lambda handlers

The Lambda-query function isn't much different from the one used for the Athena-based RAG solution, so we will only have a look at the Lambda-import function.

The lambda_handler is only invoked once the SQLite database is available locally; as previously mentioned, the download from S3 only happens during cold starts. The lambda_handler processes all records and collects the items that couldn't be processed so that they can be reported back to SQS. The S3 record processing function loads and cleans the file, converts it into database items with the corresponding embeddings, and saves the documents into the database (a sketch of this function follows the handler code below).

def lambda_handler(event: dict[str, object], context: dict[str, object]):
    sqs_batch_response: dict = {"batchItemFailures": []}

    silenced_errors = []
    if event:
        records = event["Records"]
        batch_item_failures = []

        for record in records:
            # Each import queue message wraps the original messages from the S3 queue
            record_body_reconstructed: dict[str, str | list | dict] = json.loads(record["body"])
            try:
                sqs_records: list[dict] = record_body_reconstructed["Records"]
                for sqs_record in sqs_records:
                    # Each forwarded message in turn contains the original S3 event records
                    s3_records: list[dict] = json.loads(sqs_record["body"])["Records"]
                    for s3_record in s3_records:
                        process_single_s3_record(s3_record)
            except Exception as e:
                # Report the failed message back to SQS and keep the error for logging
                batch_item_failures.append({"itemIdentifier": record["messageId"]})
                silenced_errors.append(str(e))

        sqs_batch_response["batchItemFailures"] = batch_item_failures
        # Persist all changes back to S3 once the whole batch has been processed
        upload_file(SQLITE_DB_S3_BUCKET, SQLITE_DB_S3_KEY)

    if silenced_errors:
        print("SILENCED ERRORS: ", silenced_errors)

    return sqs_batch_response
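
process_single_s3_record is where the actual import work happens. Its exact implementation depends on the file formats and the embedding model used; the outline below is a hedged sketch in which load_and_clean_file, split_into_chunks and get_embedding are hypothetical helpers:

def process_single_s3_record(s3_record: dict) -> None:
    # S3 event notifications carry the bucket name and object key of the new file
    bucket = s3_record["s3"]["bucket"]["name"]
    key = s3_record["s3"]["object"]["key"]

    # Hypothetical helpers: download and clean the file, then split it into chunks
    text = load_and_clean_file(bucket, key)
    chunks = split_into_chunks(text)

    for chunk in chunks:
        document = {
            "text": chunk,
            "embedding": get_embedding(chunk),  # hypothetical call to the embedding model
        }
        save_document_into_db(document, connection=db)
    db.commit()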

At the end of the Lambda-import function, the local SQLite database is uploaded to the S3 bucket. Since the Lambda-import function is never triggered concurrently and is the only one that updates the database file on S3, a running instance of this Lambda function is guaranteed to have the latest version of the database and can therefore overwrite the version in S3 without any issues.
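
upload_file itself only needs to be a thin wrapper around the S3 client; a minimal sketch, assuming the same default local path as the download helper:

def upload_file(bucket: str, key: str, local_path: str = DEFAULT_LOCAL_DB_PATH):
    # Overwrite the database object in S3 with the freshly updated local copy
    s3_client.upload_file(local_path, bucket, key)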

The full code is available in the companion repository on GitHub.

When deployed, the stack outputs the S3 URI where documents can be uploaded to be added to the knowledge database, along with some other helpful information.

Outputs:
LowCostServerlessRAGStack.DocumentImportFolder = s3://lowcostserverlessragsqlitestack-rag70e8-pr5zms/input/
LowCostServerlessRAGSQLiteStack.LambdaQueryName = LowCostServerlessRAGSQLit-RAGDocumentQuery415C904-4kzeQQkSzUOk

Conclusion

In this blog article, we discussed how it is possible to get away from classical knowledge databases while still having a highly concurrent, low-cost solution. The presented solution comes with a few limitations: the overall database size cannot exceed 10 GB, a single column cannot contain more than 1 GB of data, and the cold-start time of the Lambda functions grows with the database size. If most queries hit cold starts, it may be helpful to bake the database into the Docker image and update it daily, instead of downloading it from S3, to further reduce costs. Nevertheless, this solution provides some nice advantages: it scales as far as the compute layer allows and incurs almost no idle costs, which makes it possible to run a highly scalable solution that does not need to be shut down when not in use.

The author of sqlite-vss is now working on an extremely small successor that should run everywhere SQLite runs, without any extra dependencies: sqlite-vec. You may want to have a look at it.

— Franck

