Speed up your Multi-Channel GenAI Solution - Tips from real projects



Your agentic multi-channel, multi-modal GenAI solution idea may be buzzword-compatible. But to manage expectations: it will not run as fast as this dog. So we share some tips from our projects on how to speed up your GenAI solution.

Create an automated test framework

As you will change many parameters on your way to a production-ready solution, you should have a test framework in place that allows you to test your changes reproducibly. This will help you identify the impact of your changes and ensure that you do not introduce regressions. Some people like to play around in a Jupyter notebook and _think_ they have come up with the best solution. But you have to know which changes led to which results. We have seen in projects that sometimes small changes, e.g. in prompts, led to huge differences.


To include business stakeholders, prompt engineers, and developers alike while getting data-driven insights, this is the test cycle we use:

  1. Queries or example chats are written in user-friendly Excel sheets. This way, no one has to deal with CSV files or even code.

  2. The configuration, which means all prompts, the temperature, the model ID and other parameters, is stored in a configuration file. So it is reproducible and can be versioned.

  3. For each test run, 1) and 2) are stored in a repository and tagged with a version ID.

  4. The test is run.

  5. The automation framework calls Bedrock agents, Bedrock flows, or the Comprehend APIs directly.

  6. The results are stored with a reference to the version ID.

Your results are now: A) all chat histories or just the answers, and B) the timing of this very configuration.

By following this process, you can easily compare different configurations and results, as in the sketch below.
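To make this concrete, here is a minimal sketch of such a test run in Python with boto3 and pandas. The file names, the Excel column `query`, and the configuration keys are assumptions for illustration, not a fixed schema:

```python
import json
import time

import boto3
import pandas as pd

bedrock = boto3.client("bedrock-runtime")

config = json.load(open("config.json"))    # prompts, temperature, model ID, ...
queries = pd.read_excel("testcases.xlsx")  # business-friendly test input
version_id = config["version_id"]          # tagged together with the config

results = []
for query in queries["query"]:
    start = time.perf_counter()
    response = bedrock.converse(
        modelId=config["model_id"],
        system=[{"text": config["system_prompt"]}],
        messages=[{"role": "user", "content": [{"text": query}]}],
        inferenceConfig={"temperature": config["temperature"]},
    )
    results.append({
        "version_id": version_id,
        "query": query,
        "answer": response["output"]["message"]["content"][0]["text"],
        "duration_ms": (time.perf_counter() - start) * 1000,
    })

# Store answers and timings with a reference to the version ID.
pd.DataFrame(results).to_csv(f"results-{version_id}.csv", index=False)
```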

Know thy channel

The main communication channels used are:

  1. Synchronous: Voice - yes, picking up the phone is still a thing

  2. Semi-synchronous: Chat in any form. This can be a chat widget on your company website, WhatsApp, Facebook…

  3. Asynchronous: Email, fax, letter. As not all customers are as tech-friendly as the readers of this post, snail mail and fax are still a thing.

The timing criticality of the channels is different.

Voice timing criticality

Imagine you talk to someone behind a wall, ask a question and wait for an answer. How long will you tolerate the silence?

You can’t compare it to a normal face-to-face conversation, because you get no visual hints of “I am still thinking”.

The critical threshold here is about 3 seconds. So this is the time your GenAI solution has to answer a question.

Chat timing criticality

Here you have some more time to wait for an answer. In an interactive chat, you can wait up to 10 seconds.

If it’s clear to the customer that it is a long-running process, you can wait longer. That depends on the type of customer.

Mail timing criticality

Just take your time.

Choose the right model

That’s an easy one. And also a hard one.

The more complex your GenAI task is, the more capable the model has to be.

For instance, embeddings can be created with Amazon Titan, which is fast and also works for German text. But to answer complex questions, you have to use a model capable of those tasks, like Claude Sonnet.

This is why the test framework is vital to have: change models and see the impact.
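As an illustration of such a model swap, here is a minimal sketch, assuming the public model IDs below are enabled in your account and region:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Fast embedding model - also works for German text.
embedding_response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": "Wie lange ist die Garantiezeit?"}),
)
vector = json.loads(embedding_response["body"].read())["embedding"]

# More capable - and slower - model for answering complex questions.
answer_response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": "Wie lange ist die Garantiezeit?"}]}],
)
print(answer_response["output"]["message"]["content"][0]["text"])
```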

Use a progress bar

It’s not about the real time delay but about the perceived time delay. An “I am thinking” bubble in a chat buys you a few seconds. Be cautious with real progress bars if you do not know the exact time a task will take. A bar that stays at 90% for a long time is quite annoying.

For voice, you can also have an acoustic progress indicator. That can be something like “We are processing your request” or just “Please wait a few seconds”.
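A minimal sketch of this perceived-latency trick, where `send_to_channel` and `call_model` are hypothetical stand-ins for your channel integration and the actual model call:

```python
import asyncio

async def send_to_channel(text: str):
    print(text)  # stand-in for your chat widget / WhatsApp / voice integration

async def call_model(question: str) -> str:
    await asyncio.sleep(3)  # stand-in for the actual Bedrock call
    return f"Answer to: {question}"

async def answer_with_placeholder(question: str):
    # Instant feedback buys a few seconds of perceived time.
    await send_to_channel("… I am thinking …")
    await send_to_channel(await call_model(question))

asyncio.run(answer_with_placeholder("What is my data volume?"))
```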

Use cross-region inference

Cross-region inference distributes the traffic across different regions; see the Bedrock user guide: Increase throughput with cross-region inference.

We have seen a speed-up of 50% in some cases.
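In code, you opt in by using the geo-prefixed inference profile ID instead of the plain model ID. A sketch, assuming the `eu.` profile for Claude 3.5 Sonnet is available in your region:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

# The "eu." prefix selects the cross-region inference profile, so Bedrock
# routes requests across the regions of that geography.
response = bedrock.converse(
    modelId="eu.anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```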

Speed up Lambda

To prevent unnecessary local optimisation, you first have to know where the bottleneck is. If, like in a voice chat, every second counts and you have multiple Lambda functions, cold starts really can be a problem.

Python is the most-used language in Lambda, but it is not the fastest. So the new SnapStart for Python could be a solution.

Or translate your Python code to Go with the help of GenAI. As I have tested here, you can save up to 100 ms per Lambda call. With asynchronous calls, you do not have to care about this small amount of time. But with several Lambda functions and synchronous calls, this can make a difference.
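Before optimising, measure. Here is a small sketch that times repeated invocations to spot cold starts; the function name `my-voice-bot-step` is a hypothetical placeholder:

```python
import time

import boto3

lambda_client = boto3.client("lambda")

def timed_invoke(function_name: str) -> float:
    """Return the round-trip time of a synchronous invocation in ms."""
    start = time.perf_counter()
    lambda_client.invoke(FunctionName=function_name, Payload=b"{}")
    return (time.perf_counter() - start) * 1000

# The first call after a deployment usually includes the cold start;
# subsequent calls hit a warm execution environment.
for i in range(3):
    print(f"call {i}: {timed_invoke('my-voice-bot-step'):.0f} ms")
```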

Test agents

Agents are a promising technology to create a more human-like conversation. But they can be slow.

A trace from the AWS blog post _Introducing multi-turn conversation with an agent node for Amazon Bedrock Flows (preview)_ from January 22, 2025 shows that the agent node can be slow.

Some of the round-trip times are over 7 seconds, which is too long for a voice chat.

As you see, e.g. in Amazon Bedrock Serverless Prompt Chaining, Bedrock Flows are a user-friendly way to create agents. But with Step Functions and a chain of Lambda calls, you can speed up the process.
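To illustrate the chaining idea, here is a sketch that chains two direct Bedrock calls in sequence; in production, each step would be its own Lambda function orchestrated by Step Functions. The model ID and prompts are assumptions:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def ask(prompt: str) -> str:
    """One chain step: a single direct model call without agent overhead."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 256},
    )
    return response["output"]["message"]["content"][0]["text"]

# Each step feeds the previous answer into the next prompt.
summary = ask("Summarize the customer request: <request>My router blinks red.</request>")
answer = ask(f"Write a short support reply based on this summary: {summary}")
print(answer)
```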

But remember the quote:

Premature optimization is the root of all evil.

So first, get your solution working, then optimize.

Shorten prompts

This is the most vital point. Junior prompt engineers talk to models like they are humans. But as it is an algorithm, you have to be precise, and you do not have to be polite. So “Please dear model, I have a problem which I would like to discuss with you” is not the way to go.

Also use models to optimize your prompts. Use bullet points, use XML tags to separate data, and test the difference (with the test framework).

One more tip: you can also limit the output token length. This can have a huge impact on the speed.
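Putting both tips together, here is a sketch of a concise, XML-structured prompt with a tight `maxTokens` limit; the wording and model ID are illustrative assumptions:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Concise, XML-structured prompt instead of polite small talk.
prompt = (
    "Answer from the data only, in one sentence.\n"
    "<data>Routers can be returned within 14 days.</data>\n"
    "<question>What is the return policy for routers?</question>"
)

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    # A tight maxTokens caps the output length and thus the generation time.
    inferenceConfig={"maxTokens": 100, "temperature": 0},
)
print(response["output"]["message"]["content"][0]["text"])
```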

Conclusion

These are just some of the insights we have gained in our projects. Depending on the use case, there are lots of possibilities to speed up your GenAI solution.

Contact us to develop, optimize, and run GenAI solutions that are fast and reliable! Enjoy building!


Disclosures

This post has partly been supported by GenAI.
