Reliable Data Streaming on AWS
This content is more than 4 years old and the cloud moves fast so some information may be slightly out of date.
Data streaming keeps growing in importance in our digital world, if it is not essential already. Besides performance, throughput and security, reliability is a major point to consider for any data streaming system.
Today, multiple tools and software platforms promise to handle streaming data for us, provided they are configured accordingly. That configuration is not the easiest task when it comes to setting up and operating a data streaming platform. In this blog post we introduce two platforms within the AWS Cloud that make reliable data streaming a convenient task: Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis.
Data Streaming Platform Configuration Trouble
Before we elaborate on why managed services like Amazon MSK and Amazon Kinesis give us a speed advantage when building a reliable data streaming platform, we first need to discuss the typical tasks and infrastructure we have to manage when we build our platform from scratch.
Apache Kafka or Managing Two Clusters
Apache Kafka is a widely known data streaming platform, and its name is often used synonymously with data streaming platforms today. What many don’t know is that setting up the infrastructure and configuration for Apache Kafka is a cumbersome task. In order to operate a reliable Apache Kafka cluster, we also need to operate an Apache ZooKeeper cluster that manages our Apache Kafka broker instances. In short, we need to manage and configure two clusters to build a reliable data streaming platform based on Apache Kafka, and cluster management is rarely a zero-maintenance job.
The following architecture diagram shows how many components we need to manage, configure and operate to build a reliable data streaming platform with Apache Kafka at its core. As the diagram shows, we have at least six instances to manage. We are not talking about containers in this post, because containers would raise the complexity even further and we would need to manage that Docker cluster as well.
Disclaimer: This could also be managed using AWS services like Amazon ECS or Amazon EKS; nevertheless, such a solution would add yet another cluster.
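To get a feeling for why small clusters usually come in groups of three nodes, here is a minimal Python sketch of the majority-quorum rule that Apache ZooKeeper relies on. Splitting the six instances into three ZooKeeper nodes and three Kafka brokers is our assumption for illustration, and the function name is hypothetical.

```python
def zookeeper_tolerated_failures(ensemble_size: int) -> int:
    """Nodes that may fail while a strict majority (quorum) of the ensemble survives."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


if __name__ == "__main__":
    # Assumed ensemble sizes, for illustration only.
    for size in (1, 3, 4, 5):
        failures = zookeeper_tolerated_failures(size)
        print(f"{size} node(s) tolerate {failures} failure(s)")
```

Note that an ensemble of four nodes tolerates no more failures than an ensemble of three, which is why odd ensemble sizes are the usual choice.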
Apache Kafka - Configuration Considerations
As the infrastructure diagram already shows, we have a lot of work in front of us, and the configuration of Apache Kafka itself is no small part of it. We need to consider the performance requirements for our data streaming platform and therefore need to configure and tune our platform in the following areas:
- Linux Filesystem Configuration and Tuning
- Java Garbage Collection Configuration and Tuning
- Hardware Specifications of the block devices
- Network Throughput and Connectivity
Once we have tuned our Apache Kafka cluster for performance, we also need to configure it for reliability, which is the main topic of this blog post. As you might guess, tuning for reliability means making some compromises on the performance of our cluster. For reliability we need to consider at least the following configuration options:
- Replication Factor within the Cluster or on the Topic
- Rack Awareness Configuration of our Brokers
- Minimum number of In-Sync Replicas, including replication lag
- Unclean vs. Clean Leader Election
- Message Acknowledgement of the Cluster
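To make the trade-off between these options more tangible, here is a minimal Python sketch. The configuration keys are standard Apache Kafka setting names, but the values, the rack label and the helper function are assumptions chosen for illustration, not a recommended production setup.

```python
# Illustrative broker/topic settings (values are assumptions, not recommendations).
BROKER_SETTINGS = {
    "default.replication.factor": 3,            # copies of each partition
    "min.insync.replicas": 2,                   # replicas that must confirm a write
    "unclean.leader.election.enable": "false",  # never elect an out-of-sync leader
    "broker.rack": "az-1",                      # hypothetical rack / Availability Zone label
}

# Producer side: wait for all in-sync replicas before a write counts as successful.
PRODUCER_SETTINGS = {"acks": "all"}


def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas of a partition that may be lost while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min_insync_replicas must be between 1 and replication_factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    failures = tolerated_broker_failures(
        BROKER_SETTINGS["default.replication.factor"],
        BROKER_SETTINGS["min.insync.replicas"],
    )
    print(f"Writes survive the loss of {failures} replica(s)")
```

With a replication factor of 3 and a minimum of 2 in-sync replicas, one replica can be lost before producers using acks=all are rejected. Raising the minimum to 3 strengthens durability, but a single broker failure then blocks writes.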
Once we have figured out all these settings according to our needs, we are ready to automate the deployment of our individual nodes. This includes infrastructure as code definitions as well as configuration management applied on top of the provisioned infrastructure, in order to get rid of almost every manual operations task. This can be a nice adventure and we might learn a lot along the way, because we will fail a lot before we arrive at a production-ready solution.
Note: I’m not saying you shouldn’t do it for the pure purpose of education. Nevertheless, if you want to focus on your application and speed of deployment, you should follow the next chapters on AWS managed data streaming platforms.
AWS Managed Services for Help
AWS gives us two managed options to build a reliable data streaming platform. The following chapters give you a quick introduction to those two platforms and lead you to our basic IaC implementations, so you can quickly start a test deployment and evaluate those services for your application integrations.
Amazon MSK - The Easy Apache Kafka Deployment
Amazon MSK takes care of all the considerations mentioned above around a reliable Apache Kafka cluster and follows best practices for the setup. As a bonus, AWS completely manages the infrastructure for you and provides managed endpoints that you can quickly connect to in order to get started with your application development or integration. To learn more about Amazon MSK and its features, you might want to look up the documentation or reach out to me directly.
Note: Just today AWS also announced support for Kafka Version 2.7.0 on Amazon MSK Clusters. As of now Amazon MSK supports Versions 1.1.1, 2.2.1 and 2.7.0 (subject to change when you revisit this blog in the future, stay up to date with AWS Announcements or contact us).
To get started with our sample code just clone our public repository on GitHub and deploy the Infrastructure to your AWS environment in almost no time at all.
Note: For deployment instructions just follow the README file within the repository.
Amazon Kinesis - The Cloud Native Data Streaming Service
With Amazon Kinesis, AWS offers us another option for our data streaming platform that is native to the cloud and especially to AWS. It is a fully managed service built by AWS for data or video streaming.
Note: Within this blog post we focus on the data streaming capability of Amazon Kinesis, named Amazon Kinesis Data Streams.
It also comes with additional services and features like Amazon Kinesis Data Firehose for easy streaming data ingestion into your data lake or data warehouse for batch data analytics. Another feature is Amazon Kinesis Data Analytics, which allows you to execute SQL queries directly on your data streams.
For now, we will focus on the data streaming capabilities of Amazon Kinesis, as these can be mapped to Apache Kafka counterparts. Therefore, we quickly introduce the Amazon Kinesis concepts in the following list to give you an idea of how you can work with Amazon Kinesis for your streaming applications.
- Retention Period: Time a data record is stored within the Kinesis Data Stream
- Shard: A Kinesis Stream is composed of one to multiple shards, and the performance of the stream is scaled based on the number of shards assigned to it.
  - Reads per second: up to 5
  - Read Data Size per second: up to 2 MB
  - Writes per second: up to 1000
  - Write Data Size per second: up to 1 MB
- Partition Key: Amazon Kinesis uses partition keys to group data records into shards. The producer needs to provide the Partition Key while putting a data record into the stream.
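To illustrate how a partition key ends up on a shard, here is a minimal Python sketch. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The evenly split ranges, the big-endian interpretation of the digest and all function names are our assumptions for illustration. The real ranges of a stream are reported by the Kinesis API.

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 produces 128-bit values


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count: int) -> list[tuple[int, int]]:
    """Split the hash key space into contiguous, evenly sized ranges (assumption)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    step = HASH_KEY_SPACE // shard_count
    ranges = []
    for index in range(shard_count):
        start = index * step
        # The last shard takes the remainder so the whole key space is covered.
        end = HASH_KEY_SPACE - 1 if index == shard_count - 1 else (index + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges: list[tuple[int, int]]) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash key is not covered by any shard range")


if __name__ == "__main__":
    shard_ranges = even_shard_ranges(4)
    for key in ("device-1", "device-2", "device-3"):  # hypothetical partition keys
        print(key, "-> shard", shard_for_key(key, shard_ranges))
```

Records with the same partition key always land on the same shard, so a key with a lot of traffic can exhaust the limits of a single shard while others stay idle.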
The closest we get when comparing Apache Kafka to Amazon Kinesis is the mapping of a Kafka Topic to a Kinesis Data Stream, where the latter scales performance according to the previously described shard system. As Amazon Kinesis is a fully managed service, AWS handles the infrastructure needed to support your provisioned performance as well as the reliability of a given Kinesis Data Stream.
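Based on the per-shard limits listed above, a rough shard estimate for a stream can be sketched in a few lines of Python. The function name and the example workload are assumptions for illustration, and the estimate ignores details such as the reads-per-second limit and uneven partition key distribution.

```python
import math

# Per-shard limits as listed in this post.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def required_shards(write_mb_per_s: float, write_records_per_s: float, read_mb_per_s: float) -> int:
    """Smallest shard count that satisfies all three throughput limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(write_records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )


if __name__ == "__main__":
    # Hypothetical workload: 5 MB/s in, 2000 records/s in, 8 MB/s out.
    print(required_shards(5, 2000, 8))
```

The most demanding dimension decides the shard count: a workload of many small records can need more shards than its data volume alone suggests.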
Conclusion
When it comes to building and operating a reliable data streaming platform on the AWS cloud, we are faced with three major options to do so:
- Build an Apache Kafka cluster from scratch, including all the configuration effort as well as the effort to manage a second cluster of Apache ZooKeeper within our environment.
- Use Amazon MSK to provision a managed Apache Kafka cluster where AWS handles all the heavy lifting of setting up and operating the needed infrastructure, leaving us with the configuration of Apache Kafka and the fine-tuning of the AWS presets to our needs.
- Use Amazon Kinesis Data Streams as a cloud-native alternative to build streaming applications, where we simply provision the desired performance for our application and let AWS handle everything from the infrastructure to the configuration.
Note: With Amazon Kinesis, authentication and authorization for your Data Streams are integrated with and handled by AWS IAM. There is no individual endpoint for each Kinesis Data Stream; instead, you interact with your data streams through the AWS API. If you want private connectivity from your VPC to the Amazon Kinesis API, a VPC Interface Endpoint is available as well.
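As a minimal sketch of what interacting through the AWS API looks like, the following Python function writes one record to a stream. It assumes a client object exposing the `put_record` call of the boto3 Kinesis client (for example created with `boto3.client("kinesis")`, which picks up IAM credentials from the environment). The function name, stream name and event payload are hypothetical.

```python
import json
from typing import Any


def send_event(client: Any, stream_name: str, event: dict, partition_key: str) -> dict:
    """Put a single JSON-encoded record into a Kinesis Data Stream."""
    return client.put_record(
        StreamName=stream_name,                   # the stream is addressed by name
        Data=json.dumps(event).encode("utf-8"),   # the payload is sent as bytes
        PartitionKey=partition_key,               # decides which shard receives the record
    )


# Usage, assuming boto3 is installed and IAM credentials are configured:
#   import boto3
#   client = boto3.client("kinesis")
#   send_event(client, "demo-stream", {"sensor": "a1", "value": 21.5}, "a1")
```

Because the client is passed in, the same function can be exercised with a stub in tests without touching AWS.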
With these three options, it comes down to trying them out and deciding which is the best solution for our specific use case. The good part is that you can find a simple deployment in our GitHub repository to get you started, and you can easily tear down the infrastructure once you are done testing. If you have any further questions regarding data streaming on AWS, or you are building a new application and are not sure which way to go, you can also reach out to me and we can have a quick chat about your data streaming needs.