From Kinesis to Kafka
Real-time Data Analytics Platform from Zero to One
At the beginning of 2021, a brand-new data team was assembled to build a real-time data platform for Kong’s SaaS platform, Konnect. Our mission is to provide top-notch real-time API analytics features for Konnect customers.
The initial backend consisted of three main components: an ingestion service to handle incoming telemetry requests, an intermediate message queue, and a real-time OLAP data storage.
After evaluating multiple technologies, we settled on Apache Druid for our backend data storage. For streaming ingestion, Druid supports both AWS Kinesis and Kafka. We chose the former (Kinesis) as our message queue for the following reasons:
- We prefer fully managed services to minimize operational overhead.
- We wanted a solution running in production quickly, and Kinesis is easier and faster to set up.
- Some team members had prior experience with Kinesis.
Within two months, we successfully rolled out the V1 version of the platform to production.
The Inception of Kafka Migration
Once we had our initial data pipeline running in production, we started working on additional capabilities to support new use cases. We introduced a new data enhancement service in our stack, responsible for extending the original dataset with additional metadata.
As time went on, we encountered some pain points and limitations that prompted us to look for better solutions:
- Unlike Kafka, Kinesis doesn’t support transactional writes across multiple streams, without which it is impossible to fully implement exactly-once semantics (see the sketch after this list).
- We use Go for building our backend services, but there are not many actively maintained Go Kinesis libraries in the community. In the end, we chose Kinsumer after comparing the different options.
- Kinsumer provides two options for storing Kinesis offsets: DynamoDB and RDS. Although DynamoDB is very easy to set up, we would still rather not manage additional infrastructure resources.
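To illustrate what we were missing, here is a minimal sketch of a multi-topic transactional write in Kafka. It uses the transactional producer API from confluent-kafka-go (shown here because it exposes a high-level transactions API); the broker address, transactional ID, topic names, and payload are placeholders, not our production setup.

package main

import (
    "context"

    "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    // A transactional producer needs a stable transactional.id.
    p, err := kafka.NewProducer(&kafka.ConfigMap{
        "bootstrap.servers": "localhost:9092",   // placeholder broker
        "transactional.id":  "ingest-service-1", // placeholder ID
    })
    if err != nil {
        panic(err)
    }
    defer p.Close()

    ctx := context.Background()
    if err := p.InitTransactions(ctx); err != nil {
        panic(err)
    }
    if err := p.BeginTransaction(); err != nil {
        panic(err)
    }

    // Writes to two different topics commit or abort atomically,
    // which is the building block for exactly-once pipelines.
    for _, topic := range []string{"requests", "request-metadata"} { // placeholder topics
        t := topic
        if err := p.Produce(&kafka.Message{
            TopicPartition: kafka.TopicPartition{Topic: &t, Partition: kafka.PartitionAny},
            Value:          []byte(`{"example": true}`),
        }, nil); err != nil {
            _ = p.AbortTransaction(ctx)
            panic(err)
        }
    }

    if err := p.CommitTransaction(ctx); err != nil {
        panic(err)
    }
}

Kinesis has no equivalent commit/abort boundary, so a failure between two PutRecord calls can leave the streams partially written.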
Furthermore, as Konnect became more sophisticated as a SaaS product, other teams also had new use cases that required a message queue in the overall architecture. The company reached the point where it would be more cost-effective to run a Kafka cluster ourselves, so our SRE team started offering Kafka as a shared service for all internal teams. We quickly decided it was a good time for us to move away from Kinesis and switch to the internal Kafka cluster.
AWS MSK Integration
Integrating AWS MSK with our Druid cluster required more infrastructure setup than Kinesis did, for two main reasons. First, our Druid cluster lives in a separate AWS account because we use a third party to manage the cluster, while our Kafka cluster lives in our main AWS account. Second, although MSK is a managed AWS service, the Kafka cluster is still deployed in our own VPC, and additional network setup is required for our Druid cluster to access it.
MSK Private Link Setup
There are several connectivity patterns for accessing AWS MSK clusters. In our case, we needed to bridge two VPCs across two different AWS accounts, and AWS PrivateLink was the most suitable solution for our network setup. Luckily, the end-to-end setup is well explained in an AWS blog post, so we didn’t have too much trouble getting everything wired up. We wanted to avoid modifying the brokers’ advertised ports, as pointed out in the post, so we picked the pattern that fronts each broker with a dedicated interface endpoint.
MSK Authentication Setup
AWS IAM is used extensively in our infrastructure to enforce access control, and a dedicated IAM role was created for our Druid cluster to access a set of topics. Druid has an explicit option for configuring an AWS IAM role for Kinesis ingestion jobs; however, this part is not well documented for Kafka ingestion jobs.
There are two key steps to setting up AWS IAM authentication for Druid Kafka ingestion jobs:
- Installing the additional AWS MSK IAM jar on Druid nodes. The jar needs to be placed inside the druid-kafka-indexing-service extension directory.
- Configuring the Kafka ingestion job to use the right IAM role.
We use the script below to install the additional jar file as part of the node bootstrap process.
# Download the AWS MSK IAM auth jar into the Druid Kafka indexing
# service extension directory, skipping the download if the jar is
# already in place.
MSK_JAR=aws-msk-iam-auth-1.1.4-all.jar
JAR_REMOTE_REPO=https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.4
KAFKA_EXT_DIR=/opt/extensions/druid-kafka-indexing-service
JAR_REMOTE_LOC=$JAR_REMOTE_REPO/$MSK_JAR
JAR_LOCAL_LOC=$KAFKA_EXT_DIR/$MSK_JAR

if ! [[ -f "${JAR_LOCAL_LOC}" ]]; then
    wget "${JAR_REMOTE_LOC}" -P "${KAFKA_EXT_DIR}"
fi
Here is a sample consumer property configuration for the Druid Kafka ingestion job.
"consumerProperties": {
"bootstrap.servers": "",
"security.protocol": "SASL_SSL",
"sasl.mechanism": "AWS_MSK_IAM",
"sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn='arn:aws:iam::<account>:role/<role name>';",
"sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler"
},
Zero Downtime Live Data Migration
Our goal was to have zero downtime for our services during the data migration process. To that end, we opted for a double-writing strategy: we double-wrote incoming data to both Kinesis and Kafka, and we created a separate Kafka data source in Druid for each corresponding Kinesis data source.
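Here is a simplified sketch of the double-writing path, using the AWS SDK for Go v2 and segmentio/kafka-go; the stream name, topic name, broker address, and record key are placeholders, not our actual configuration.

package main

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/kinesis"
    kafkago "github.com/segmentio/kafka-go"
)

// doubleWrite publishes one record to both Kinesis and Kafka so the
// two pipelines stay in sync during the migration window.
func doubleWrite(ctx context.Context, kc *kinesis.Client, kw *kafkago.Writer, key string, payload []byte) error {
    if _, err := kc.PutRecord(ctx, &kinesis.PutRecordInput{
        StreamName:   aws.String("telemetry"), // placeholder stream name
        PartitionKey: aws.String(key),
        Data:         payload,
    }); err != nil {
        return err
    }
    return kw.WriteMessages(ctx, kafkago.Message{
        Key:   []byte(key),
        Value: payload,
    })
}

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        panic(err)
    }
    kc := kinesis.NewFromConfig(cfg)
    kw := &kafkago.Writer{
        Addr:  kafkago.TCP("localhost:9092"), // placeholder broker address
        Topic: "telemetry",                   // placeholder topic name
    }
    defer kw.Close()

    if err := doubleWrite(ctx, kc, kw, "request-id-123", []byte(`{"example": true}`)); err != nil {
        panic(err)
    }
}

The two writes are not atomic, but that is acceptable for a migration window in which data parity is verified before the cutover.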
Druid Data Migration
To make sure we had data parity between the existing Kinesis data sources and the new Kafka data sources, we followed the migration steps below for Druid:
- Pick a specific date X as our cutoff point
- Drop all the existing data in the Kafka data source up until X. This can be easily achieved in Druid by marking all segments before X as unused (see the sketch below).
- Replicate all data from the Kinesis data source to the Kafka data source up until X. We ran a single batch indexing job per data source for this migration.
The two images below illustrate the overall migration process for our Druid data sources.
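To make step 2 concrete, segments can be marked unused through the Druid Coordinator API. Below is a sketch of that call from Go, assuming a recent Druid version; the Coordinator address, data source name, and cutoff interval are placeholders.

package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    coordinator := "http://localhost:8081" // placeholder Coordinator address
    dataSource := "telemetry-kafka"        // placeholder Kafka data source name

    // Mark every segment before the cutoff date X (2022-01-01 here,
    // as a placeholder) as unused.
    body := []byte(`{"interval": "1000-01-01/2022-01-01"}`)
    url := fmt.Sprintf("%s/druid/coordinator/v1/datasources/%s/markUnused", coordinator, dataSource)

    resp, err := http.Post(url, "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}

Step 3 can then run as a Druid native batch indexing task whose input source reads the Kinesis-backed data source over the same interval.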
Feature Flag Controlled UI Rollout
On the frontend, we added a feature flag to switch between the Kinesis data sources and the Kafka data sources. The flag allowed us to test the new pipeline with our internal organizations before carefully rolling it out to all customers.
Overall, the migration went very smoothly for us. The whole migration process took roughly two months from beginning to end.
Into the New Era
We have been running our Kafka ingestion in production for some time now and have found several things we really like, including:
- kafka-go from Segment is much better than the Kinsumer library, with more active development and community support (see the consumer sketch after this list).
- Kafka-UI allows us to examine individual messages in any Kafka topic, which is a productivity boost during development.
- For local development, we don’t have to depend on an external service like AWS Kinesis, since we can easily spin up a local Kafka cluster.
- Better resource utilization. Kinesis is charged based on the number of shards in each Kinesis stream, while MSK cost mostly depends on provisioned compute and storage, regardless of how many topics and partitions you have. As a result, we have realized higher resource utilization and better engineering efficiency by leveraging the multi-tenant nature of AWS MSK.
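As a small example of why kafka-go has been pleasant to work with, a consumer group takes only a few lines; the broker address, group ID, and topic below are placeholders. With a group ID set, offsets are committed back to Kafka itself, so no external offset store like DynamoDB or RDS is needed.

package main

import (
    "context"
    "fmt"

    "github.com/segmentio/kafka-go"
)

func main() {
    // A consumer-group reader; offsets are committed to Kafka,
    // not to an external store.
    r := kafka.NewReader(kafka.ReaderConfig{
        Brokers: []string{"localhost:9092"}, // placeholder broker address
        GroupID: "analytics-consumer",       // placeholder consumer group
        Topic:   "telemetry",                // placeholder topic name
    })
    defer r.Close()

    for {
        m, err := r.ReadMessage(context.Background())
        if err != nil {
            break // reader closed or fatal error
        }
        fmt.Printf("partition=%d offset=%d value=%s\n", m.Partition, m.Offset, m.Value)
    }
}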