At the beginning of 2021, a brand new data team was assembled to build a real-time data platform for Kong’s SaaS platform, Konnect. Our mission is to provide top-notch real-time API analytics features for Konnect customers.
The initial backend consisted of three main components: an ingestion service to handle incoming telemetry requests, an intermediate message queue, and a real-time OLAP data store.
After evaluating multiple technologies, we settled on Apache Druid for our backend data storage. For streaming ingestion, Druid supports both AWS Kinesis and Kafka. We chose the former, Kinesis, as our message queue for the following reasons:
In two months, we successfully rolled out V1 of the platform to production.
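For context, streaming ingestion in this V1 pipeline is driven by a Druid Kinesis supervisor spec. A minimal sketch of its ioConfig is shown below; the stream name, region, and tuning values are illustrative placeholders, not our actual configuration.

"ioConfig": {
  "type": "kinesis",
  "stream": "<telemetry stream name>",
  "endpoint": "kinesis.<region>.amazonaws.com",
  "inputFormat": { "type": "json" },
  "taskCount": 2,
  "replicas": 1,
  "taskDuration": "PT1H",
  "useEarliestSequenceNumber": false
}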
Once we had our initial data pipeline running in production, we started working on additional capabilities to support new use cases. We introduced a new data enhancement service into our stack, responsible for enriching the original dataset with additional metadata.
As time passed, we encountered some pain points and limitations that prompted us to look for better solutions.
Furthermore, as Konnect matured as a SaaS product, other teams also had new use cases that required a message queue in the overall architecture. The company reached the point where it would be more cost-effective to run a Kafka cluster ourselves, so our SRE team started to offer Kafka as a shared service for all internal teams. We quickly decided it was a good time for us to move away from Kinesis and switch to the internal Kafka cluster.
Integrating AWS MSK with our Druid cluster required more infrastructure setup than Kinesis did, for two main reasons. First, our Druid cluster lives in a separate AWS account because we use a third party to manage it, while our Kafka cluster lives in our main AWS account. Second, although MSK is a managed AWS service, the Kafka cluster is still deployed in our own VPC, and additional network setup is required for our Druid cluster to reach it.
There are several connectivity patterns for accessing AWS MSK clusters. In our case, we needed a setup that bridges two VPCs across two different AWS accounts. We chose AWS PrivateLink as the most suitable solution for our network setup. Luckily, the end-to-end setup was well explained in an AWS blog post, so we didn't have much trouble getting everything wired up. We wanted to avoid modifying the brokers' advertised ports, as pointed out in the post, so we picked the pattern where each broker is fronted by a dedicated interface endpoint.
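For a rough sense of what that per-broker pattern involves, the sketch below shows the main PrivateLink pieces. Every ARN, ID, and service name here is a hypothetical placeholder, and details such as NLB target registration and DNS aliasing follow the AWS post rather than anything specific to our setup.

# MSK account: expose a broker behind its own NLB-backed endpoint service
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns arn:aws:elasticloadbalancing:<region>:<msk account>:loadbalancer/net/msk-broker-1-nlb/<id> \
  --no-acceptance-required

# Druid account: create a dedicated interface endpoint for that broker's endpoint service
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id <druid vpc id> \
  --service-name com.amazonaws.vpce.<region>.<endpoint service id> \
  --subnet-ids <subnet ids> \
  --security-group-ids <security group id>

# A Route 53 private hosted zone in the Druid account then aliases each broker's advertised
# DNS name to its interface endpoint, so the default advertised ports keep working.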
AWS IAM is used extensively in our infrastructure to enforce access control, and a dedicated IAM role was created for our Druid cluster to access a set of topics. There is an explicit option for configuring an AWS IAM role for Kinesis ingestion jobs. However, this part is not well documented for Kafka ingestion jobs.
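To illustrate, a consumer policy for such a role would look roughly like the sketch below, using MSK's IAM access-control actions; the region, account, cluster, topic, and group names are all placeholders.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:DescribeTopic",
        "kafka-cluster:ReadData",
        "kafka-cluster:DescribeGroup",
        "kafka-cluster:AlterGroup"
      ],
      "Resource": [
        "arn:aws:kafka:<region>:<account>:cluster/<cluster name>/<cluster uuid>",
        "arn:aws:kafka:<region>:<account>:topic/<cluster name>/<cluster uuid>/<topic prefix>*",
        "arn:aws:kafka:<region>:<account>:group/<cluster name>/<cluster uuid>/*"
      ]
    }
  ]
}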
There are two key steps for setting up AWS IAM authentication for Druid Kafka ingestion jobs: installing the aws-msk-iam-auth library on the Druid nodes, and configuring the ingestion job's consumer properties to use the AWS_MSK_IAM SASL mechanism.
We use the script below to install the additional JAR file as part of the node bootstrap process.
# Download the MSK IAM auth library into Druid's Kafka indexing service extension directory if it isn't already present.
MSK_JAR=aws-msk-iam-auth-1.1.4-all.jar
JAR_REMOTE_REPO=https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.4
KAFKA_EXT_DIR=/opt/extensions/druid-kafka-indexing-service
JAR_REMOTE_LOC=$JAR_REMOTE_REPO/$MSK_JAR
JAR_LOCAL_LOC=$KAFKA_EXT_DIR/$MSK_JAR

if ! [[ -f "${JAR_LOCAL_LOC}" ]]
then
  wget $JAR_REMOTE_LOC -P $KAFKA_EXT_DIR
fi
Here is a sample consumer property configuration for the Druid Kafka ingestion job.
"consumerProperties": { "bootstrap.servers": "", "security.protocol": "SASL_SSL", "sasl.mechanism": "AWS_MSK_IAM", "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn='arn:aws:iam::<account>:role/<role name>';", "sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler" },
Our goal was zero downtime for our services during the data migration. To that end, we opted for a double-writing strategy: we write incoming data to both Kinesis and Kafka, and we created a separate Kafka data source in Druid for each corresponding Kinesis data source.
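Each new Kafka data source is fed by a supervisor spec whose ioConfig mirrors its Kinesis counterpart. A minimal sketch is below; the topic name and tuning values are placeholders, the consumerProperties block is the one shown earlier, and reading from the earliest offset is just one option for maximizing overlap with the double-written data, depending on topic retention.

"ioConfig": {
  "type": "kafka",
  "topic": "<telemetry topic name>",
  "consumerProperties": { "...": "see the consumer properties above" },
  "inputFormat": { "type": "json" },
  "taskCount": 2,
  "replicas": 1,
  "taskDuration": "PT1H",
  "useEarliestOffset": true
}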
To make sure we had data parity between the existing Kinesis data sources and the new Kafka data sources, we followed the migration steps below for Druid:
The two images below illustrate the overall migration process for our Druid data sources.
On the frontend, we added a feature flag to switch between the Kinesis data sources and the Kafka data sources. The flag allowed us to test the new pipeline with our internal organizations before carefully rolling it out to all customers.
Overall, the migration went very smoothly for us. The whole process took roughly two months from beginning to end.
We have been running our Kafka ingestion in production for some time now and have found several things we really like, including: