Designing a Metrics Pipeline for SaaS at Scale: Kong Cloud Case Study
In this blog, the Kong Cloud Team shares their experience building the metrics infrastructure that supports the Kong Enterprise service control platform as a hosted managed service in the cloud.
Kong is a popular open source API platform, which has been widely adopted around the world. Here at Kong Inc., we believe in building performant products. Kong itself is proof of that, and Kong Cloud is no exception. We’ve been making progress toward developing a cloud provider-agnostic, scalable, hosted and managed service control platform with all the latest and greatest Kong Enterprise features. It delivers the value you would expect from a Kong-as-a-Service solution, letting you focus on building business value instead of maintaining infrastructure.
Monitoring and logging for Kong Cloud’s proof of concept
Kong Cloud is a software-as-a-service (SaaS) offering designed to handle web-scale traffic using modern cloud software. Running a SaaS is not only about running your software itself but also running all the necessary tooling to provide operational observability and efficiency so that you can keep your software up and running well for your customers.
Figure 1: The first iteration of our metrics and logging infrastructure.
We use the rate, error and duration (RED) method to ensure that our cloud services are meeting the service level objectives (SLOs) we’ve set. All traffic entering our network goes through an edge layer – our first layer of defense – where we do TCP terminations, collect metrics, perform logging and tracing and implement firewalling. Telemetry data (type, latency, status codes, errors encountered) for each incoming or outgoing request is recorded, indexed and analyzed using a time-series database and distributed tracing.
An ELK stack and Prometheus make up the logging and monitoring infrastructure respectively. Filebeat forwards all the logs from our edge layer instance. In the first iteration of our metrics pipeline, Logstash groked most of our Service Level Indicator (SLI) metrics and indexed them into Prometheus via StatsD events. Prometheus then alerted our Site Reliability Engineers (SREs) on any anomalous behavior via Alertmanager.
We noticed a problem with our initial design
This seemed like a good first stepping stone, but our stress tests said otherwise. When we compared metrics from different sources, we noticed that they didn’t always agree.
Figure 2: Inconsistent results.
Figure 2 shows the inconsistency between the metrics we collected from different sources during a stress test that we initiated. The top half of the figure shows the requests per second (RPS) at the edge layer, while the bottom half shows the number of TCP connections at the edge layer, which were collected using the Prometheus Node Exporter.
We initiated a burst of requests beginning around 17:55 that ended at 18:15, but the metrics pipeline shows that RPS climbed up from 18:00 and ended later than 19:00. This was a stress test during which our metrics dashboard was lying out loud. We could say this with certainty as the source of traffic was a wrk process in our control. The metrics we observed on the client side did not match the metrics we stored.
We started to look into our bottlenecks and quickly realized this was a bigger problem.
The lift-off approach
The root cause of our problem was our design for getting metrics into Prometheus. To gather metrics, we tweaked Logstash nodes in the ELK pipeline to mutate numeric values present within the log document (latency values, HTTP status codes, etc) into integer values and render StatsD events. The StatsD events were then sent to a Prometheus exporter which took them in, aggregated them and exposed them in Prometheus Exposition Format for Prometheus to consume and alert on.
This dependency of the metrics pipeline on the logging pipeline meant that metrics would be incorrect if the ELK stack had any failures/bottlenecks. Our simple naive design had helped us get off the ground as quickly as possible while we tested our hypothesis on the business and technical side but was not suited for production traffic.
Logs are not the same as metrics
Astute readers might have already noticed the problem with our first approach — logs != metrics. Parsing logs can be expensive and slow. When metrics are derived from logs, they are only available once the logs have been processed. This defeats the purpose of these metrics, since we need them to give us information about the real-time SLIs for our service. For our use-case starting out, it was acceptable for logs to be processed at a slower rate during surges. We never expected them to be real-time. This initial design — re-using a logging pipeline — reduced operational and development cost while we started to set up our infrastructure. It worked well to bootstrap a cluster of services for Kong Cloud but quickly reached its limits (hence, this blog post).
All shortcuts failed
Although we were aware that our metrics pipeline had to be rethought, we spent a little time to see how much we could get out of the initial design using operational and configuration changes.
Improve StatsD since it was the bottleneck
The first pass at stress testing our metrics pipeline highlighted what appeared to be a CPU bottleneck within the StatsD exporter. We spent some time studying and optimizing the exporter but still found that Key Performance Indicator (KPI) data reported by Prometheus did not align with actual traffic patterns. This nevertheless was a huge improvement for us since we still rely on some metrics associated with certain kinds of logs.
Optimize logging Infrastructure
Batching in Filebeat
After making sure that we were not saturating network links, we tuned our Filebeat configuration to make sure that we are batching logs to ship them efficiently over to Logstash nodes.
This further improved the indexing rate but there was not much that we could optimize here. The side effect of this was the metrics emitted by Logstash then spiked up and filled up the UDP buffer at StatsD exporter side periodically.
Improve ES indexing rate
We probably don’t need to get into a discussion of how Elasticsearch is a beast to just run and maintain and scale. We tuned up our instances to make sure we had a good balance of CPU and IO and had enough fire in the cluster to index at a moderate rate. We also tuned up Logstash to do aggressive batching and over-provisioned it, so that wasn’t a bottleneck.
It should be noted that all these improvements gave more than 100 percent improvement in the metrics being reported, but that was not anywhere near to what we expected our traffic rate to be. We had to improve metrics pipeline to record metrics at several thousand RPS to several hundreds of thousands RPS (with a possibility to exceed 1M+ RPS).
Since we knew we weren’t pursuing the right direction, and that optimizing wouldn’t solve the problem at hand, we needed a new homegrown design.
Our design goals
We knew that we needed to rethink how we collected our metrics at the edge. In addition to ensuring correctness and real-time delivery of our metrics, we wanted to solve a few other problems to make our infrastructure easier to scale and make it more resilient. The following are some key design goals we set out to achieve to solve our problems. These are listed in the order of their significance:
Least-possible impact on latency
Kong is designed to serve requests at high throughput and extremely low latencies. Kong adds less than a millisecond to the latency for most requests. When your core product doesn’t add much latency, there is not a lot of scope for any overhead in other supporting services. All our metrics had to be collected with the least amount of impact on latency possible.
In our previous solution, most of the pipeline had to be scaled up linearly as the amount of traffic increased. This was a major operational and cost burden. The new design should not be linearly proportional to the traffic rate. It should scale horizontally along with our edge-tier. One could scale Elasticsearch to a million indexing rate but that’s a cost we are not ready to pay in terms of infrastructure and keeping people around whose sole job is running Elasticsearch.
Minimize blast radius
Separation of our logging and metrics infrastructure and pipeline was another key goal for us. Specifically, a degradation or failures within the logging pipeline should have no impact on the metrics pipeline, and vice versa. With a microservices architecture, it is easy to end up in a dependency hell. We still actively try to keep blast radii of all components to as small as possible.
While we were balancing all of the above concerns, we came up with a few ideas about how we could architect the new solution. We spent time thinking about running Logstash parser only processes alongside NGINX, developing a plugin for Logstash to export metrics more efficiently but these all seemed like operational/configuration hacks as well, which didn’t address the root problem. We had to record the metrics in the application itself — in this case, NGNIX — for a comprehensive picture and a performant solution.
The solution that finally worked came out of Kong itself. We realized that we could use OpenResty (Kong is built on top of OpenResty) to solve all of these problems. We could write Lua code to simplify the data flow by directly sending metrics to Prometheus and avoiding the expensive and slow log parsing and StatsD events.
Figure 3: Our rearchitected metrics pipeline.
To implement our solution we replaced NGINX with Openresty on the edge tier and used the lua-nginx-module to run Lua code that captures metrics and records telemetry data during every request’s log phase. Our code then pushes the metrics to a local aggregator process (written in Go) which in turn exposes them in Prometheus Exposition Format for consumption by Prometheus. This solution reduced the number of components we needed to maintain and is blazingly fast thanks to NGINX and LuaJIT. We will be doing a deep dive into the design and code of our solution in the next blog post so stay tuned for that!
The above solution not only met all our design goals but opened up a gateway to monitor our edge tier itself much better.
Show me the data!
Figure 4 is the same graph as Figure 1 — only now we can see that the metrics reported by our pipeline are same as what we see on the client side.
Figure 4: Consistent results.
Our solution has been running in production for over two months. Our logging infrastructure frequently lags behind in indexing but we are extremely satisfied with the responsiveness of our metrics pipeline.
The code we designed and optimized is not currently generic enough to use for other applications; it caters to our very specific needs and is tightly coupled with how we handle multi-tenancy in Kong Cloud. But our approach itself can be ported for other use cases running at scale.
Instrumenting applications with metrics infrastructure is an investment that pays for itself pretty quickly with the observability that it provides. It should be done early on rather than as an afterthought.
This post is the first in a three-part blog series around Kong Cloud’s metrics pipeline design and implementation. In part two, we will dig into the meat of our implementation and describe how we aggregated metrics with a combination of push and pull approaches to get the best of both worlds.
Till then, may your pager be quiet!
Sign up for updates on Kong Cloud