Kong Konnect Advanced Analytics: Running Faster Than StatsD
Using Konnect Advanced Analytics for faster, real-time measurement of what your users are experiencing
Earlier this year, the Kong Konnect Analytics team set out to leverage the stability and flexibility of our own Kong Gateway to handle the entire load of our Analytics ingest firehose. Almost all Konnect API traffic flows through a Kong Gateway, and most of it has from the very beginning. For legacy reasons and protocol differences, our Analytics ingest service was not included in our initial Kong Gateway configuration. Konnect Analytics handles billions of requests every day, and no longer relying on AWS Load Balancers would simplify our infrastructure and save a tremendous amount of money.
Before the migration began, we opened our third-party monitoring dashboard so that we could sanity-check our changes throughout the migration. Since we were moving from AWS to Kong Gateway, we wanted to use our own tools and metrics to watch the deployment, as AWS CloudWatch metrics are notoriously slow and low fidelity.
With a single terraform apply, we flipped our routing in the target region from the AWS Load Balancers to our Kong Gateway endpoint. We sat watching and waiting for our third-party monitoring dashboard to populate and update. After a minute or two, traffic appeared in the metrics, indicating that routing had switched over.
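For readers unfamiliar with this kind of cutover, it can look something like the following Terraform sketch. This is purely illustrative: the resource, hostname, and variable names here are hypothetical, not Kong's actual configuration. The key idea is that a single DNS record change redirects all ingest traffic from the load balancer to the gateway.

```hcl
# Hypothetical DNS record for the Analytics ingest endpoint.
# Flipping the target and running `terraform apply` reroutes all traffic.
resource "aws_route53_record" "analytics_ingest" {
  zone_id = var.zone_id
  name    = "ingest.example.com"
  type    = "CNAME"
  ttl     = 60

  # records = [aws_lb.analytics.dns_name]   # old target: AWS Load Balancer
  records = [var.kong_gateway_endpoint]     # new target: Kong Gateway endpoint
}
```

A low TTL keeps the switchover (and any rollback) fast, since clients re-resolve the record within about a minute.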
Of course, few complete route migrations go exactly as planned the first time around.
While most incoming requests were being serviced correctly, we saw that a small percentage were hitting unexpected errors: one or two every 10 seconds. Our third-party monitoring dashboard only updated once or twice a minute, so while debugging it was painful to wait what felt like ages to see whether a change had worked.
While the engineers were searching our observability stack for a lower-latency metric that could tell us whether we were healthy yet, our engineering manager piped up: "I'm extremely certain we are still experiencing issues," he said.
When asked how he knew, he shared his screen showing our own Konnect Advanced Analytics graph. It showed updated data within 10 seconds of changes being applied. Not only did updates appear faster in Konnect Analytics, but our Kong Gateway is the first thing an external request touches. In other words, the Kong Gateway is an accurate measurement and representation of what our users are seeing. If one or two errors popped up every 10 seconds, then we knew for certain that some user or Kong Gateway was hitting that exact error.
After diving in and debugging, we discovered that a single connection was responsible for the errors appearing every 10 seconds. We narrowed it down to a bug in our Kong Gateway configuration and rolled back.
By this point, no one was looking at our infrastructure observability dashboard anymore. We were all watching the Konnect Advanced Analytics pages to determine whether we were back in a healthy, non-degraded state. Once the errors stopped appearing in Advanced Analytics, we were confident that the revert had succeeded and all of our connections were healthy again.
In our incident post-mortem, we rebuilt the story using metrics we had collected in our observability tool. This proved more complex than anticipated due to the number of similarly named metrics emitted by different services in our cluster. "Where was this metric being measured? Does this metric count inter-cluster requests?" Sometimes it required digging into the actual application code and searching for the exact name of the metric to determine its origin.
Network topology is another key piece of context missing from a metric. An API request that is terminated early is never forwarded to the upstream application, so it is never recorded by the metric that application emits. The nice thing about measuring request metrics at the edge of our cluster is that it is trivial to know exactly what our users are experiencing.
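The undercounting problem can be made concrete with a toy model. Nothing here is Kong code; it just shows that a gateway-side counter sees every request, while an upstream-side counter only sees the requests that made it past the edge.

```python
from collections import Counter

def handle(requests, authorized):
    """Toy request pipeline: the gateway counts every request;
    the upstream only counts requests the gateway forwards."""
    gateway_metrics = Counter()
    upstream_metrics = Counter()
    for req in requests:
        gateway_metrics["total"] += 1
        if req not in authorized:
            gateway_metrics["rejected"] += 1  # terminated at the edge
            continue                          # never reaches the upstream
        upstream_metrics["total"] += 1        # only forwarded requests land here
    return gateway_metrics, upstream_metrics

gw, up = handle(["a", "b", "c", "d"], authorized={"a", "b"})
print(gw["total"], gw["rejected"], up["total"])  # 4 2 2
```

The upstream's metric reports two successful requests and nothing else, while the gateway's metric shows that half of all user traffic was rejected. Only the edge measurement reflects the real user experience.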
Because the gateways we ship here at Kong are built for stability and resilience to network failures, none of our customers experienced degraded service or Analytics data loss. The Kong Gateway is designed to keep working through periods of network or Konnect service instability. Once our endpoint was fully available again, all the Analytics data that had accrued during the degraded period was delivered to Konnect.
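The buffer-and-flush behavior described above can be sketched in a few lines. This is an illustrative model of the general technique, not the Gateway's actual implementation: records accumulate locally while the endpoint is unreachable and are delivered, in order, once it recovers.

```python
class AnalyticsBuffer:
    """Toy model: queue analytics records while the endpoint is
    unreachable; flush the backlog once it recovers."""

    def __init__(self, send):
        self.send = send      # delivery callable; raises ConnectionError on failure
        self.pending = []

    def record(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return                    # endpoint down: keep buffering
            self.pending.pop(0)           # delivered: drop from the buffer

# Simulate an outage: sends fail until the endpoint recovers.
delivered = []
down = True

def send(event):
    if down:
        raise ConnectionError("endpoint unavailable")
    delivered.append(event)

buf = AnalyticsBuffer(send)
for e in ["req-1", "req-2", "req-3"]:
    buf.record(e)             # all three are buffered during the outage
down = False
buf.flush()                   # recovery: the backlog is delivered in order
print(delivered)              # ['req-1', 'req-2', 'req-3']
```

A production implementation would also need bounded buffers, batching, and persistence across restarts, but the core guarantee is the same: a temporarily unavailable endpoint delays delivery rather than losing data.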
While the migration itself ended in a rollback and felt like a minor setback, we came away energized: not only is what we have built robust, but we had solid evidence that we were operating with faster real-time observability than the other tools out there. When the cost of a mistake is data loss or downtime, you want a real-time system that can tell you that story as fast as possible.
Unleash the power of APIs with Kong Konnect
