Building End-to-End Observability in Kong Konnect Infrastructure
As infrastructure becomes more distributed, building observability around it becomes crucial. With the rise of microservices architecture, teams need observability built into the architecture itself to see how their systems behave.
In this blog post, we’ll explain how we unified the three pillars of observability at Kong to build a better observability platform around Kong Konnect’s infrastructure.
Kong Konnect is an end-to-end SaaS API lifecycle management platform designed for the cloud native era. It provides the easiest way to get started with Kong Gateway, the world’s fastest and most adopted API gateway. The global control plane is hosted in the cloud by Kong, while the runtime engine — Kong Gateway — runs within your preferred network environment.
The entire Konnect infrastructure runs in the public cloud, and we’ve embraced microservices architecture from day one. As Konnect grows, new services keep emerging, so building a scalable observability platform is one of our top priorities.
We want our engineering and CRE teams to troubleshoot issues quickly and mitigate them before they become widespread. We’re also expanding to multiple geographies and will soon support multiple regions in each geography, so observability across the entire infrastructure is essential.
Figure 1: Konnect Infra regional overview
The 3 Pillars of Observability
Metrics, logs, and traces are often called the three pillars of observability. By connecting these three pillars, we can identify issues faster. We’ve done a lot of work to connect them and build a unified observability platform for Konnect using Datadog.
It all starts with tags.
All the Datadog agents running in our EKS clusters add default tags (like AWS region, environment name, etc.) to every piece of telemetry they collect and ship to Datadog, whether it’s logs, metrics, traces, or events. These tags help teams navigate the telemetry data in the Datadog UI and build appropriate monitors for different production regions across geographical locations.
On top of the infra-level tags, we add metadata tags. These include service-mesh-specific tags (which logical mesh the telemetry came from) and workload-specific tags (service metadata such as service name and service version) to identify telemetry sources.
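The tag layering described above can be pictured with a small sketch. This is only an illustration and not our agent configuration: the tag keys and values and the `build_tags` helper are hypothetical, and letting later layers override earlier ones on a key collision is an assumption made for the example.

```python
from typing import Dict, List


def build_tags(infra: Dict[str, str],
               mesh: Dict[str, str],
               workload: Dict[str, str]) -> List[str]:
    """Merge the three tag layers into Datadog-style "key:value" strings.

    Assumption for this sketch: on a key collision the more specific
    layer wins (workload over mesh, mesh over infra).
    """
    merged = {**infra, **mesh, **workload}
    # Sort so the same inputs always yield the same tag list.
    return sorted(f"{key}:{value}" for key, value in merged.items())


# Hypothetical values, for illustration only.
infra_tags = {"region": "us-east-2", "env": "prod"}
mesh_tags = {"mesh": "default"}
workload_tags = {"service": "billing", "version": "1.4.2"}

for tag in build_tags(infra_tags, mesh_tags, workload_tags):
    print(tag)
```

Because every signal carries the same tag set, a filter such as a given region plus service name selects matching logs, metrics, and traces in one step.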
Figure 2: How we add trace ID to the Kong Gateway access logs
Figure 3: How we add trace ID to Kong Mesh Dataplane access logs
Figure 4: How we correlate traces and logs together to view the entire lifecycle of an incoming user request
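To make the idea behind these figures concrete, here is a minimal sketch of why a shared trace ID is useful. It is not the Kong Gateway or Kong Mesh logging configuration: the JSON layout, the `trace_id` field name, and both helper functions are assumptions made for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, List


def access_log_line(source: str, method: str, path: str,
                    status: int, trace_id: str) -> str:
    """Render one access log entry as JSON, including the trace ID."""
    return json.dumps({
        "source": source,      # which layer wrote the line
        "method": method,
        "path": path,
        "status": status,
        "trace_id": trace_id,  # hypothetical field name
    })


def group_by_trace(lines: List[str]) -> Dict[str, List[dict]]:
    """Group parsed log records by trace ID, keeping arrival order."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        grouped[record.get("trace_id", "")].append(record)
    return dict(grouped)


# Hypothetical log lines from two layers handling the same request.
lines = [
    access_log_line("kong-gateway", "GET", "/v1/users", 200, "abc"),
    access_log_line("mesh-dataplane", "GET", "/v1/users", 200, "abc"),
    access_log_line("kong-gateway", "POST", "/v1/orders", 502, "def"),
]
for trace_id, records in group_by_trace(lines).items():
    print(trace_id, [r["source"] for r in records])
```

Once each layer writes the same ID, looking up one trace returns every log line the request produced along its path.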
Metrics for better runtime visibility
In Konnect, we collect runtime metrics from every entity, whether it’s the customer-facing LB/Kong Gateway, the service mesh, or backend entities like our Aurora databases. We then use these metrics for a variety of use cases: (1) building overview dashboards to see how different components perform, (2) creating infra/service alerts, and (3) building SLO dashboards for our weekly service reviews. All the metrics are ingested with the necessary metadata tags mentioned above.
Kong Mesh overview dashboard
Konnect Kong Mesh SLO dashboard
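As a rough illustration of the arithmetic an SLO dashboard typically sits on, the sketch below computes an availability ratio and the remaining error budget from request counts. The formulas are the standard ones and are not taken from our dashboards; the function names, the 99.9% target, and the sample counts are all assumptions.

```python
def availability(good: int, total: int) -> float:
    """Fraction of requests that met the objective (1.0 with no traffic)."""
    if total == 0:
        return 1.0
    return good / total


def error_budget_remaining(good: int, total: int, target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means nothing is spent, 0.0 means fully spent, and a negative
    value means the SLO target has been missed for this window.
    """
    allowed_bad = total * (1 - target)  # failures the target tolerates
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad


# Hypothetical numbers: 100,000 requests, 25 failures, 99.9% target.
print(availability(99_975, 100_000))
print(round(error_budget_remaining(99_975, 100_000, 0.999), 4))
```

A view like this makes a weekly review quick: a service that is burning its budget faster than the window elapses stands out immediately.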
Tracing the traffic
In Konnect, we use distributed tracing to view the entire lifecycle of a request. This helps us to find the bottlenecks.
We have tracing enabled at every layer of our infrastructure, from Kong Gateway through our service mesh (which traces the east-west traffic) to the services themselves. Engineering and CRE teams already use distributed traces to debug both customer and internal connectivity issues, which helps them identify and mitigate problems quickly.
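For a trace to span every layer, each hop has to pass the trace identifier on to the next one. The sketch below shows that mechanism using the W3C Trace Context `traceparent` header format. Which propagation format Konnect actually uses is not stated here, so treat the header choice as an assumption; `next_hop_headers` is a hypothetical helper, and the sketch skips some checks the specification requires (for example, rejecting all-zero IDs).

```python
import re
import secrets
from typing import Dict

# traceparent layout: version - trace-id (32 hex) - parent-id (16 hex) - flags
TRACEPARENT_RE = re.compile(
    r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def next_hop_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build the outgoing header for the next hop.

    Keeps the incoming trace ID so all hops share one trace, and
    generates a fresh span ID for this hop. Starts a new sampled
    trace when the incoming header is absent or malformed.
    """
    match = TRACEPARENT_RE.match(incoming.get("traceparent", ""))
    if match:
        trace_id, _parent_id, flags = match.groups()
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # 16 hex characters
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


# Simulated path: gateway -> mesh sidecar -> service.
gateway = next_hop_headers({})
mesh = next_hop_headers(gateway)
service = next_hop_headers(mesh)
for hop in (gateway, mesh, service):
    print(hop["traceparent"])
```

The trace ID stays constant across the three printed headers while the span ID changes at each hop, which is what lets a tracing backend stitch the hops into one request timeline.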
Building an end-to-end observability platform is a key requirement for every distributed infrastructure. It helps teams monitor, investigate, and catch issues before they cause widespread impact to customers. By connecting the three pillars of observability, we can make debugging issues faster (and fun).