In our last blog post in this series, we discussed our journey designing a metrics pipeline for Kong Cloud to ensure the reliability of our SaaS offering. We discussed how we re-architected our production data pipeline using OpenResty to send metrics to Prometheus and saw huge performance gains. We are now able to monitor high traffic volumes in our system using much less compute power, lowering our costs. Decoupling metrics and logging infrastructure has helped us evolve and scale the two systems independently.
In this blog post, we will discuss the production metrics pipeline architecture in detail - covering how metrics are collected, aggregated and processed. We talk about the factors we considered while deciding where to use push and pull philosophies for passing metrics between the components of our pipeline.
We use OpenResty at the Kong Cloud edge to collect request metrics amongst other things. OpenResty executes logic in the context of various "phases" that run throughout the lifecycle of an HTTP request. These phases allow Lua logic to run before, during and after the HTTP request is handled by NGINX. In our edge tier, OpenResty worker processes extract metrics during the log phase of each request they handle by executing Lua code. Workers store these metrics in the local memory of the Lua Virtual Machine (VM). Specifics of the various phases of OpenResty execution are detailed in the lua-nginx-module documentation.
In NGINX with multiple worker processes, each request is handled by one of the worker processes. This means that requests for metrics from a system like Prometheus would be handled by only one of the worker processes, and it would be hard to predict which one that would be. Because each worker process collects metrics on its own, a worker process will only provide data stored in its own memory space and not from other workers.
If the workers could be kept synchronized, it wouldn't matter which one was serving metrics requests. However, the event-driven model of NGINX makes it hard to synchronize all the workers without slowing down the current request, since worker level synchronization requires establishing a mutex across all workers in order to update shared memory.
Another solution would be to aggregate and store metrics in a shared memory zone (a shared dictionary in OpenResty), apart from local worker memory. Aggregating the metrics into this zone presented a problem - highly concurrent levels of traffic. A substantial amount of time was wasted on lock contention as workers contended to update the same statistics.
At first, we tried to work around the architectural limitations of shared memory. We examined two possibilities: we could either increase the counter in shared memory for each request or let workers writeback to the shared memory on a specified interval. Both of the solutions led to performance penalties since each operation on the OpenResty shared dictionary involves a node level lock and thus slowed down the current request. After considering this, we concluded that shared memory using OpenResty shared dictionaries wouldn't be the optimal solution.
A smarter design called for a way to collect metrics from within each worker, without spending cycles on lock contention and without requiring workers to sync their statistics with each other on a periodic interval to be polled by Prometheus.
Metrics systems can be based on either a push or pull model. In our experience, the model to use for monitoring is highly dependent on the type of workload and architecture of the application being monitored. To monitor each of our software components and the infrastructure the component lives on, we needed to tailor our solution to collect metrics effectively using either push, pull or a mixture.
We use Prometheus for monitoring our infrastructure and services. Prometheus is designed with the pull philosophy in mind. It scrapes metrics and needs extra tooling to connect to components that actively push metrics out. Any system that includes it must either rely exclusively on pull, or mix push and pull strategies.
In a pull model, the metrics server requests metrics data from each service within the infrastructure, extracts the metrics and indexes them. Each monitored service provides an interface exposing its metrics for the system to consume.
This requires that the monitor and the services being monitored agree on a pre-existing endpoint by which to serve metrics and requires the monitored service to spend cycles calculating, aggregating and serving metrics data via an additional application interface. For Prometheus, this interface is an HTTP endpoint that serves metrics in exposition formats that Prometheus scrapes on a configured interval.
The cost of additional complexity on the monitored component is balanced by greater flexibility for the metrics server; selectively fetching and ingesting metrics allows a monitoring server to more intelligently handle service overloads, retry logic and selective ingestion of data points. Another advantage of a pull model is that the component doesn't need to know the presence of the monitoring component. Prometheus can use service discovery to find the components it needs to monitor.
Prometheus prefers pull over push generally, but that doesn't necessarily mean we should prefer pull over push when designing our metrics infrastructure.
In the traditional push model (leveraged by tools like StatsD, Graphite, InfluxDB, etc.), the service being monitored sends out metrics to the metrics system itself. Typically this is done over a lightweight protocol (such as a thin UDP wire format in the case of StatsD) to reduce the complexity and load requirement of the monitored system. This is much more straightforward to set up since the monitored service only needs to know about the address of the metrics system in order to stream metrics to it.
Prometheus doesn't support a "pure" push model because components can't write to it directly. To monitor push-model components with Prometheus, the component being monitored actively sends metrics to middleware, which Prometheus then actively scrapes. The middleware essentially acts as a "repeater" that accepts push from a monitored component. Depending on the data format the component is pushing, the "repeater" can be a Pushgateway or StatsD Exporter.
We designed the model we use to be a mix of push and pull, as shown below:
The edge logger is a piece of Lua code that runs on each worker. It has two responsibilities:
Collect metrics for the current request and store them in local worker memory during the log phase
Push the metrics to an edge exporter on a specified interval
The edge logger does worker-level aggregation.
The edge logger collects metadata for every request being proxied. The data can be classified into two data types: counters and histograms.
Counters are incremented whenever an event occurs. At the edge, we record a few metrics which are counter-based:
Number of requests by status codes, components and customers: This information would help us drill down into which customers were experiencing issues if a new roll out ever caused issues. We also count HTTP status codes returned by Kong by customer's upstream and the edge. If these start to diverge, we know there is a problem with one of the components. For example, if a request is being terminated at Kong with a 500 status code, that means that Kong is throwing an error, but if the 500 comes from the customer's upstream server, then the customer needs to take action to resolve it.
Transit egress/ingress: We count the number of bytes sent and received for each customer and type of service (Proxy, Kong Manager, Kong Portal, etc).
Histograms are generated by bucketing events and then counting the events in each bucket. We use them to record timing statistics by configuring buckets of durations, classifying each request in one of those buckets and incrementing a counter for each bucket. A few examples of metrics that we record as histograms are:
Total request duration: We keep track of how long requests are taking from start to finish
Latency added by Kong: This helps us make sure that Kong is performing with acceptable overheads
Latency added by Kong Cloud: This is one of the important service level indicators (SLIs) for us. It indicates how much latency we're adding to requests and allows us to continuously tune and optimize our platform.
Performance tuning and metrics storage
To gain better performance and improve resource utilization, we use a foreign function interface (FFI) to create a C level struct for metrics like latencies and error counters. When the edge logger takes the JSON config of all metrics it's configured to collect, it will also generate the FFI definition of the C struct.
We also use the LuaJIT function table.new to preallocate the table needed to store schemaless metrics like status codes. Since we don't care about rare status codes like 458 or 523, we group all status codes except for those that are vital to us into groups. For example, all status codes larger than 199 and smaller than 300 will be collected at 200, 201, 204 and 2xx. This allows us to much more accurately predict the memory usage and CPU overhead when inserting and updating the value in the table.
Metrics like request size or response duration are stored in NGINX variables. To access them with the logger, we need to use the ngx.var Lua table which holds those metrics. To avoid unnecessary ngx.var lookups, we use a consul-template to render the nginx.conf so that variables like monitored component are hardcoded into the Lua code parameter that calls the logger.
As is discussed before, due to the NGINX worker model, each worker process holds its own instance of metrics data. To aggregate those data across all the workers, each logger actively pushes data from its worker to an exporter. When each worker sends its local metrics data, it also attaches the worker_id value so that the exporter can distinguish the source of metrics.
The exporter is a small application running alongside the OpenResty worker processes. It works similarly to the Pushgateway. It is configured with a config file, defining the metric names and types it is expected to receive from the OpenResty worker processes. The edge exporter supports counters, gauges and histograms at the moment. It translates the metrics it receives from OpenResty worker processes using the Golang client library by Prometheus.
The edge exporter listens to each OpenResty worker process on a port on its loopback interface for metrics. Periodically, each edge logger sends all the metrics it has gathered since its worker process started to the edge exporter. The exporter completely replaces all its metrics for each worker process whenever it receives a new update from it. On the surface, this might seem inefficient, but in reality, this design has been fruitful since this essentially makes edge-exporter stateless and easy to maintain.
We've found that skipping metrics which are essentially empty or zero can help reduce the load on Prometheus. For example, if we are tracking server errors (500 HTTP status codes), we expect them to be zero in production most of the time, so we skip exporting metrics which have zero values. If there's a lot of such "zero" value metrics, then skipping such metrics reduces CPU consumption on Prometheus, since those metrics don't need to be time-stamped, recorded and indexed.
Counter-metric resets can and do occur when OpenResty worker processes are restarted/reloaded. Prometheus queries and functions handle these appropriately in most cases and we don't do anything special on the Edge Exporter side to handle counter resets.