By on August 13, 2019

Building Metrics Pipeline for High-Performance Data Collection

This is part three in a series discussing the metrics pipeline powering Kong Cloud.

In previous posts in this series, we’ve discussed how Kong Cloud collects, ships, and stores high volumes of metrics and time-series data. We’ve described the difference between push and pull models of collecting metrics data, and looked at the benefits and drawbacks of each from a manageability and performance perspective. 


This post will detail how our engineering team studied the integration of our metrics collection code with the services that actually process customer traffic, ensuring that our data collection was effective, correct, and most importantly, did not interfere with user traffic in terms of performance or latency.

Kong Cloud’s platform is powered at the edge by a tier of OpenResty servers. To briefly recap, OpenResty is a bundle of the venerable Nginx web server, and LuaJIT, providing a stable, high-performance web application and proxy platform. Leveraging OpenResty at the edge allows us to write complex business logic (e.g., data transformation, communicating with third party services, etc) in a very efficient manner. Plumbing code related to mundane but critical tasks, like memory management socket lifecycle handling, is delegated to the core OpenResty codebase, letting us focus on the correctness of our business logic. This paradigm lets us ship changes and new features to our edge tier in a matter of minutes, not the weeks or months that writing native Nginx module code would require.

That said, OpenResty’s safety and stability does not mean that we can be blasé about the code we ship. Any new logic introduced at the edge of our infrastructure introduces complexity that needs to be validated. This validation extends beyond logic correctness – we also need to ensure that the code we ship doesn’t introduce unnecessary latency. Keeping latency at a minimum at our edge tier allows us to maintain a minimally-sized fleet of edge machines in each datacenter, reducing cost and operational complexity.

We kept these principles closely in mind while developing the logging code used by edge tier machines as part of our metrics pipeline. Our previous blog post describes the basic architecture we use in OpenResty to store and periodically ship metrics data. The implementation itself relied on code patterns that were designed to leverage OpenResty APIs that are known to be friendly to the JIT compiler (e.g., including the resty.core library). We made a deliberate effort to avoid relying on data structures that could only be iterated by Lua code that can not be JIT compiled, such as the infamous pairs built-in function and used FFI data structures to reduce memory footprint.

Following the first pass of development, we took the opportunity to profile our logging library in a lab environment to gauge the expected performance hit it would introduce in our environment. We first set up a lab environment with a single Nginx worker process serving a local file from disk, with no Lua phase handlers or custom business logic:


$ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html

Running 5s test @ http://localhost:8080/index.html

  4 threads and 20 connections

  Thread Stats   Avg Stdev     Max +/- Stdev

    Latency   218.65us 78.51us   7.68ms 96.75%

    Req/Sec    23.08k 1.43k   24.49k 91.18%

  468517 requests in 5.10s, 360.11MB read

Requests/sec:  91864.85

Transfer/sec:     70.61MB


The idea here was simply to get a rough baseline of where we’d expect to see from Nginx’s throughput and client latency. The raw numbers themselves here aren’t as important as the delta in values we’ll see in future tests. The change ratio gave us an idea of the overall impact of the running code, given the parameters of the requested benchmark.


Next, we setup the same benchmark, with the addition of an empty Lua log phase handler:


# nginx config


  -- nothing to see here!



$ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html

Running 5s test @ http://localhost:8080/index.html

  4 threads and 20 connections

  Thread Stats   Avg Stdev     Max +/- Stdev

    Latency   263.71us 286.15us  12.47ms 99.59%

    Req/Sec    19.73k 4.13k   76.84k 99.50%

  394735 requests in 5.10s, 303.40MB read

Requests/sec:  77402.86

Transfer/sec:     59.49MB


This gave us a rough idea of how much impact simply running the Lua VM would require. We saw about a 15% drop in max throughput in this test. Again, the idea here was not to perform a highly rigorous experiment, but rather to get a few simple benchmarks to make sure the inclusion of our logging library wouldn’t cause unexpected drops in performance.


We set up the benchmark for a third run, this time with our logging library included:

# nginx config

log_by_lua_block {

  local logger = require “edge.logger”




 $ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html

Running 5s test @ http://localhost:8080/index.html

  4 threads and 20 connections

  Thread Stats   Avg Stdev     Max +/- Stdev

    Latency   392.11us 219.26us  14.60ms 98.21%

    Req/Sec    13.03k 697.97    13.61k 95.10%

  264432 requests in 5.10s, 203.25MB read

Requests/sec:  51852.36

Transfer/sec:     39.85MB


Well, that was unexpected. We knew that we were going to see some drop in throughput, but running at little more than half the raw request rate of Nginx was indicative of a real bottleneck in the logging library. After a fruitless few hours of eyeballing the code looking for bad patterns, we turned to LuaJIT’s profiling tools. We ran the benchmark again with LuaJIT’s jit.dump module enabled, giving us a verbose data set detailing the JIT compiler’s behavior. JIT dumps are a useful tool, dumping out Lua bytecode, intermediate representation (IR), and generated machine code for all traces the JIT compiler generates. When the compiler cannot generate a trace, the profiler will spit out the failure reason at the point of trace abort. For example, when LuaJIT tried to compile a trace that relied on using pairs to iterate over a table, it generated the following output (most of it removed for brevity):


0000  . FUNCC               ; ipairs_aux

0066  JITERL  16 21     (logger.lua:497)

0067  GGET   13 15     ; “pairs”     (logger.lua:510)

0068  TGETV   14 5 9       (logger.lua:510)

0069  CALL   13 4 2       (logger.lua:510)

0000  . FUNCC               ; pairs

0070  ISNEXT  16 => 0093       (logger.lua:510)

—- TRACE 22 abort logger.lua:510 — NYI: bytecode 72


This particular trace wasn’t concerning, as it ran on a code path that executed infrequently on a background timer, so it wasn’t responsible for slowing down client requests. As we looked through the rest of the dump output, one repeated abort, in particular, caught our eye:

0009  . GGET     4 0 ; "ipairs"       (logger.lua:358)

0010  . MOV     5 2 (logger.lua:358)

0011  . CALL     4 4 2     (logger.lua:358)

0000  . . FUNCC               ; ipairs

0012  . JMP     7 => 0021     (logger.lua:358)

0021  . ITERC    7 3 3     (logger.lua:358)

0000  . . FUNCC               ; ipairs_aux

0022  . JITERL   7 21 (logger.lua:358)

—- TRACE 25 abort logger.lua:359 — inner loop in root trace


This trace was occurring in the hot path (code that was called as part of the logging phase on every request), so we suspected this was at least a substantial part of the slowdown. The cause of failure here was not an operator that couldn’t be compiled, but rather using the ipairs iterator inside of another trace. This was a bit confusing at first- ipairs itself can be compiled by the JITer, so it wasn’t immediately clear why this was failing. Taking a look at the chunk of code where this trace aborted:


350 local function incr_histogram(tbl, tbl_length, histogram, value)

351   if value == nil then

352     return

353   end

354   — increment sum which is at last index

355   tbl[tbl_length – 1] = tbl[tbl_length – 1] + value


357   for i, uplimit in ipairs(histogram) do

358     if value <= uplimit then

359       tbl[i – 1] = tbl[i – 1] + 1

360       return

361     end

362   end

363 end

In this case, a root JIT trace had been created at the start of the incr_histogram function. The for loop over the ipairs iterator on line 357 generated a new side trace, returning out of the builtin ipairs once it found the appropriate histogram bucket to increment. It seemed that return out of ipairs was causing the trace to abort. On a hunch, we replaced the call to ipairs with an integer-based loop:

diff --git a/files/lualib/edge/logger.lua b/files/lualib/edge/logger.lua

index 5860c5b..3b83106 100644

— a/files/lualib/edge/logger.lua

+++ b/files/lualib/edge/logger.lua

@@ -354,8 +354,8 @@ local function incr_histogram(tbl, tbl_length, histogram, value)

   — increment sum which is at last index

   tbl[tbl_length – 1] = tbl[tbl_length – 1] + value


-  for i, uplimit in ipairs(histogram) do

–    if value <= uplimit then

+  for i = 1, tbl_length – 1 do

+    if value <= histogram[i] then

       tbl[i – 1] = tbl[i – 1] + 1




This allows the LuaJIT engine to generate a side trace inside our loop (which, incidentally, may only execute over one iteration), and exit without being aborted by trying to return from within the builtin ipairs. Re-running the benchmark test with this change alone resulted in marked improvement:


 $ wrk -c 20 -d 5s -t4 http://localhost:8080/index.html

Running 5s test @ http://localhost:8080/index.html

  4 threads and 20 connections

  Thread Stats   Avg Stdev     Max +/- Stdev

    Latency   293.55us 94.27us   6.61ms 98.36%

    Req/Sec    17.19k 466.69    17.65k 86.27%

  348846 requests in 5.10s, 268.13MB read

Requests/sec:  68406.82

Transfer/sec:     52.58MB


That’s a substantial improvement both in max throughput and request latency seen by the benchmark client. We were able to corroborate this improvement in performance with an increase in the number of compiled traces reported by SystemTap and the OpenResty stapxx set of tools.

This change highlights the need to thoroughly study and understand the behavior of code. 

Our metrics pipeline is a crucial element of Kong Cloud. It gives us keen insight into the performance of the various Kong clusters we manage. Its position within our infrastructure requires that it be highly performant and stable. As we fleshed out the development of the OpenResty library that delivered metrics from our edge tier, we kept a close eye on performance and compiling behaviors, and were able to leverage LuaJIT’s excellent debug tooling to knockdown performance issues that weren’t caught during code review or integration testing, before we shipped them into production.