Kong Cloud has been using StatsD and Prometheus heavily for monitoring and metrics collection. In this blog post, we discuss how we use StatsD and Prometheus on Kong Cloud, a performance problem we found, and the approach we took to solve it.
StatsD is a metrics server that accepts events over UDP or TCP and exports them to various backends. A typical StatsD event looks like this:
host.sfo1.request.count:123|c
Every StatsD event is a string in the format <metricname>:<value>|<type>. The example above represents a metric named host.sfo1.request.count of type counter with a value of 123.
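Emitting such an event is just a small write to a UDP socket. As a minimal illustration in Go (the address, port and metric value here are examples, not values from our pipeline):

package main

import (
    "fmt"
    "net"
)

func main() {
    // Assumed local StatsD address; 8125 is the conventional StatsD port.
    conn, err := net.Dial("udp", "127.0.0.1:8125")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // <metricname>:<value>|<type> -- a counter event like the one above.
    fmt.Fprintf(conn, "host.sfo1.request.count:%d|c", 123)
}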
On Kong Cloud, we use the StatsD Prometheus exporter in our metrics pipeline to measure the KPIs (Key Performance Indicators) of our service. The StatsD Prometheus exporter is a daemon that listens for StatsD events and exports them in the Prometheus exposition format. The exporter has a mapping config that maps StatsD metrics to Prometheus metrics.
With a mapping config in place, each incoming StatsD event is matched against the rules and translated into a Prometheus metric with labels.
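For example, the first rule of the mapping config we use on Kong Cloud (shown in full later in this post) maps the per-client request counter:

mappings:
  - match: kong.*.*.*.request.count
    name: "kong_requests_proxy"
    labels:
      client: "$1"
      job: "kong_metrics"

A StatsD event such as kong.client1.svc.route1.request.count:1|c (an illustrative metric name, not one taken from our pipeline) would then be exported roughly as the Prometheus counter kong_requests_proxy{client="client1",job="kong_metrics"}.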
On Kong Cloud, various StatsD events are generated for each request. When the client request rate climbed to several thousand requests per second, we noticed high CPU usage from the StatsD exporter, which was consuming one and a half cores on an AWS m4.large instance.
To take a closer look, we profiled the StatsD exporter, using the perf tool to sample stacks. This gave us a rough idea of which functions were taking up most of the CPU time. We then used the perf_data_converter tool to convert the perf output perf.data to profile.proto and analyzed the result with pprof.
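For reference, a profiling session along these lines can be reproduced with something like the following; the sampling options, binary names and output paths here are illustrative and may differ from what we actually ran:

# sample on-CPU stacks of the running exporter (illustrative options)
sudo perf record -F 99 -g -p $(pidof statsd_exporter) -- sleep 30

# convert perf.data to pprof's profile.proto format with perf_data_converter
perf_to_profile -i perf.data -o profile.pb

# show the hottest functions
pprof -top profile.pb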
pprof gave us the percentage of CPU time each function took, in descending order:
(pprof) top100
Showing nodes accounting for 311858250000, 83.29% of 374429750000 total
Dropped 505 nodes (cum <= 1872148750)
        flat  flat%   sum%          cum   cum%
 63493500000 16.96% 16.96%  63493500000 16.96%  [[kernel.kallsyms]]
 48018750000 12.82% 29.78%  48018750000 12.82%  regexp.(*machine).onepass
 31408750000  8.39% 38.17%  31408750000  8.39%  regexp/syntax.(*Inst).MatchRunePos
 17480000000  4.67% 42.84%  17480000000  4.67%  runtime.duffcopy
 17274250000  4.61% 47.45%  17274250000  4.61%  regexp/syntax.EmptyOpContext
 15302000000  4.09% 51.54%  15302000000  4.09%  regexp.(*inputString).step
 15244750000  4.07% 55.61%  15244750000  4.07%  runtime.mapiternext
  9875250000  2.64% 58.25%   9875250000  2.64%  runtime.mallocgc
  9310250000  2.49% 60.73%   9310250000  2.49%  regexp.onePassNext
  7658250000  2.05% 62.78%   7658250000  2.05%  runtime.memmove
  7192750000  1.92% 64.70%   7192750000  1.92%  regexp/syntax.(*Inst).MatchRune
  6075750000  1.62% 66.32%   6075750000  1.62%  runtime.mapassign_faststr
  5156750000  1.38% 67.70%   5156750000  1.38%  sync.(*Mutex).Lock
  4943500000  1.32% 69.02%   4943500000  1.32%  github.com/prometheus/statsd_exporter/vendor/github.com/beorn7/perks/quantile.NewTargeted.func1
  4722000000  1.26% 70.28%   4722000000  1.26%  runtime.nextFreeFast
  4608250000  1.23% 71.51%   4608250000  1.23%  runtime.scanobject
  4420750000  1.18% 72.69%   4420750000  1.18%  sync.(*Mutex).Unlock
  4281750000  1.14% 73.84%   4281750000  1.14%  runtime.mapiterinit
  4207000000  1.12% 74.96%   4207000000  1.12%  runtime.heapBitsSetType
  3840250000  1.03% 75.99%   3840250000  1.03%  regexp.(*machine).tryBacktrack
  3628250000  0.97% 76.96%   3628250000  0.97%  [[vdso]]
  3424000000  0.91% 77.87%   3424000000  0.91%  runtime.memclrNoHeapPointers
  2755500000  0.74% 78.61%   2755500000  0.74%  runtime.heapBitsForObject
  2732500000  0.73% 79.34%   2732500000  0.73%  regexp.(*bitState).push
  2695000000  0.72% 80.06%   2695000000  0.72%  main.(*metricMapper).getMapping
  2112000000  0.56% 80.62%   2112000000  0.56%  regexp.(*Regexp).expand
  2089250000  0.56% 81.18%   2089250000  0.56%  github.com/prometheus/statsd_exporter/vendor/github.com/prometheus/common/model.hashAdd
  2047000000  0.55% 81.72%   2047000000  0.55%  runtime.growslice
  2043500000  0.55% 82.27%   2043500000  0.55%  runtime.greyobject
  1939750000  0.52% 82.79%   1939750000  0.52%  main.(*Exporter).Listen
  1877000000   0.5% 83.29%   1877000000   0.5%  runtime.makemap
The largest single consumer of CPU time was the kernel, i.e., system calls. This makes sense, as every UDP socket operation involves a system call. What caught our interest was that the Go regular expression engine accounted for around 37 percent of total CPU time, roughly twice as much as the system calls.
We also tried rebuilding the StatsD exporter from source using Go 1.10.3, which gave us this result:
(pprof) top 100
Showing nodes accounting for 75230000000, 82.30% of 91412500000 total
Dropped 542 nodes (cum <= 457062500)
        flat  flat%   sum%          cum   cum%
 17594750000 19.25% 19.25%  17594750000 19.25%  [[kernel.kallsyms]]
 16792500000 18.37% 37.62%  16792500000 18.37%  regexp/syntax.writeRegexp
  9554250000 10.45% 48.07%   9554250000 10.45%  regexp.onePassCopy
  6012750000  6.58% 54.65%   6012750000  6.58%  regexp.cleanupOnePass
  5165000000  5.65% 60.30%   5165000000  5.65%  runtime.sigaltstack
  4222250000  4.62% 64.92%   4222250000  4.62%  regexp.(*Regexp).allMatches
  2703500000  2.96% 67.87%   2703500000  2.96%  regexp.compile
  1195250000  1.31% 69.18%   1703500000  1.86%  runtime.mallocgc
  1159250000  1.27% 70.45%   1204500000  1.32%  runtime.mapiternext
  1112500000  1.22% 71.67%   1112500000  1.22%  runtime.settls
  1038500000  1.14% 72.80%   1038500000  1.14%  sync.(*RWMutex).Unlock
   912750000     1% 73.80%    912750000     1%  runtime.mapassign_faststr
   853000000  0.93% 74.73%    853000000  0.93%  regexp.(*machine).step
   840000000  0.92% 75.65%    869750000  0.95%  sync.(*WaitGroup).Wait
   814500000  0.89% 76.54%   1154750000  1.26%  runtime.greyobject
   780000000  0.85% 77.40%    780000000  0.85%  runtime.largeAlloc
   735000000   0.8% 78.20%    735000000   0.8%  [statsd_exporter]
   598750000  0.65% 78.86%    598750000  0.65%  runtime.(*itabTableType).(runtime.add)-fm
   588500000  0.64% 79.50%    593250000  0.65%  runtime.heapBitsSetType
   576750000  0.63% 80.13%    576750000  0.63%  regexp.makeOnePass.func1
   513250000  0.56% 80.69%    513250000  0.56%  [[vdso]]
   495250000  0.54% 81.23%    515000000  0.56%  crypto/cipher.NewCTR
   491000000  0.54% 81.77%    495500000  0.54%  runtime.mapdelete_fast32
   480750000  0.53% 82.30%    480750000  0.53%  runtime.memmove
Go 1.10.3 clearly performed better in the runtime itself, but we still saw about 19 percent of CPU time in system calls and roughly 40 percent in the regular expression engine.
Since the largest portion of CPU time was spent in regular expression matching, we started looking into the source code of the StatsD exporter. The exporter compiles each mapping rule into a regular expression, with every * wildcard becoming a capture group, and expands labels from the matching groups; each rule in a mapping config like the one shown above therefore turns into its own regular expression.
All labels will be expanded using the regex.ExpandString function.
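A rough sketch of this mechanism in Go follows. It mirrors the idea described above rather than the exporter's actual code; in particular, the ([^.]*) capture group used for each wildcard is an assumption, and the real exporter's generated pattern may differ.

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    match := "kong.*.*.*.request.count" // a rule from the mapping config above

    // Escape dots and turn each "*" into a capture group (assumed translation).
    pattern := "^" + strings.Replace(regexp.QuoteMeta(match), `\*`, `([^.]*)`, -1) + "$"
    re := regexp.MustCompile(pattern)

    event := "kong.client1.svc.route1.request.count" // illustrative metric name
    m := re.FindStringSubmatchIndex(event)
    if m == nil {
        fmt.Println("no match")
        return
    }

    // Expand a label template such as client: "$1" from the capture groups.
    client := re.ExpandString(nil, "$1", event, m)
    fmt.Printf("kong_requests_proxy{client=%q}\n", client)
}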
After reviewing the rules we used on Kong Cloud, it turned out that we didn't actually need the full power of regular expressions, because:
- wildcards in our rules only ever stand for a whole dot-delimited component (.*.); we never need patterns such as host.*.status.\d+ that match inside a component, and
- label values are simple templates over the captured components, such as host: "$1_$2", so they don't require full regular expression expansion.
Based on this observation, we refactored the StatsD exporter around a lightweight matcher whose limited feature set was just enough for our use cases. We implemented a new simple matching type, alongside the existing glob and regex matching types, that uses a finite state machine to mimic the behaviour of regular expression matching.
Every time the mapping rules are reloaded, the exporter rebuilds the state machine from them. For example, the following mappings:
mappings:
  - match: kong.*.*.*.request.count
    name: "kong_requests_proxy"
    labels:
      client: "$1"
      job: "kong_metrics"
  - match: kong.*.*.*.status.*
    name: "kong_status_code"
    labels:
      client: "$1"
      service: "$3"
      status_code: $4
      job: "kong_metrics"
  - match: kong.*.*.*.kong_latency
    name: "kong_latency_proxy_request"
    labels:
      client: "$1"
      job: "kong_metrics"
  - match: kong.*.*.*.upstream_latency
    name: "kong_latency_upstream"
    labels:
      client: "$1"
      job: "kong_metrics"
  - match: kong.*.*.*.cache_datastore_hits_total
    name: "kong_cache_datastore_hits_total"
    labels:
      client: "$1"
      job: "kong_metrics"
  - match: kong.*.*.*.cache_datastore_misses_total
    name: "kong_cache_datastore_misses_total"
    labels:
      client: "$1"
      job: "kong_metrics"
  - match: kong.*.node.*.shdict.*.free_space
    name: "kong_shdict_free_space"
    labels:
      client: "$1"
      node: "$2"
      shdict: "$3"
      job: "kong_metrics"
  - match: kong.*.node.*.shdict.*.capacity
    name: "kong_shdict_capacity"
    labels:
      client: "$1"
      node: "$2"
      shdict: "$3"
      job: "kong_metrics"
These mappings build a state machine whose states follow the dot-separated components of each metric name. Each leaf state carries the Prometheus metric name and a list of label formatters (for example, the client label formatted from capture $1):
type labelFormatter struct {
    captureIdx int
    fmtString  string
}
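To make this concrete, here is a minimal, hypothetical sketch of such a state machine built around the labelFormatter type above. Transitions are kept in a map keyed by the next literal component, with a separate wildcard branch, and matching walks the dot-separated components without ever touching the regular expression engine. This is an illustration of the approach, not the exporter's actual implementation:

// Hypothetical sketch of an FSM node for the simple matching type.
// It assumes the labelFormatter type shown above is in the same package.
type fsmNode struct {
    literals map[string]*fsmNode       // transition on an exact component
    wildcard *fsmNode                  // transition on "*", capturing the component
    name     string                    // Prometheus metric name, set on leaf nodes
    labels   map[string]labelFormatter // label name -> formatter over the captures
}

// match walks the dot-separated components of a StatsD metric name and
// returns the matched leaf node plus the captured wildcard components.
func (n *fsmNode) match(components, captures []string) (*fsmNode, []string, bool) {
    if len(components) == 0 {
        if n.name != "" {
            return n, captures, true
        }
        return nil, nil, false
    }
    head, rest := components[0], components[1:]
    if next, ok := n.literals[head]; ok {
        if leaf, caps, matched := next.match(rest, captures); matched {
            return leaf, caps, true
        }
    }
    if n.wildcard != nil {
        return n.wildcard.match(rest, append(captures, head))
    }
    return nil, nil, false
}

Looking up an event then costs a handful of map lookups proportional to the number of components, instead of running every rule's regular expression against the metric name.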
We ran the perf and pprof profiling again under the same workload with the simple matching type enabled and got the following:
(pprof) top 100
Showing nodes accounting for 60068250000, 71.04% of 84558750000 total
Dropped 480 nodes (cum <= 422793750)
        flat  flat%   sum%          cum   cum%
 25372500000 30.01% 30.01%  25372500000 30.01%  [[kernel.kallsyms]]
  3718750000  4.40% 34.40%   3718750000  4.40%  runtime.memmove
  2691000000  3.18% 37.59%   4659000000  5.51%  runtime.mallocgc
  2037750000  2.41% 40.00%   2161500000  2.56%  runtime.mapassign_faststr
  2019000000  2.39% 42.38%   2166500000  2.56%  runtime.heapBitsSetType
  1964500000  2.32% 44.71%   2003500000  2.37%  runtime.mapiternext
  1556250000  1.84% 46.55%   1556250000  1.84%  runtime.nextFreeFast (inline)
  1369000000  1.62% 48.17%   1587750000  1.88%  runtime.scanobject
  1263000000  1.49% 49.66%   1263000000  1.49%  github.com/beorn7/perks/quantile.NewTargeted.func1
  1258750000  1.49% 51.15%   1848750000  2.19%  regexp.(*machine).tryBacktrack
  1255000000  1.48% 52.63%   1255000000  1.48%  runtime.memclrNoHeapPointers
  1217250000  1.44% 54.07%   1255250000  1.48%  runtime.mapaccess2_faststr
  1145000000  1.35% 55.43%   1145000000  1.35%  [[vdso]]
  1063500000  1.26% 56.68%   1063500000  1.26%  runtime.aeshashbody
   940500000  1.11% 57.80%    940500000  1.11%  main.(*metricMapper).getMapping
   870250000  1.03% 58.83%    870250000  1.03%  regexp.(*inputString).step
   802500000  0.95% 59.77%    802500000  0.95%  regexp/syntax.(*Inst).MatchRunePos
   796750000  0.94% 60.72%    815750000  0.96%  runtime.mapaccess1_faststr
   788250000  0.93% 61.65%    788250000  0.93%  github.com/prometheus/common/model.hashAdd
   784250000  0.93% 62.58%    784250000  0.93%  main.(*Exporter).Listen
   696750000  0.82% 63.40%    749750000  0.89%  runtime.heapBitsForObject
   695000000  0.82% 64.22%    695000000  0.82%  strings.genSplit
   675000000   0.8% 65.02%    675000000   0.8%  syscall.Syscall6
   615750000  0.73% 65.75%    615750000  0.73%  regexp.(*machine).backtrack
   560000000  0.66% 66.41%    603750000  0.71%  github.com/prometheus/common/model.LabelsToSignature
   521500000  0.62% 67.03%    521500000  0.62%  runtime.unlock
   510750000   0.6% 67.63%    510750000   0.6%  runtime.lock
   482250000  0.57% 68.20%    588250000   0.7%  runtime.mapiterinit
   468250000  0.55% 68.76%    468250000  0.55%  github.com/beorn7/perks/quantile.(*stream).merge
   454250000  0.54% 69.29%    454250000  0.54%  unicode/utf8.ValidString
   427500000  0.51% 69.80%    427500000  0.51%  runtime.indexbytebody
   374500000  0.44% 70.24%    428250000  0.51%  runtime.growslice
   337500000   0.4% 70.64%    590000000   0.7%  regexp.(*bitState).push (inline)
   335500000   0.4% 71.04%    445000000  0.53%  runtime.greyobject
Note that under the same workload, a larger share of CPU time spent in system calls is actually a good sign: it means less time is being spent on overhead outside the unavoidable socket I/O.
Now less than five percent of CPU time is spent in Go's regular expression library, and the remaining regular expression calls come from the Prometheus client_golang library.
We also ran a micro-benchmark that iterated only the matching function 100,000 times; the simple matching type took 40 percent less time than the glob matching type. With glob matching removed entirely, rather than kept as a fallback for events the simple matcher can't match, the iteration took 60 percent less time.
There are a few caveats with the simple matcher:
For example, if the rules are:

client.*.request.size.*
client.*.*.size.*

matching will fail for an event such as client.aaa.response.size.100. The correct rules in this case are:

client.*.response.status.*
client.*.response.size.*
After optimizing away the overhead introduced by the Go regex engine, most of the CPU time is now spent in system calls and the Go runtime. On Kong Cloud, we use the StatsD Advanced Plugin to batch the UDP packets generated for each request, which significantly reduces the number of system calls. This benefits both the sender (the Kong instances) and the receiver (the StatsD exporter).
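The StatsD line protocol allows several events to be packed into a single datagram, separated by newlines, so one write covers many metrics. A minimal sketch of what batching looks like from the sender's side (the address and metric names are illustrative, and this is not the plugin's actual code):

package main

import (
    "net"
    "strings"
)

func main() {
    // Assumed exporter address; adjust to your deployment.
    conn, err := net.Dial("udp", "127.0.0.1:9125")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Events collected for one proxied request (illustrative names).
    events := []string{
        "kong.client1.svc.route1.request.count:1|c",
        "kong.client1.svc.route1.status.200:1|c",
        "kong.client1.svc.route1.kong_latency:5|ms",
    }

    // One newline-separated datagram means one system call instead of three.
    if _, err := conn.Write([]byte(strings.Join(events, "\n"))); err != nil {
        panic(err)
    }
}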
Kong Cloud delivers faster innovation through zero-touch, automatic updates and cloud-speed cadence of new product functionality. It offers the unique flexibility of running on any cloud option, including AWS, Azure and Google Cloud. Kong Cloud is currently in beta and will be available to the general public soon. To learn more about Kong Cloud, sign up for product updates at https://konghq.com/cloud/.
This work is included as a major enhancement in the Prometheus StatsD Exporter as of v0.8.0-rc1, which was released on October 10th, 2018. Check it out here!