Kong can crash on ARM64 platforms (machines with Apple M1/M2 chips or any other ARM64 CPU). The error message shows that the crash is triggered by the SIGILL signal, which means the process tried to execute an illegal instruction. The root cause turns out to be a bug in the LuaJIT ARM64 JIT compiler. This post records how the bug was found and fixed.
How to reproduce the error
To reproduce the error, I created an Apple M1 instance running macOS 13.4 in AWS, because I don't have an Apple ARM64 laptop at hand. Kong Enterprise is required, as this error only happens with Vitals enabled.
We also need to enable JIT in kong/init.lua, because in the current master version (3.4) JIT is disabled to avoid this crash.
Since we know the error is caused by the JIT compiler (remember, disabling JIT solves the issue), we can make the reproduction easier by making JIT compilation happen more often. One parameter controls how LuaJIT detects hot traces: hotloop. Its default value is 56, which means that once a loop (or call) has run more than 56 times, LuaJIT triggers JIT compilation for it. Setting the value to 1 therefore triggers JIT compilation much more frequently.
Here is the patch to enable JIT and tune the hotloop switch:
diff --git a/kong/init.lua b/kong/init.lua
index fd200e134..0ecc68047 100644
--- a/kong/init.lua
+++ b/kong/init.lua
@@ -45,9 +45,17 @@ pcall(require,"luarocks.loader")
 -- Silicon-based machines are used mostly in development and local
 -- testing / playground mode.
 --
 local M1 = jit and jit.os == "OSX" and jit.arch == "arm64"
 if M1 then
-  jit.off() -- jit is enabled by default after removing this line
+  jit.opt.start("hotloop=1")
 end
After applying the above change, build and run Kong with Vitals enabled using the following commands:
$ export KONG_VITALS=on
$ make dev # in the kong-ee project root directory
$ . bazel-bin/build/kong-dev-venv.sh
(kong-dev) $ kong
After Kong has run for several minutes, the following errors appear in error.log:
2023/07/04 18:45:30 [notice] 17217#0: signal 20 (SIGCHLD) received from 17436
2023/07/04 18:45:30 [alert] 17217#0: worker process 17436 exited on signal 4
The worker process 17436 is killed by signal 4 (a.k.a. SIGILL).
How to debug the error
Since the worker process is killed by SIGILL, we can use a debugger to capture the context of the error. On macOS, we use LLDB and attach it to the worker process. Because the crash takes several minutes to happen, we can look up the worker process PID in error.log beforehand, attach LLDB to that PID with the following command, and wait for the crash.
$ lldb -p ${WORKER_PID}
After it crashes, we can get the crash context in LLDB:
Consistent with error.log, LLDB reports that the process crashes on an illegal instruction; in this run, the illegal instruction word is 0xfffbe79a.
The backtrace does not help much in identifying which code causes the error, because it only shows:
the JITed code frame (frame #0: 0x0000000104623e5c) and
the LuaJIT function that calls that JITed code (frame #1: lj_vm_resume, which is implemented in assembly in the interpreter source vm_arm64.dasc).
Use the LuaJIT dump tool to find where the error happens
When JITed code crashes, as in our case, the LuaJIT dump tool can help identify where the error happens. Here is the patch to enable LuaJIT dumping in Kong:
diff --git a/kong/init.lua b/kong/init.lua
index 0ecc68047..b565153ce 100644
--- a/kong/init.lua
+++ b/kong/init.lua
@@ -54,8 +54,11 @@ if M1 then
   jit.opt.start("hotloop=1")
+  local dump = require "jit.dump"
+  dump.on("+bimT", "/Users/ec2-user/projects/kong-ee/luajit_logs/jit_dump.log")
This change dumps the bytecode, IR, and machine code. After applying it, rerun Kong, wait for it to crash, and then stop Kong. Stopping Kong keeps jit_dump.log from growing any further; in my case, it is a 75 MB text file. Then I searched jit_dump.log for lines similar to .long 0xfffcd399.
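The search through the dump can be scripted. The following is a minimal sketch, assuming the log uses the "---- TRACE <n> start <file:line>" headers and ".long 0x..." machine code lines seen in this post. The function name find_undecoded_words is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list every ".long" word in a jit.dump log together
# with the trace it belongs to. A ".long" line means the disassembler could
# not decode that word as an instruction.
find_undecoded_words() {
    awk '
        # Trace header, e.g. "---- TRACE 88 start init.lua:733"
        $1 == "----" && $2 == "TRACE" && $4 == "start" { trace = $3; where = $5 }
        # Machine code line, e.g. "100bc9418 .long 0xfffbe379"
        $2 == ".long" { print "trace " trace " (" where "): " $1 " " $3 }
    ' "$1"
}
```

Calling find_undecoded_words with the path to jit_dump.log prints one line per undecodable word, so even a 75 MB log reduces to a short list of trace numbers and source positions worth reading in full.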
Here is what I found:
In Section 3 (the arm64 machine code section), there are two illegal instructions, 0xfffbe379 and 0xfffcdf78. This means that running this hot trace ends in a crash with the SIGILL signal.
In Section 1 (the bytecode section with source lines), there is a line with the comment "proxy_latency_max" (init.lua:757). This helps identify the Lua code that triggers the error: searching for "proxy_latency_max" shows that it comes from vitals/init.lua.
---- TRACE 88 start init.lua:733
Section 1: Bytecode with source line section start.
0032 SUBVN 14 13 0 ; 1 (init.lua:734)
0033 TGETV 14 9 14 (init.lua:734)
... 31 not related lines are omitted to keep the doc smaller.
0092 TGETS 19 14 12 ; "proxy_latency_max" (init.lua:757)
0000 . . FUNCC ; ffi.meta.__index
0093 ISF 15 (init.lua:760)
0094 JMP 20 => 0104
0104 TGETS 20 14 13 ; "ulat_min" (init.lua:760)
0000 . . FUNCC ; ffi.meta.__index
0105 ISF 15 (init.lua:761)
0192 FORL 10 => 0032 (init.lua:733)
Section 1: Bytecode with source line section end.
---- TRACE 88 IR
Section 2: IR start.
0001 int SLOAD #13 RI
0002 > int LE 0001 +2147483646
... 98 not related lines are omitted to keep the doc smaller.
0100 nil ASTORE 0080 nil
0101 nil ASTORE 0082 nil
0102 + int ADD 0003 +1
0103 > int LE 0102 0001
0104 ------ LOOP ------------
0105 i64 CONV 0102 i64.int
... 200 not related lines are omitted to keep the doc smaller.
0126 num CONV 0125 num.u32 -- suspect start
0127 p64 ADD 0107 -36
0128 int XLOAD 0127
0129 p64 ADD 0107 -32
0130 u32 XLOAD 0129
0131 num CONV 0130 num.u32
0132 p64 ADD 0107 -28
0133 int XLOAD 0132
0134 p64 ADD 0107 -24
0135 u32 XLOAD 0134
0136 num CONV 0135 num.u32 -- suspect end
0137 p64 ADD 0107 -20
... 46 not related lines are omitted to keep the doc smaller.
0183 + int ADD 0102 +1
0184 > int LE 0183 0001
0185 int PHI 0102 0183
Section 2: IR end.
---- TRACE 88 mcode 992
Section 3: arm64 machine code start.
100bc910c sub sp, sp, #144
100bc9110 str x19, [sp, #144]
... 55 not related lines are omitted to keep the doc smaller.
100bc937c cmp w28, w19
100bc9380 bgt 0x00bc950c ->5
->LOOP:
100bc9384 ldr x30, 0x00acc400
... 32 not related lines are omitted to keep the doc smaller.
100bc9408 ldur w25, [x27, #-44]
100bc940c ucvtf d13, w25
100bc9410 ldur w25, [x27, #-40]
100bc9414 ucvtf d12, w25
100bc9418 .long 0xfffbe379 -- error instruction 1
100bc941c ucvtf d15, w24
100bc9420 .long 0xfffcdf78 -- error instruction 2
100bc9424 ucvtf d11, w23
100bc9428 ldur w23, [x27, #-20]
100bc942c ucvtf d10, w23
100bc9430 ldur w23, [x27, #-16]
... 43 not related lines are omitted to keep the doc smaller.
100bc94e0 cmp w28, w19
100bc94e4 ble 0x00bc9384 ->LOOP
100bc94e8 b 0x00bc9524 ->11
Section 3: arm64 machine code end.
---- TRACE 88 stop -> loop
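One way to read the two raw words is to slice them with the field layout of the A64 LDP (load pair, signed offset, 32-bit) instruction. This reading is an assumption added here for illustration and is not taken from the analysis in this post; the helper names decode_as_ldp and encode_ldp are hypothetical. The sketch also encodes the same load pair twice, once with the offset truncated to its 7-bit field and once with a negative offset left unmasked.

```shell
#!/bin/sh
# Slice a 32-bit word using the A64 LDP (signed offset, 32-bit) field layout:
# Rt = bits 4:0, Rn = bits 9:5, Rt2 = bits 14:10, imm7 = bits 21:15.
decode_as_ldp() {
    word=$(( $1 ))
    rt=$(( word & 31 ))
    rn=$(( (word >> 5) & 31 ))
    rt2=$(( (word >> 10) & 31 ))
    imm7=$(( (word >> 15) & 127 ))
    # imm7 is a signed 7-bit field, scaled by 4 bytes for 32-bit registers.
    if [ "$imm7" -ge 64 ]; then imm7=$(( imm7 - 128 )); fi
    echo "rt=w$rt rt2=w$rt2 rn=x$rn offset=$(( imm7 * 4 ))"
}

# Encode "ldp wRt, wRt2, [xRn, #offset]". With mask=1 the scaled offset is
# truncated to 7 bits as the encoding requires; with mask=0 a negative offset
# keeps its sign bits and overwrites the opcode bits above bit 21.
encode_ldp() {
    rt=$1; rt2=$2; rn=$3; offset=$4; mask=$5
    imm=$(( offset / 4 ))
    if [ "$mask" -eq 1 ]; then imm=$(( imm & 127 )); fi
    word=$(( 0x29400000 | (imm * 32768) | (rt2 * 1024) | (rn * 32) | rt ))
    printf '0x%08x\n' $(( word & 0xffffffff ))
}

decode_as_ldp 0xfffbe379    # rt=w25 rt2=w24 rn=x27 offset=-36
decode_as_ldp 0xfffcdf78    # rt=w24 rt2=w23 rn=x27 offset=-28
encode_ldp 25 24 27 -36 1   # 0x297be379
encode_ldp 25 24 27 -36 0   # 0xfffbe379
```

The decoded registers and offsets line up with the surrounding dump: x27 is the base register of the neighboring ldur instructions, w24 and w23 feed the ucvtf instructions that follow each raw word, and the offsets -36 and -28 appear in IR lines 0127 and 0132. The unmasked encoding reproduces the first illegal word exactly, which is consistent with a fused load pair whose negative offset was not truncated to its field, though the dump alone does not prove that this is the cause.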
Create minimal test case to reproduce the error
This step speeds up debugging a lot, because debugging with the full Kong code base is slow (it takes several minutes to reach the crash point) and complex. I extracted the relevant source from kong/vitals/init.lua and reduced it to the minimal case below, saved as test.lua:
local ffi = require "ffi"
ffi.cdef[[
typedef struct vitals_metrics_s {
    int32_t m1;
    int32_t m2;
} data;
]]
local const_data_ptr = ffi.typeof("const data*")
local out_data = {}
local metrics = ffi.new("data[10]")
local data = ffi.cast(const_data_ptr, metrics)
for i = 1, 10 do
    local c = data[i - 1]
    local m1 = c.m1
    local m2 = c.m2
    out_data[i] = { m1, m2, }
end
print(out_data)
Find the error in LuaJIT
With the minimal test case above, the error is easy to reproduce:
$ luajit -Ohotloop=1 test.lua
killed by SIGILL
$ lldb luajit -- -Ohotloop=1 test.lua # debug it
Instruction fusion is an optimization that combines multiple instructions into a single instruction. It belongs to the "instruction selection" part of standard compiler code generation. Depending on the underlying CPU architecture, fusion can produce more efficient code. For example, many CPUs have an MLA (multiply-accumulate) instruction, which performs a multiplication and an addition in a single instruction. If the compiler finds a matching "multiply; add" instruction sequence, it fuses the two into one MLA instruction, provided all conditions are met. Without this optimization, two instructions are generated instead of one.
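To make the idea concrete, here is a toy text-level peephole pass. It is only a sketch and not how LuaJIT implements fusion: it works on assembly text, uses the A64 multiply-add mnemonic madd, and checks a single simplified condition, namely that the add overwrites the register holding the product, so the intermediate value cannot be needed later. The function name fuse_mul_add is hypothetical.

```shell
#!/bin/sh
# Toy peephole pass over assembly text read from stdin:
#   mul t, a, b
#   add t, t, c      ->   madd t, a, b, c
# Any other sequence is passed through unchanged.
fuse_mul_add() {
    awk '
        # Print a held-back mul that turned out not to be fusable.
        function flush() { if (held != "") print held; held = "" }
        {
            line = $0
            gsub(/,/, " ")   # split "mul x8, x1, x2" into mul x8 x1 x2
            # Fuse only if the add writes t and exactly one operand is t.
            if (held != "" && $1 == "add" && $2 == t && ($3 == t) != ($4 == t)) {
                addend = ($3 == t) ? $4 : $3
                print "madd " t ", " a ", " b ", " addend
                held = ""
                next
            }
            flush()
            if ($1 == "mul" && NF == 4) { held = line; t = $2; a = $3; b = $4 }
            else print line
        }
        END { flush() }
    '
}
```

Feeding it the two lines "mul x8, x1, x2" and "add x8, x8, x3" yields the single line "madd x8, x1, x2, x3", while a mul followed by anything else is emitted unchanged. A real compiler has to verify more conditions than this sketch does, and getting one of them wrong is how a fused instruction can end up invalid.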
I have also verified that, with the LuaJIT fix applied, Kong runs for more than 10 hours without hitting the SIGILL crash again.
By the way, this error affects all ARM64 platforms and is OS independent. On an EC2 ARM64 Linux instance, we run into the same error:
# luajit without the fix.
(kong-dev)[ec2-user@ip-172-31-17-124 kong-ee]$ luajit -Ohotloop=1 test.lua
Illegal instruction (core dumped)
(kong-dev)[ec2-user@ip-172-31-17-124 kong-ee]$ uname -a
Linux ip-172-31-17-124.eu-west-2.compute.internal 6.1.34-58.102.amzn2023.aarch64 #1 SMP Tue Jun 27 21:37:45 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
The fix has been submitted to LuaJIT upstream: https://github.com/LuaJIT/LuaJIT/pull/1028