How to Fix a SIGILL Kong Crash on Arm64
Kong can crash on ARM64 platforms (machines with Apple M1/M2 chips or any other ARM64 hardware). The error message shows that the crash is triggered by the SIGILL signal, which means the worker process executed an illegal instruction. It turns out to be caused by a bug in the LuaJIT ARM64 JIT compiler. This post records how the bug was found and fixed.
How to reproduce the error
To reproduce the error, I created an Apple M1 instance running macOS 13.4 in AWS, because I don't have an Apple ARM64 laptop at hand. Kong Enterprise is required, as this error only happens with Vitals enabled.
We also need to enable JIT in kong/init.lua, because in the current master version (3.4) the JIT is disabled to avoid this crash.
Since we know the error is caused by the JIT compiler (remember, disabling JIT solves the issue), we can make the reproduction easier by making JIT compilation in LuaJIT happen more often. One switch controls how LuaJIT detects hot traces: hotloop. Its default value is 56, which means that once a loop (or call) has run more than 56 times, it triggers JIT compilation in LuaJIT. Setting the value to 1 therefore triggers JIT compilation much more frequently.
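To make the effect of the hotloop switch concrete, here is a toy model of a hot-loop detector. It is only a sketch under the simplified rule described above (compile once a loop has run more than hotloop times); LuaJIT's real hot counting is implemented differently, and the class name, function name, and loop label here are hypothetical.

```python
from collections import defaultdict


class HotCounter:
    """Toy hot-loop detector (not LuaJIT's real bookkeeping)."""

    def __init__(self, hotloop=56):
        self.hotloop = hotloop
        self.counts = defaultdict(int)
        self.compiled = []  # loops handed to the imaginary JIT compiler

    def on_iteration(self, loop_id):
        """Count one iteration; return True when the loop first becomes hot."""
        self.counts[loop_id] += 1
        if self.counts[loop_id] > self.hotloop and loop_id not in self.compiled:
            self.compiled.append(loop_id)
            return True
        return False


def iterations_until_hot(hotloop, limit=1000):
    """Return the iteration number at which the loop gets compiled."""
    counter = HotCounter(hotloop)
    for i in range(1, limit + 1):
        if counter.on_iteration("hypothetical-loop"):
            return i
    return None


print(iterations_until_hot(56), iterations_until_hot(1))  # 57 2
```

In this model, lowering the threshold from 56 to 1 means even a loop that runs only twice gets compiled, which is why far more code goes through the JIT backend and the faulty code path is reached sooner.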
Here is the patch to enable JIT and tune the hotloop switch:
After applying the above change, build and run Kong with Vitals enabled using the following command:
After it has run for several minutes, an error appears in error.log:
The worker process 17436 is killed by Signal 4 (a.k.a. SIGILL).
How to debug the error
Since the worker process is killed by SIGILL, we can use a debugger to capture the context of the error. On macOS, we use LLDB and attach it to the worker process. Because it takes several minutes to crash, we can look up the worker process PID in error.log before the crash, attach LLDB to that PID with the following command, and wait for the crash.
After it crashes, we can get the crash context in LLDB:
As shown in the error.log, it crashes due to an illegal instruction. In the above case, the illegal instruction data is 0xfffbe79a.
The backtrace does not help much in identifying which code causes the error, because it only shows:
- the JITed code frame (frame #0: 0x0000000104623e5c), and
- the LuaJIT function that calls the JITed code (frame #1: lj_vm_resume, which is implemented in assembly in the interpreter source vm_arm64.dasc).
Use the LuaJIT dump tool to find where the error happens
When JIT-compiled code crashes, as in our case, the LuaJIT dump tool can help identify where the error happens. Here is the patch to enable dumping in LuaJIT in Kong:
This change dumps the bytecode, IR, and machine code. After applying it, rerun Kong, wait for it to crash, and then stop Kong. Stopping Kong keeps jit_dump.log from growing further; in my case, it was a 75 MB text file. Then I searched jit_dump.log for lines like .long 0xfffcd399, which is how the dump shows a machine word that the disassembler cannot decode as an instruction.
Here is what I found:
- In Section 3 (the arm64 machine code section), there are 2 illegal instructions, 0xfffbe379 and 0xfffcdf78. This means that running this hot trace ends in a SIGILL crash.
- In Section 1 (the bytecode with source line section), there is a line comment "proxy_latency_max" (init.lua:757). This helps identify the Lua code that causes the error. Searching for "proxy_latency_max" shows that the error comes from vitals/init.lua.
- In Section 2 (the IR section; IR means intermediate representation), we can guess that the illegal instruction is probably produced by the ADD, XLOAD code generation part of the LuaJIT arm64 compiler backend, because there is a ucvtf d12, w25 instruction right above the illegal instruction 0xfffbe379. According to the Arm64 instruction documentation, ucvtf performs a number conversion, which matches the line 0126 num CONV 0125 num.u32 in the IR section. This finding tells us where to start debugging in LuaJIT.
---- TRACE 88 start init.lua:733
Section 1: Bytecode with source line section start.
0032 SUBVN 14 13 0 ; 1 (init.lua:734)
0033 TGETV 14 9 14 (init.lua:734)
... 31 not related lines are omitted to keep the doc smaller.
0092 TGETS 19 14 12 ; "proxy_latency_max" (init.lua:757)
0000 . . FUNCC ; ffi.meta.__index
0093 ISF 15 (init.lua:760)
0094 JMP 20 => 0104
0104 TGETS 20 14 13 ; "ulat_min" (init.lua:760)
0000 . . FUNCC ; ffi.meta.__index
0105 ISF 15 (init.lua:761)
0192 FORL 10 => 0032 (init.lua:733)
Section 1: Bytecode with source line section end.
---- TRACE 88 IR
Section 2: IR start.
0001 int SLOAD #13 RI
0002 > int LE 0001 +2147483646
... 98 not related lines are omitted to keep the doc smaller.
0100 nil ASTORE 0080 nil
0101 nil ASTORE 0082 nil
0102 + int ADD 0003 +1
0103 > int LE 0102 0001
0104 ------ LOOP ------------
0105 i64 CONV 0102 i64.int
... 200 not related lines are omitted to keep the doc smaller.
0126 num CONV 0125 num.u32 -- suspect start
0127 p64 ADD 0107 -36
0128 int XLOAD 0127
0129 p64 ADD 0107 -32
0130 u32 XLOAD 0129
0131 num CONV 0130 num.u32
0132 p64 ADD 0107 -28
0133 int XLOAD 0132
0134 p64 ADD 0107 -24
0135 u32 XLOAD 0134
0136 num CONV 0135 num.u32 -- suspect end
0137 p64 ADD 0107 -20
... 46 not related lines are omitted to keep the doc smaller.
0183 + int ADD 0102 +1
0184 > int LE 0183 0001
0185 int PHI 0102 0183
Section 2: IR end.
---- TRACE 88 mcode 992
Section 3: arm64 machine code start.
100bc910c sub sp, sp, #144
100bc9110 str x19, [sp, #144]
... 55 not related lines are omitted to keep the doc smaller.
100bc937c cmp w28, w19
100bc9380 bgt 0x00bc950c ->5
->LOOP:
100bc9384 ldr x30, 0x00acc400
... 32 not related lines are omitted to keep the doc smaller.
100bc9408 ldur w25, [x27, #-44]
100bc940c ucvtf d13, w25
100bc9410 ldur w25, [x27, #-40]
100bc9414 ucvtf d12, w25
100bc9418 .long 0xfffbe379 -- error instruction 1
100bc941c ucvtf d15, w24
100bc9420 .long 0xfffcdf78 -- error instruction 2
100bc9424 ucvtf d11, w23
100bc9428 ldur w23, [x27, #-20]
100bc942c ucvtf d10, w23
100bc9430 ldur w23, [x27, #-16]
... 43 not related lines are omitted to keep the doc smaller.
100bc94e0 cmp w28, w19
100bc94e4 ble 0x00bc9384 ->LOOP
100bc94e8 b 0x00bc9524 ->11
Section 3: arm64 machine code end.
---- TRACE 88 stop -> loop
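Searching a 75 MB dump by hand is tedious, so the lookup can be scripted. The following is a minimal sketch that assumes the dump lines look like the excerpt above (a "---- TRACE N start ..." header and "address .long 0x..." lines); the function name and regular expressions are my own, and they may need adjusting for other dump layouts.

```python
import re

# Patterns modelled on the dump excerpt above; other layouts may differ.
TRACE_START = re.compile(r"^---- TRACE (\d+) start (.+)$")
RAW_WORD = re.compile(r"^\s*([0-9a-f]+)\s+\.long\s+(0x[0-9a-f]{8})")


def find_undecoded_words(lines):
    """Yield (trace number, start location, address, word) per .long line."""
    trace, where = None, None
    for line in lines:
        m = TRACE_START.match(line)
        if m:
            trace, where = int(m.group(1)), m.group(2)
            continue
        m = RAW_WORD.match(line)
        if m:
            yield trace, where, m.group(1), m.group(2)


sample = """\
---- TRACE 88 start init.lua:733
0032 SUBVN 14 13 0 ; 1 (init.lua:734)
---- TRACE 88 mcode 992
100bc9414 ucvtf d12, w25
100bc9418 .long 0xfffbe379
100bc941c ucvtf d15, w24
100bc9420 .long 0xfffcdf78
---- TRACE 88 stop -> loop
""".splitlines()

hits = list(find_undecoded_words(sample))
for hit in hits:
    print(hit)
```

For the real log, passing an open file object (for example the jit_dump.log file mentioned above) instead of the sample list streams the file line by line, so the whole dump never has to fit in memory.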
Create a minimal test case to reproduce the error
This step speeds up the debugging process a lot, because debugging with all of the Kong code is slow (it takes several minutes to reach the crash point) and complex. I extracted the relevant source from kong/vitals/init.lua and reduced it to the minimal case below that triggers the error, saved to the file test.lua:
Find the error in LuaJIT
With the above minimal test case, we can easily reproduce the error:
From the third finding of the dump analysis, we suspect there is an error in the ADD, XLOAD code generation part of the LuaJIT ARM64 compiler backend. We can set breakpoints on the compiler backend function asm_ir() and on the XLOAD assembler code, and check whether the illegal instruction is generated by this part of the code.
It turns out that the illegal instruction is generated in the instruction-fusing function asm_fusexref(), which is part of the LuaJIT compiler backend.
Instruction fusing is an optimization that combines multiple operations into a single instruction. It belongs to the "instruction selection" phase of standard compiler code generation. Depending on the underlying CPU architecture, fusing lets the compiler generate more efficient code. For example, many CPUs have an MLA instruction, which performs a multiplication and an addition in a single instruction. If the compiler finds a matching instruction sequence "multiply; add", it fuses the two into one MLA instruction, provided all conditions are met. Without this optimization, two instructions are generated instead of one.
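The "multiply; add" example can be sketched as a tiny peephole pass. This is a toy illustration on a made-up tuple-based instruction list, not how LuaJIT represents or fuses its IR; the register names and both function names are hypothetical. The condition checked here is that the product is used once by the add and is not read afterwards.

```python
def fuse_mul_add(instrs):
    """Rewrite 'mul t, a, b; add d, t, c' as 'mla d, a, b, c' when safe."""
    out, i = [], 0
    while i < len(instrs):
        cur = instrs[i]
        if cur[0] == "mul" and i + 1 < len(instrs) and instrs[i + 1][0] == "add":
            _, tmp, a, b = cur
            _, dst, x, y = instrs[i + 1]
            others = [r for r in (x, y) if r != tmp]
            # The product must not be read later (conservative check),
            # unless the add overwrites it anyway.
            tmp_dead = dst == tmp or all(tmp not in ins[2:] for ins in instrs[i + 2:])
            if len(others) == 1 and tmp_dead:
                out.append(("mla", dst, a, b, others[0]))
                i += 2
                continue
        out.append(cur)
        i += 1
    return out


def run(instrs, regs):
    """Interpret the toy instructions to compare both program versions."""
    regs = dict(regs)
    for op, dst, *src in instrs:
        v = [regs[s] for s in src]
        if op == "mul":
            regs[dst] = v[0] * v[1]
        elif op == "add":
            regs[dst] = v[0] + v[1]
        elif op == "mla":
            regs[dst] = v[0] * v[1] + v[2]
    return regs


program = [("mul", "t0", "r1", "r2"), ("add", "r0", "t0", "r3")]
fused = fuse_mul_add(program)
inputs = {"r1": 3, "r2": 4, "r3": 5}
print(fused)                    # [('mla', 'r0', 'r1', 'r2', 'r3')]
print(run(fused, inputs)["r0"])  # 17
```

If a later instruction still reads t0, the pass leaves the pair alone, which mirrors the "if all conditions are met" caveat: fusing is only valid when it does not change what the rest of the code observes.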
In the minimal test case, the ADD and XLOAD are fused because the ARM64 architecture has an instruction that performs both operations together (the LDR with offset instruction). LuaJIT can apply several fusing steps in a row: in this case, two ADDs are fused with their XLOADs, which produces two LDR instructions, and those two LDR instructions are then fused into a single LDP instruction, which is implemented in emit_lso(). Here is the dump of the minimal test case:
---- TRACE 2 start fuse_test.lua:17
0022 SUBVN 9 8 0 ; 1 (fuse_test.lua:18)
0023 TGETV 9 4 9 (fuse_test.lua:18)
0000 . . FUNCC ; ffi.meta.__index
0024 TGETS 10 9 9 ; "m1" (fuse_test.lua:19)
0000 . . FUNCC ; ffi.meta.__index
0025 TGETS 11 9 10 ; "m2" (fuse_test.lua:20)
0000 . . FUNCC ; ffi.meta.__index
0026 TNEW 12 3 (fuse_test.lua:22)
0027 TSETB 10 12 1 (fuse_test.lua:23)
0028 TSETB 11 12 2 (fuse_test.lua:24)
0029 TSETV 12 2 8 (fuse_test.lua:25)
0030 FORL 5 => 0022 (fuse_test.lua:17)
---- TRACE 2 IR
.... SNAP #0 [ ---- ---- ]
0001 x28 int SLOAD #7 I
... 32 not related lines are omitted to keep the doc smaller.
0033 ------------ LOOP ------------
0034 x27 i64 CONV 0031 i64.int
0035 i64 BSHL 0034 +3
0036 x27 p64 ADD 0035 0006
0037 p64 ADD 0036 -8 -- fuse with XLOAD
0038 x26 int XLOAD 0037
0039 p64 ADD 0036 -4 -- fuse with XLOAD
0040 x27 int XLOAD 0039
0041 x0 > tab TNEW #3 #0
... 11 not related lines are omitted to keep the doc smaller.
0052 > int LE 0051 +10
0053 x28 int PHI 0031 0051
---- TRACE 2 mcode 396
... not related lines are omitted to keep the doc smaller.
104db7d0c bgt 0x04db7db0 ->2
->LOOP:
104db7d10 ldr x30, 0x04d68400
104db7d14 ldr x1, 0x04d68408
104db7d18 cmp x30, x1
104db7d1c bls 0x04db7d34
104db7d20 mov x1, #1
104db7d24 mov x0, x22
104db7d28 bl 0x04b685b0 ->lj_gc_step_jit
104db7d2c orr x30, x30, x30
104db7d30 cbnz w0, 0x04db7db4 ->3
104db7d34 mov x1, #3
104db7d38 ldr x0, 0x04d68560
104db7d3c mov x27, x28
104db7d40 add x27, x25, x27, lsl #3
104db7d44 .long 0xffff6f7a -- fuse error generates an ill instruction
104db7d48 bl 0x04b76e60 ->lj_tab_new1
... not related lines are omitted to keep the doc smaller.
104db7d98 b 0x04db7dbc ->5
---- TRACE 2 stop -> loop
zsh: illegal hardware instruction luajit -Ohotloop=1 -jdump=tbimsr fuse_test.lua
The error is in emit_lso() and shows up when the offset is negative. In our case, the offset is -8 and comes from the 0037 p64 ADD 0036 -8 instruction. A single-line change fixes the issue:
In the LDP instruction, the offset field is only 7 bits wide, so the value must be masked with 0x7f. Otherwise, when ofsm is negative (-8 = 0xfffffff8), its sign-extension bits overwrite the upper bits of the instruction word, and the existing implementation emits a word like 0xffff6f7a, which is an illegal instruction. After applying the above fix in LuaJIT, our minimal case runs successfully with the following code generated:
102ecfd40 ldr x0, [x22, #368]
102ecfd44 mov x27, x28
102ecfd48 add x27, x25, x27, lsl #3
102ecfd4c ldp w26, w27, [x27, #-8] -- fixed.
102ecfd50 bl 0x0082fce0 ->lj_tab_new1
102ecfd54 add x30, x24, w26, uxtw
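The masking argument can be checked with a small model of how the LDP word is packed. This is a sketch based on the A64 encoding of the 32-bit signed-offset LDP form (base opcode 0x29400000, imm7 in bits 21:15, offset scaled by 4); it is not the LuaJIT source, and the function names and the mask_imm7 flag are my own.

```python
LDP_W = 0x29400000  # LDP Wt1, Wt2, [Xn, #imm]: 32-bit, signed-offset form


def encode_ldp_w(rt, rt2, rn, ofs, mask_imm7=True):
    """Encode LDP Wrt, Wrt2, [Xrn, #ofs] as a 32-bit instruction word."""
    assert ofs % 4 == 0 and -256 <= ofs <= 252, "offset must fit scaled imm7"
    imm = ofs >> 2          # the field stores the offset in units of 4 bytes
    if mask_imm7:
        imm &= 0x7F         # keep only the 7-bit two's-complement field
    word = LDP_W | (imm << 15) | (rt2 << 10) | (rn << 5) | rt
    return word & 0xFFFFFFFF  # model a 32-bit machine word


def decode_imm7(word):
    """Recover the byte offset stored in bits 21:15 of an LDP word."""
    imm = (word >> 15) & 0x7F
    if imm & 0x40:          # sign bit of the 7-bit field
        imm -= 0x80
    return imm * 4


# ldp w26, w27, [x27, #-8] from the minimal test case
print(hex(encode_ldp_w(26, 27, 27, -8, mask_imm7=False)))  # 0xffff6f7a
print(hex(encode_ldp_w(26, 27, 27, -8)))                   # 0x297f6f7a

# The two bad words from the Kong trace still carry their intended offsets.
print(decode_imm7(0xFFFBE379), decode_imm7(0xFFFCDF78))    # -36 -28
```

Without the mask, the model reproduces the exact bad word from the minimal test case dump, and the offsets recovered from the two bad words in the Kong trace (-36 and -28) match the ADD 0107 -36 and ADD 0107 -28 lines in its IR. That supports the reading that only the upper bits of the word were clobbered while the register and offset fields were intact.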
I have also verified that, with the LuaJIT fix applied, Kong runs for more than 10 hours without hitting the SIGILL crash again.
By the way, this error affects all ARM64 platforms and is independent of the OS. On an EC2 ARM64 Linux instance, we run into the same error:
The fix has been submitted to LuaJIT upstream: https://github.com/LuaJIT/LuaJIT/pull/1028.