July 27, 2023

How to Fix a SIGILL Kong Crash on Arm64

Zhongwei Yao

Kong can crash on ARM64 platforms (machines with Apple M1/M2 chips or any other ARM64 hardware). The error message shows that the crash is triggered by the SIGILL signal, which means the process tried to execute an illegal instruction. The cause turns out to be a bug in the LuaJIT ARM64 JIT compiler. This post records how the bug was found and fixed.

How to reproduce the error

To reproduce the error, I created an Apple M1 instance running macOS 13.4 in AWS, because I don't have an Apple ARM64 laptop at hand. Kong Enterprise is required, as this error only happens with Vitals enabled.

We also need to enable JIT in kong/init.lua, because in the current master version (3.4) JIT is disabled to avoid this crash.

Since we know the error is caused by the JIT compiler (remember, disabling JIT avoids the crash), we can make reproduction easier by making JIT compilation in LuaJIT happen more often. One parameter controls how LuaJIT detects hot traces: hotloop. Its default value is 56, which means that once a loop (or call) has run more than 56 times, LuaJIT triggers JIT compilation for it. Setting the value to 1 therefore triggers JIT compilation much more frequently.
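To make the effect of the threshold concrete, here is a toy model in Python. It is not how LuaJIT implements hot counting; it only applies the rule stated above (a loop is compiled once it has run more than hotloop times). The function name and the loop names in the workload are hypothetical.

```python
def compiled_loops(iteration_counts, hotloop):
    """Return the loops that would be handed to the JIT compiler.

    Toy rule taken from the text: a loop becomes hot once it has run
    more than `hotloop` times. This is a model, not LuaJIT's real code.
    """
    return [name for name, runs in iteration_counts.items() if runs > hotloop]


# Hypothetical workload: loop name -> number of iterations executed.
workload = {"startup_loop": 3, "per_request_loop": 40, "stats_flush_loop": 500}

print(compiled_loops(workload, hotloop=56))  # default threshold
print(compiled_loops(workload, hotloop=1))   # threshold used for reproduction
```

With the default threshold only the longest-running loop is compiled; with hotloop set to 1, every loop in this workload goes through the JIT compiler, so a code generation bug has many more chances to show up.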

Here is the patch to enable JIT and tune the hotloop switch:

After applying the above change, build and run Kong with Vitals using the following command:

After running for several minutes, we find an error in error.log:

The worker process 17436 is killed by Signal 4 (a.k.a. SIGILL).

How to debug the error

Since the worker process is killed by SIGILL, we can use a debugger to capture the context of the error. On macOS, we use LLDB and attach it to the worker process. Because the crash takes several minutes to occur, we can look up the worker process PID in error.log before it crashes, attach LLDB to that PID with the following command, and wait for the crash.

After it crashes, we can get the crash context in LLDB:

As shown in error.log, the process crashes due to an illegal instruction. In the above case, the illegal instruction word is 0xfffbe79a.

The backtrace does not help much in identifying which code causes the error, because it only shows two frames: the JIT-compiled code frame (frame #0: 0x0000000104623e5c) and the LuaJIT function that calls that JIT-compiled code (frame #1: lj_vm_resume, which is implemented in assembly in the interpreter source vm_arm64.dasc).

Use the LuaJIT dump tool to find where the error happens

When JIT-compiled code crashes like this, the LuaJIT dump tool can help identify where the error happens. Here is the patch to enable dumping in LuaJIT in Kong:

This change dumps the bytecode, IR, and machine code. After applying it, rerun Kong, wait for it to crash, and then stop Kong. Stopping Kong keeps jit_dump.log from growing further; in my case, it is a 75 MB text log file. Then I searched jit_dump.log for a line similar to .long 0xfffcd399.
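Searching a log of that size by hand is tedious, so a small script can help. The sketch below assumes the dump format shown in this post: trace headers of the form "---- TRACE N start ..." and machine code lines where a word is printed as raw ".long 0x..." data. The function name and regular expressions are my own, not part of LuaJIT.

```python
import re

# Patterns assume the dump layout shown in this post.
TRACE_START = re.compile(r"^---- TRACE (\d+) start (.+)$")
RAW_WORD = re.compile(r"^([0-9a-f]+)\s+\.long\s+(0x[0-9a-f]+)")


def find_raw_words(lines):
    """Return (trace number, trace origin, address, word) for each .long line."""
    current = None
    hits = []
    for raw in lines:
        line = raw.strip()
        header = TRACE_START.match(line)
        if header:
            # Remember which trace the following machine code belongs to.
            current = (int(header.group(1)), header.group(2))
            continue
        word = RAW_WORD.match(line)
        if word and current is not None:
            hits.append((current[0], current[1], word.group(1), word.group(2)))
    return hits


sample = [
    "---- TRACE 88 start init.lua:733",
    "100bc9414  ucvtf d12, w25",
    "100bc9418  .long 0xfffbe379",
    "100bc941c  ucvtf d15, w24",
    "100bc9420  .long 0xfffcdf78",
    "---- TRACE 88 stop -> loop",
]
for hit in find_raw_words(sample):
    print(hit)
```

To run it against a real dump, pass an open file object instead of the sample list, for example `find_raw_words(open("jit_dump.log"))`. Not every .long line is necessarily a bug, so treat the hits as candidates to inspect.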

Here is what I found:

  • In Section 3 (the arm64 machine code section), there are two illegal instructions: 0xfffbe379 and 0xfffcdf78. This means that running this hot trace will crash with SIGILL.
  • In Section 1 (the bytecode section with source lines), there is a line with the comment "proxy_latency_max" (init.lua:757). This helps identify the Lua code that corresponds to the error. By searching for "proxy_latency_max", we find that the error comes from "vitals/init.lua".
  • From Section 2 (the IR section; IR means intermediate representation), we can guess that the illegal instruction is probably produced by the ADD and XLOAD code generation in the LuaJIT arm64 compiler backend. The reason is that the instruction ucvtf d12, w25 appears right above the illegal instruction 0xfffbe379. According to the Arm64 instruction documentation, ucvtf performs a number conversion, which matches the line 0126 num CONV 0125 num.u32 in the IR section, and the IR lines that follow it are ADD and XLOAD. This finding tells us where to start debugging in LuaJIT.
---- TRACE 88 start init.lua:733
Section 1: Bytecode with source line section start. 
0032  SUBVN   14  13   0  ; 1       (init.lua:734)
0033  TGETV   14   9  14       (init.lua:734)
... 31 not related lines are omitted to keep the doc smaller.
0092  TGETS   19  14  12  ; "proxy_latency_max"       (init.lua:757)
0000  . . FUNCC               ; ffi.meta.__index
0093  ISF         15       (init.lua:760)
0094  JMP     20 => 0104
0104  TGETS   20  14  13  ; "ulat_min"       (init.lua:760)
0000  . . FUNCC               ; ffi.meta.__index
0105  ISF         15       (init.lua:761)
0192  FORL    10 => 0032       (init.lua:733)
Section 1: Bytecode with source line section end.
---- TRACE 88 IR
Section 2: IR start.
0001    int SLOAD  #13   RI
0002 >  int LE     0001  +2147483646
... 98 not related lines are omitted to keep the doc smaller.
0100    nil ASTORE 0080  nil 
0101    nil ASTORE 0082  nil 
0102  + int ADD    0003  +1  
0103 >  int LE     0102  0001
0104 ------ LOOP ------------
0105    i64 CONV   0102
... 200 not related lines are omitted to keep the doc smaller.
0126    num CONV   0125  num.u32 -- suspect start
0127    p64 ADD    0107  -36     
0128    int XLOAD  0127  
0129    p64 ADD    0107  -32 
0130    u32 XLOAD  0129  
0131    num CONV   0130  num.u32
0132    p64 ADD    0107  -28 
0133    int XLOAD  0132  
0134    p64 ADD    0107  -24 
0135    u32 XLOAD  0134  
0136    num CONV   0135  num.u32 -- suspect end
0137    p64 ADD    0107  -20 
... 46 not related lines are omitted to keep the doc smaller.
0183  + int ADD    0102  +1  
0184 >  int LE     0183  0001
0185    int PHI    0102  0183
Section 2: IR end.
---- TRACE 88 mcode 992
Section 3: arm64 machine code start.
100bc910c  sub   sp, sp, #144
100bc9110  str   x19, [sp, #144]
... 55 not related lines are omitted to keep the doc smaller.
100bc937c  cmp   w28, w19
100bc9380  bgt   0x00bc950c	->5
100bc9384  ldr   x30, 0x00acc400
... 32 not related lines are omitted to keep the doc smaller.
100bc9408  ldur  w25, [x27, #-44]
100bc940c  ucvtf d13, w25
100bc9410  ldur  w25, [x27, #-40]
100bc9414  ucvtf d12, w25
100bc9418  .long 0xfffbe379 -- error instruction 1
100bc941c  ucvtf d15, w24
100bc9420  .long 0xfffcdf78 -- error instruction 2
100bc9424  ucvtf d11, w23
100bc9428  ldur  w23, [x27, #-20]
100bc942c  ucvtf d10, w23
100bc9430  ldur  w23, [x27, #-16]
... 43 not related lines are omitted to keep the doc smaller.
100bc94e0  cmp   w28, w19
100bc94e4  ble   0x00bc9384	->LOOP
100bc94e8  b     0x00bc9524	->11
Section 3: arm64 machine code end.
---- TRACE 88 stop -> loop

Create a minimal test case to reproduce the error

This step speeds up the debugging process a lot, because debugging with all of the Kong code is slow (it takes several minutes to reach the crash point) and complex. I extracted the relevant source from kong/vitals/init.lua and was able to create a minimal case that triggers the error. It is shown below and saved to the file fuse_test.lua:

Find the error in LuaJIT

With the above minimal test case, we can easily reproduce the error:

From the third finding in the previous section, we suspect an error in the ADD and XLOAD code generation in the LuaJIT ARM64 compiler backend. We can set breakpoints on the compiler backend function asm_ir() and on the XLOAD code generation, and check whether the illegal instruction is generated by this part of the code.

In the end, it turns out that the illegal instruction is generated on the instruction fusion path that starts in asm_fusexref(), which is part of the LuaJIT compiler backend.

Instruction fusion is an optimization that combines multiple operations into a single instruction. It belongs to the "instruction selection" part of standard compiler code generation. Depending on the underlying CPU architecture, fusion can produce more efficient code. For example, many CPUs have an MLA instruction, which performs a multiplication and an addition in a single instruction. If the compiler finds a matching sequence "multiply; add", it fuses the two into one MLA instruction, provided all conditions are met. Without this optimization, two instructions are generated instead of one.
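The idea can be sketched as a tiny peephole pass in Python. This is an illustration of the multiply/add example only; it is not how LuaJIT structures its backend, and the instruction format, the operand order of the fused instruction, and all names below are hypothetical.

```python
from typing import NamedTuple, Tuple


class Ins(NamedTuple):
    op: str
    dst: str
    srcs: Tuple[str, ...]


def fuse_mul_add(code):
    """Fuse `mul t, a, b; add d, t, c` into `mla d, a, b, c` where it is safe."""
    fused = []
    i = 0
    while i < len(code):
        cur = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if (nxt is not None and cur.op == "mul" and nxt.op == "add"
                and nxt.srcs.count(cur.dst) == 1):
            # Condition: the product must not be needed by any later instruction.
            used_later = any(cur.dst in ins.srcs for ins in code[i + 2:])
            if not used_later:
                acc = nxt.srcs[0] if nxt.srcs[1] == cur.dst else nxt.srcs[1]
                fused.append(Ins("mla", nxt.dst, (cur.srcs[0], cur.srcs[1], acc)))
                i += 2
                continue
        fused.append(cur)
        i += 1
    return fused


program = [Ins("mul", "t0", ("a", "b")), Ins("add", "r", ("t0", "c"))]
print(fuse_mul_add(program))
```

The check on later uses is an example of the "all conditions are met" part: if the intermediate product is still needed elsewhere, the pass keeps both instructions.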

Here is the dump of the minimal test case. The ADD and XLOAD are fused because the ARM64 architecture has an instruction that performs both operations together (LDR with an offset). LuaJIT can also fuse in more than one step: in this case, two ADDs are fused with their XLOADs, which yields two LDR instructions, and those two LDR instructions are then fused into a single LDP instruction. This last step is implemented in emit_lso().

---- TRACE 2 start fuse_test.lua:17
0022  SUBVN    9   8   0  ; 1       (fuse_test.lua:18)
0023  TGETV    9   4   9       (fuse_test.lua:18)
0000  . . FUNCC               ; ffi.meta.__index
0024  TGETS   10   9   9  ; "m1"       (fuse_test.lua:19)
0000  . . FUNCC               ; ffi.meta.__index
0025  TGETS   11   9  10  ; "m2"       (fuse_test.lua:20)
0000  . . FUNCC               ; ffi.meta.__index
0026  TNEW    12   3       (fuse_test.lua:22)
0027  TSETB   10  12   1       (fuse_test.lua:23)
0028  TSETB   11  12   2       (fuse_test.lua:24)
0029  TSETV   12   2   8       (fuse_test.lua:25)
0030  FORL     5 => 0022       (fuse_test.lua:17)
---- TRACE 2 IR
....              SNAP   #0   [ ---- ---- ]
0001 x28      int SLOAD  #7    I
... 32 not related lines are omitted to keep the doc smaller.
0033 ------------ LOOP ------------
0034 x27      i64 CONV   0031
0035          i64 BSHL   0034  +3  
0036 x27      p64 ADD    0035  0006
0037          p64 ADD    0036  -8  -- fuse with XLOAD
0038 x26      int XLOAD  0037  
0039          p64 ADD    0036  -4  -- fuse with XLOAD
0040 x27      int XLOAD  0039      
0041 x0    >  tab TNEW   #3    #0  
... 11 not related lines are omitted to keep the doc smaller.
0052       >  int LE     0051  +10 
0053 x28      int PHI    0031  0051
---- TRACE 2 mcode 396
... not related lines are omitted to keep the doc smaller.
104db7d0c  bgt   0x04db7db0	->2
104db7d10  ldr   x30, 0x04d68400
104db7d14  ldr   x1, 0x04d68408
104db7d18  cmp   x30, x1
104db7d1c  bls   0x04db7d34
104db7d20  mov   x1, #1
104db7d24  mov   x0, x22
104db7d28  bl    0x04b685b0	->lj_gc_step_jit
104db7d2c  orr   x30, x30, x30
104db7d30  cbnz  w0, 0x04db7db4	->3
104db7d34  mov   x1, #3
104db7d38  ldr   x0, 0x04d68560
104db7d3c  mov   x27, x28
104db7d40  add   x27, x25, x27, lsl #3
104db7d44  .long 0xffff6f7a      -- fuse error generates an ill instruction
104db7d48  bl    0x04b76e60	->lj_tab_new1
... not related lines are omitted to keep the doc smaller.
104db7d98  b     0x04db7dbc	->5
---- TRACE 2 stop -> loop

zsh: illegal hardware instruction  luajit -Ohotloop=1 -jdump=tbimsr fuse_test.lua

The error is in emit_lso() and shows up when the offset is negative. In our case, the offset comes from the 0037 p64 ADD 0036 -8 instruction, so it is -8. A single-line change fixes the issue:

In the LDP instruction, the offset field is 7 bits wide, so the value must be masked with 0x7f. Otherwise, when ofsm is negative (-8 = 0xfffffff8), its sign bits spill over the rest of the instruction word, and with the existing implementation the whole instruction becomes 0xffff6f7a, which is an illegal instruction. After applying the above fix in LuaJIT, our minimal case runs successfully, with the following code generated:

102ecfd40  ldr   x0, [x22, #368]
102ecfd44  mov   x27, x28
102ecfd48  add   x27, x25, x27, lsl #3
102ecfd4c  ldp   w26, w27, [x27, #-8] -- fixed.
102ecfd50  bl    0x0082fce0	->lj_tab_new1
102ecfd54  add   x30, x24, w26, uxtw
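The bit arithmetic can be checked with a short Python sketch. It is not the LuaJIT source; it assumes the encoder ORs the scaled offset, shifted left by 15, into the opcode word, and it uses 0x29400000 as the base opcode of the 32-bit signed-offset LDP form from the Arm architecture manual. The function name is hypothetical, and the register numbers and offsets used for the two Kong words are my own reading of the dumps above.

```python
LDP_W_SIGNED_OFFSET = 0x29400000  # 32-bit LDP, signed-offset form
SCALE = 2                         # 32-bit registers: byte offset is stored divided by 4


def encode_ldp_w(rt, rt2, rn, byte_offset, mask_imm7):
    """Encode `ldp w<rt>, w<rt2>, [x<rn>, #byte_offset]` as a 32-bit word."""
    imm = byte_offset >> SCALE   # arithmetic shift, like C on a signed int
    if mask_imm7:
        imm &= 0x7F              # keep only the 7 bits the field can hold
    word = LDP_W_SIGNED_OFFSET | (imm << 15) | (rt2 << 10) | (rn << 5) | rt
    return word & 0xFFFFFFFF     # truncate to 32 bits, as a uint32_t would


# Minimal test case: ldp w26, w27, [x27, #-8]
print(hex(encode_ldp_w(26, 27, 27, -8, mask_imm7=False)))  # 0xffff6f7a
print(hex(encode_ldp_w(26, 27, 27, -8, mask_imm7=True)))   # 0x297f6f7a

# Kong trace, assuming offsets -36 and -28 from the IR section.
print(hex(encode_ldp_w(25, 24, 27, -36, mask_imm7=False)))  # 0xfffbe379
print(hex(encode_ldp_w(24, 23, 27, -28, mask_imm7=False)))  # 0xfffcdf78
```

Without the mask, the sign extension of the negative offset fills every bit above the offset field with ones, which reproduces the illegal words seen in both dumps. With the mask, only bits 15 to 21 carry the offset and the opcode bits survive.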

I have also verified that, with the LuaJIT fix applied, Kong runs for more than 10 hours without hitting the SIGILL crash again.

By the way, this error affects all ARM64 platforms and is OS independent. On an EC2 ARM64 Linux instance, we run into the same error:

The fix has been submitted to LuaJIT upstream.