Troubleshooting an Intermittent Failure in CI Tests on ARM64

October 4, 2023

6 min read

Zhongwei Yao

The Kong Gateway CI was failing intermittently (about once every 100 runs) on the ARM64 platform with a strange error: “attempt to perform arithmetic on local 'i' (a function value)”. The variable i in the context is an integer but at runtime, it was sometimes a function value.

This is caused by an error in the LuaJIT ARM64 JIT compiler. We’ve investigated and found the issue and the fix is merged in the LuaJIT upstream. This document describes how we fixed the error.

Narrow down the scope: Is the error in Lua code or elsewhere?

The error when encountered was on line 3 in Kong's db init.lua code file.

Looking at the above code, the value i should be an integer, but the error says i is a function value. This error was only ever on ARM64 runners used in Kong CI. The first troubleshooting step was to check whether this error was caused by Lua code or other parts of Gateway — like the Nginx, LuaJIT, etc. This can be determined by disabling the JIT compilation for the function that contains the above code with: jit.off(new). After disabling JIT for the above function, the error was no longer reproducible. That means the error was caused by the underlying LuaJIT machine code generation.

Get deterministic results from random failure

One hurdle to debugging the failure was the random nature of this error. It was seen in about one out of every 100 CI runs. A minimal reproducible test case was needed to help speed up fixing the issue. The tool we used is called reverse debugger: Mozilla rr. If we use the analogy of a core dump being akin to taking a picture of the program when it fails, the record file generated by rr is like taking a video of the program when it fails. Like a real video file, you can play it back and forth, with the record file from rr, you can run the failure program back and forth and always get the same result. This helps tremendously when the failure happens randomly.The CI runs our unit tests with the busted framework and is invoked normally like busted spec/02-integration/03-db/01-db_spec.lua

The above commands run the test file again and again until it fails. And at each run, it is recorded by rr (by the –rr switch in resty) and the LuaJIT dump (by the -j dump switch and the output is redirected to a file) logging is also turned on to help identify which JIT trace caused the failure. Other switches like -c , --http-conf and environment variables are needed to run the test itself.

Debug the error with rr replay

Once the above command fails, we will have a file generated by rr. It usually is saved at ~/.local/share/rr/. And to replay this error, just run:

It will run in gdb like interface:

And we can run the “continue” command in gdb to run to the error:

In addition to running the “continue” command, we can also run the “reverse-continue” command to play back forwards, which will stop at any breakpoint. In this case, after searching for the error message in the LuaJIT source, we set a breakpoint at lj_err_optype and will give us more information about the error details:

And at this point, we can get information by printing tname and o and o’s value in this lj_err_optype function.

This confirms the value under pointer o is a function value already. It is an error to perform arithmetic operations on a function value in Lua language.

The question now became where this value was overridden from a number value to a function value. This question could be answered in various ways. One approach is to set up a watch point in the gdb to the memory address that o is pointing to and do the reverse-continue.

In our case, there is a bug in the rr tool itself, the reverse-continue does not work with the watch point. The original rr paper did not support ARM. The porting to ARM64 was only done recently after the ARM64 LSE platform became available on the market, which rr depends on. We were able to use the tool anyway by running the “next” command again and again to check when the value is overridden. We set up several breakpoints to speed up this process:

lj_trace_exit, this function does the transformation from JITed code to interpreter in LuaJIT. It is an important point that marks whether the error happens in JITed code or in the interpreter.
TRACE_939, this the JITed code trace that contains the line init.lua:1441 in the error message. We can get this information from the JIT dump log file.

After setting up these breakpoints, we can run “continue” or “reverse-continue” in the gdb and examine the memory address that contains the wrong value regularly to search the code that writes the wrong value into that memory address.

Using this technique we found the location and it is in the following JITed code:

The stp x30, x0, [x19, #488] is the line of ARM64 assembly code that corrupts the memory. To fix this error, we need to deep dive into the LuaJIT JIT compiler and see what happens during the code generation.

Fix the error in the code generation

After searching the LuaJIT source code, there is only one place that can generate the “stp” instruction, the emit_lso function.

We can set a breakpoint on this function and check how the “stp” instruction is generated.

Basically, this function is doing two things:

Generate regular load/store instruction for the load/store Intermediate Representation in the LuaJIT compiler.
Try to merge consecutive load/store instructions when some conditions meet, like if the 2 stores have the same index register and the offset is adjacent, which is also called peephole optimization in compiler optimization.

In our failure case, this function has merged the following 2 store instructions:

This is not correct!

Because the offset 488 and -16 are not adjacent. The next store position of 488 is:

The code treats real offset -16 the same as 496, which is wrong. The following explains why this happens.

And according to ARM64 instruction definition, the stur instruction is encoded like:

The offset (like -16) is stored in field imm9, which is short for immediate integer with 9 bits long. Imm9 can be either signed or unsigned depending how you interpret it. And 496 in the imm9 unsigned integer format is 1 1111 000. -16 in the imm9 signed integer format is also 1 1111 000. They are the same! This leads to passing the check at this line in above emit_lso function:

This causes the generation of the wrong instruction: stp x30, x0, [x19, #488]. To fix this error, we can add additional checks for the offset range before merging as we did in the submitted PR to LuaJIT upstream.

In retrospect, peephole optimization errors should always be triggered if there are such pairs of instructions. So why does it happen intermittently in our CI, which always has the same input in every run?

The reason is Lua language itself has non-deterministic behaviors like the order of table iteration is undefined. And LuaJIT itself also has non-deterministic behaviors in the implementation, like the hot loop count, side trace penalty counter, etc. All of them will cause different JIT compilation behaviors even with the same input, especially if the test is large with thousands of JIT traces (in the above case, there are about 1500 JIT traces). These non-deterministic behaviors lead to intermittent failure.

Topics:API Gateway

API Testing

API Development