The above commands run the test file again and again until it fails. And at each run, it is recorded by rr (by the –rr switch in resty) and the LuaJIT dump (by the -j dump switch and the output is redirected to a file) logging is also turned on to help identify which JIT trace caused the failure. Other switches like -c , --http-conf and environment variables are needed to run the test itself.
Debug the error with rr replay
Once the above command fails, we will have a file generated by rr. It usually is saved at ~/.local/share/rr/. And to replay this error, just run:
It will run in gdb like interface:
And we can run the “continue” command in gdb to run to the error:
In addition to running the “continue” command, we can also run the “reverse-continue” command to play back forwards, which will stop at any breakpoint. In this case, after searching for the error message in the LuaJIT source, we set a breakpoint at lj_err_optype and will give us more information about the error details:
And at this point, we can get information by printing tname and o and o’s value in this lj_err_optype function.
This confirms the value under pointer o is a function value already. It is an error to perform arithmetic operations on a function value in Lua language.
The question now became where this value was overridden from a number value to a function value. This question could be answered in various ways. One approach is to set up a watch point in the gdb to the memory address that o is pointing to and do the reverse-continue.
In our case, there is a bug in the rr tool itself, the reverse-continue does not work with the watch point. The original rr paper did not support ARM. The porting to ARM64 was only done recently after the ARM64 LSE platform became available on the market, which rr depends on. We were able to use the tool anyway by running the “next” command again and again to check when the value is overridden. We set up several breakpoints to speed up this process:
lj_trace_exit, this function does the transformation from JITed code to interpreter in LuaJIT. It is an important point that marks whether the error happens in JITed code or in the interpreter.
TRACE_939, this the JITed code trace that contains the line init.lua:1441 in the error message. We can get this information from the JIT dump log file.
After setting up these breakpoints, we can run “continue” or “reverse-continue” in the gdb and examine the memory address that contains the wrong value regularly to search the code that writes the wrong value into that memory address.
Using this technique we found the location and it is in the following JITed code: