By on August 7, 2020

A Very Tiny Bug

Some weeks ago, a Kong user reported an error they had on ARM64 servers. When he confirmed that the same issue appeared on the latest version of Kong but not when trying an Intel server, I was “pinged.”

At that time, I was trying to isolate a bug that also affected ARM64 servers and not Intel ones. We already knew the problem was on the machine code emitting part of the Lua JIT compiler, but the exact case was hard to replicate. Fortunately, the bug seemed to appear on a single point in the code, and it wasn’t particularly performance-sensitive, so it was simple to disable JIT compilation at that point. In fact, the bug seemed to appear when some JIT strategies weren’t applicable and the VM was reverting to the interpreter. Forcing it to not even try to compile didn’t make anything measurably slower.

The steps provided by the user to reproduce this new bug involved running Kong in Kubernetes and then modifying the number of replicas to specific values while it was running. This is, of course, a supported scenario but far from my personal comfort zone.

Hoping to avoid a long exercise where 90% of the difficulty would be to replicate the user’s setup, I asked the user to try the same workaround. I certainly expected that it was just a weird manifestation of the same bug.

The user quickly answered that it didn’t make any difference, but he added two really good nuggets of information. The first was that he got a core dump reported in the logs:

That alone proved it was a different bug and certainly one that was much more straight-forward.

The second hint was that he identified the Lua code line executing at the core dump:

He was somewhat surprised to see that. Since the table.move() function was added to Lua on the 5.3 version, it wasn’t available on the 5.1 version, which is the one LuaJIT targets. In fact, it’s one of several functions and features from Lua versions 5.2 and 5.3 that have been picked to LuaJIT, since they don’t have any conflict with the previous version’s behavior. The specific table.move() function isn’t too popular because it’s easy to write the same functionality in plain Lua, and in the case of LuaJIT, with the same performance as the built-in. That was another thing our user also tried: replacing the function with “a simple loop assignment,” which also solved the issue.

===

With the exact point of failure identified, I tried to see what was actually happening. This line appears in the ring_address:dropIndices() function, which removes entries from the balancer “ring of addresses.” It has several codepaths to identify an efficient way to shrink a Lua “array,” and one of them uses the table.move() to copy a range of items to a new, empty table. This matches the original reported case, where a number of replicas is reduced from five to three. The exact numeric requirements wasn’t related to the actual bug — it’s just the quantities needed to actually use this specific line of code.

The bug itself was very easy to trigger — just about any case where the destination table is empty fails immediately on ARM64. This simple one-liner is enough to crash consistently:

Analyzing the resulting core dump, it was clear the crash happened in the interpreter implementation of the TSETR bytecode instruction, which is a “raw” variant of the “set table value” operation, equivalent to the rawset() built-in function.

But if there’s a problem with setting a table value, how does anything work?

Well, as it happens, the LuaJIT bytecode compiler never generates a TSETR instruction.

Huh?

A “normal” table set operation like t[k] = v uses the generic TSETV instruction, or TSETS if the key is a string constant like the common point.x = pos (which is syntax sugar for point['x'] = pos) or TSETB if the key is a small (byte-sized) integer.

Ah, but this was a raw table set!

Yes, but rawset(t, k, v) is a function call in bytecode. (If the JIT generates machine code, it does replace the call with some inline code, but then we’re no longer talking about bytecode instructions).

====>> — detour to inline built-in library functions

I mentioned above that it’s easy and efficient to implement table.move() in pure Lua code. In a way, that’s exactly what the LuaJIT implementation does. Instead of being implemented on C or assembly code, there’s a small piece of Lua code stored in the executable as a bunch of bytecode instructions. When the VM first loads some Lua source that calls this function, the resultant bytecode isn’t a call to an internal function. Instead, this piece of bytecode is inserted in place. Then, it can either be executed by the interpreter or compiled into machine code if it’s a hot path.

But if it’s just Lua code, why does it fail?

One of the weird things about LuaJIT is its convoluted compilation process, best explained by Peter Cawley in the Lua Workshop 2016 in San Francisco.

Lets look a the table.move() implementation:

Several strange things here:

  1. This is a C code file.

  2. The Lua code is within a C comment.

  3. The LJLIB_LUA() macro (defined here) expands to… nothing.

Effectively, for the C compiler, this is just white space. But during the build process, the src/host/genlibbc.lua script searches the code for the LJLIB_LUA() markers, loads the Lua code in the comment immediately after it and stores the bytecode as a C array of numbers in src/host/buildvm_lib.c

In short, it uses a Lua script to compile a C file that contains Lua code into a block of data to be injected into the LuaJIT VM, and from there, into Lua code that calls this function.

But before storing the bytecode, the fixup_dump() function sprinkles some extra magic over it:

  1. Replaces the CHECK_xx()function calls with ISNUM and ISTYPE bytecode instructions. There’s actually no CHECK_int() or CHECK_tab() function defined anywhere — they’re just placeholders for the bytecodes.

  2. Replaces the TGETV and TSETV instructions with TGETR and TSETR, respectively, if the targets have already been type-checked as tables.

  3. Replaces the ITERC generic iterator instruction with the specific table iterator ITERN when appropriate.

Aha! That’s where TSETR comes from!

These built-in bytecode chunks can’t have any dependencies, so instead of rawset(), they get TSETR. Incidentally, none of the other LJLIB_LUA() functions uses this instruction, so the only way to hit this bug is using table.move()

<<====

Back to the bug….

The LuaJIT interpreter is written in assembly language. That means there’s a different implementation for each processor architecture supported. It’s very complex to understand as a whole, but most bytecode instructions are very short, even if some similar instructions reuse code by a complex and very ad-hoc knot of jumps. Yes, the archetypal goto-heavy code that our ancestors left behind when “structured programming” became the next fad… alive and well in hyper-optimized assembly language.

For the ARM64 architecture, the code resides in the src/vm_arm64.dasc file. The TSETR instruction implementation starts at line 3165. The first bunch of code is mostly the same as the other TSETx instructions — loading some internal table structure fields into CPU registers and checking if the table can be modified at this moment without interfering with the GC incremental mark phase (that’s what the ->marked field and isblack() comments are about).

After those checks, if the key is an integer in the currently allocated “array part” of the table, it’s set directly via some simple pointer arithmetic (well, it doesn’t look that simple in assembly language, but it’s the kind of things modern CPUs love to do). If the key is integer but outside the array part, it calls the C function lj_tab_setinth() to put in the “hash part.”

Now, calling C from assembly is sometimes considered a “slow call” (by assembly language standards). Not because C is slow, but because the C function calling standards require some preparations to move parameters and results in and out. In LuaJIT code, some of that overhead is reduced with several tricks like:

  • Try to make it a “tail call,” a.k.a. a straight jump. When the “called” function returns, it will go straight to the previous caller. The job is reduced to adapting the previous call frame into the next one, which suggests the next point…

  • Standardize function signatures, or at least try to keep common parameters in the same positions. That allows the calling code to just leave some parameters where they’re already.

  • Put the most common but seldom-modified parameters (i.e., important pointers like L or G) into the first positions, which means they’re passed via CPU registers instead of the stack. Then try to keep these registers with the same meaning, reducing even more the need to move them between calls.

This means that the function preparation code is very specific and optimized differently for each case. Also, the last point (keeping the argument in the same register during actual use) is hard to maintain when the caller function code evolves.

In the TSETR instruction implementation, the call to lj_tab_setinth() is on a separate chunk of code called vmeta_tsetr. The signature of lj_tab_setinth() has the L pointer as the first parameter, represented by the register x0, or the macro CARG1, but this register wasn’t set. Reading the main TSETR code, we see the CARG1 register was used for the pointer to the array part of the table. Since we were moving into a new, empty table, it was a NULL pointer…uh oh.

Just copying the L pointer (which is always kept around in register x23 and the typed macro L conveniently expands to that) to the CARG1 (x0) register fixes the issue. The PR was posted, and it was quickly accepted, with the comment that this same issue appears in ARM32 and PowerPC architectures.

This was actually a very shallow bug, where a given codepath failed consistently. The only reason it wasn’t detected before is that the affected function isn’t popular and the few existing tests seem to move entries within an already populated table, completely missing the case where it calls lj_tab_setinth().

The good news is that with the growing popularity of non-Intel architectures (and ARM64 in particular), these kind of bugs are being shaken out. Personally, I’m happy that these other architectures are in general more regular and have plenty of registers, making for more readable assembly language code.