Blog
  • AI Gateway
  • AI Security
  • AIOps
  • API Security
  • API Gateway
|
    • API Management
    • API Development
    • API Design
    • Automation
    • Service Mesh
    • Insomnia
    • View All Blogs
  1. Home
  2. Blog
  3. Engineering
  4. Troubleshooting an Intermittent Failure in CI Tests on ARM64
Engineering
October 4, 2023
6 min read

Troubleshooting an Intermittent Failure in CI Tests on ARM64

Zhongwei Yao

The Kong Gateway CI was failing intermittently (about once every 100 runs) on the ARM64 platform with a strange error: “attempt to perform arithmetic on local 'i' (a function value)”. The variable i in the context is an integer but at runtime, it was sometimes a function value.

This is caused by an error in the LuaJIT ARM64 JIT compiler. We’ve investigated and found the issue and the fix is merged in the LuaJIT upstream. This document describes how we fixed the error.

Narrow down the scope: Is the error in Lua code or elsewhere?

The error when encountered was on line 3 in Kong's db init.lua code file.

Looking at the above code, the value i should be an integer, but the error says i is a function value. This error was only ever on ARM64 runners used in Kong CI. The first troubleshooting step was to check whether this error was caused by Lua code or other parts of Gateway — like the Nginx, LuaJIT, etc. This can be determined by disabling the JIT compilation for the function that contains the above code with: jit.off(new). After disabling JIT for the above function, the error was no longer reproducible. That means the error was caused by the underlying LuaJIT machine code generation.

Get deterministic results from random failure

One hurdle to debugging the failure was the random nature of this error. It was seen in about one out of every 100 CI runs. A minimal reproducible test case was needed to help speed up fixing the issue. The tool we used is called reverse debugger: Mozilla rr. If we use the analogy of a core dump being akin to taking a picture of the program when it fails, the record file generated by rr is like taking a video of the program when it fails. Like a real video file, you can play it back and forth, with the record file from rr, you can run the failure program back and forth and always get the same result. This helps tremendously when the failure happens randomly.The CI runs our unit tests with the busted framework and is invoked normally like busted spec/02-integration/03-db/01-db_spec.lua

The above commands run the test file again and again until it fails. And at each run, it is recorded by rr (by the –rr switch in resty) and the LuaJIT dump (by the -j dump switch and the output is redirected to a file) logging is also turned on to help identify which JIT trace caused the failure. Other switches like -c , --http-conf and environment variables are needed to run the test itself.

Debug the error with rr replay

Once the above command fails, we will have a file generated by rr. It usually is saved at ~/.local/share/rr/. And to replay this error, just run:

It will run in gdb like interface:

And we can run the “continue” command in gdb to run to the error:

In addition to running the “continue” command, we can also run the “reverse-continue” command to play back forwards, which will stop at any breakpoint. In this case, after searching for the error message in the LuaJIT source, we set a breakpoint at lj_err_optype and will give us more information about the error details:

And at this point, we can get information by printing tname and o and o’s value in this lj_err_optype function.

This confirms the value under pointer o is a function value already. It is an error to perform arithmetic operations on a function value in Lua language. 

The question now became where this value was overridden from a number value to a function value. This question could be answered in various ways. One approach is to set up a watch point in the gdb to the memory address that o is pointing to and do the reverse-continue. 

In our case, there is a bug in the rr tool itself, the reverse-continue does not work with the watch point. The original rr paper did not support ARM. The porting to ARM64 was only done recently after the ARM64 LSE platform became available on the market, which rr depends on. We were able to use the tool anyway by running the “next” command again and again to check when the value is overridden. We set up several breakpoints to speed up this process: 

  1. lj_trace_exit, this function does the transformation from JITed code to interpreter in LuaJIT. It is an important point that marks whether the error happens in JITed code or in the interpreter.
  2. TRACE_939, this the JITed code trace that contains the line init.lua:1441 in the error message. We can get this information from the JIT dump log file.

After setting up these breakpoints, we can run “continue” or “reverse-continue” in the gdb and examine the memory address that contains the wrong value regularly to search the code that writes the wrong value into that memory address.

Using this technique we found the location and it is in the following JITed code:

The stp x30, x0, [x19, #488] is the line of ARM64 assembly code that corrupts the memory. To fix this error, we need to deep dive into the LuaJIT JIT compiler and see what happens during the code generation.

Fix the error in the code generation

After searching the LuaJIT source code, there is only one place that can generate the “stp” instruction, the emit_lso function.

We can set a breakpoint on this function and check how the “stp” instruction is generated.

Basically, this function is doing two things:

  1. Generate regular load/store instruction for the load/store Intermediate Representation in the LuaJIT compiler.
  2. Try to merge consecutive load/store instructions when some conditions meet, like if the 2 stores have the same index register and the offset is adjacent, which is also called peephole optimization in compiler optimization.

In our failure case, this function has merged the following 2 store instructions:

This is not correct! 

Because the offset 488 and -16 are not adjacent. The next store position of 488 is:


The code treats real offset -16 the same as 496, which is wrong. The following explains why this happens.

And according to ARM64 instruction definition, the stur instruction is encoded like:

The offset (like -16) is stored in field imm9, which is short for immediate integer with 9 bits long. Imm9 can be either signed or unsigned depending how you interpret it. And 496 in the imm9 unsigned integer format is 1 1111 000. -16 in the imm9 signed integer format is also 1 1111 000. They are the same! This leads to passing the check at this line in above emit_lso function:

This causes the generation of the wrong instruction: stp x30, x0, [x19, #488]. To fix this error, we can add additional checks for the offset range before merging as we did in the submitted PR to LuaJIT upstream.

In retrospect, peephole optimization errors should always be triggered if there are such pairs of instructions. So why does it happen intermittently in our CI, which always has the same input in every run? 

The reason is Lua language itself has non-deterministic behaviors like the order of table iteration is undefined. And LuaJIT itself also has non-deterministic behaviors in the implementation, like the hot loop count, side trace penalty counter, etc. All of them will cause different JIT compilation behaviors even with the same input, especially if the test is large with thousands of JIT traces (in the above case, there are about 1500 JIT traces). These non-deterministic behaviors lead to intermittent failure.

API GatewayAPI TestingAPI Development

More on this topic

Videos

How to Migrate Collections Into Insomnia

Videos

API Testing with Insomnia’s Collection Runner

See Kong in action

Accelerate deployments, reduce vulnerabilities, and gain real-time visibility. 

Get a Demo
Topics
API GatewayAPI TestingAPI Development
Share on Social
Zhongwei Yao

Recommended posts

Guide to API Testing: Understanding the Basics

Kong Logo
EngineeringSeptember 1, 2025

Key Takeaways API testing is crucial for ensuring the reliability, security, and performance of modern applications. Different types of testing, such as functional, security, performance, and integration testing, should be employed to cover all aspe

Adam Bauman

6 Reasons Why Kong Insomnia Is Developers' Preferred API Client

Kong Logo
EngineeringAugust 8, 2025

So, what exactly is Kong Insomnia? Kong Insomnia is your all-in-one platform for designing, testing, debugging, and shipping APIs at speed. Built for developers who need power without bloat, Insomnia helps you move fast whether you’re working solo,

Juhi Singh

Using Kong Gateway to Adapt SOAP Services to the JSON World

Kong Logo
EngineeringSeptember 6, 2023

While JSON-based APIs are ubiquitous in the API-centric world of today, many industries adapted internet-based protocols for automated information exchange way before REST and JSON became popular. One attempt to establish a standardized protocol sui

Hans Hübner

Migration Options for IBM Cloud API Gateway Customers

Kong Logo
EngineeringDecember 14, 2022

IBM recently announced the deprecation of its Cloud API Gateway, a service used to create and manage APIs by placing a gateway in front of existing IBM Cloud endpoints. With this move, IBM Cloud Functions and IBM Cloud Foundry are no longer able to

Syed Mahmood

Optimize Your API Gateway with Chaos Engineering

Kong Logo
EngineeringAugust 10, 2022

As engineers and architects, we automatically build resilience into platforms as far as possible. But what about the unknown failures? What about the unknown behavior of your platform? The philosopher, Socrates, once said "You don’t know what you do

Andrew Kew

New Storage Engine for Kong Hybrid and DB-less Deployments

Kong Logo
EngineeringMarch 9, 2022

We understand that our customers need to deploy Kong in a variety of environments and with different deployment mode needs. That is why two years ago, in Kong 1.1, we introduced DB-less mode, the ability to run Kong without the need of connecting to

Datong Sun

Developing a Kong Gateway Plugin With Go

Kong Logo
EngineeringApril 22, 2021

This tutorial shows you how to create a custom Kong Gateway plugin with Go programming language. The sample plugin I created adds an extra layer for security between consumers and producers. The way it works is it identifies consumers through a

Mert Simsek

Ready to see Kong in action?

Get a personalized walkthrough of Kong's platform tailored to your architecture, use cases, and scale requirements.

Get a Demo
Powering the API world

Increase developer productivity, security, and performance at scale with the unified platform for API management, AI gateways, service mesh, and ingress controller.

Sign up for Kong newsletter

    • Platform
    • Kong Konnect
    • Kong Gateway
    • Kong AI Gateway
    • Kong Insomnia
    • Developer Portal
    • Gateway Manager
    • Cloud Gateway
    • Get a Demo
    • Explore More
    • Open Banking API Solutions
    • API Governance Solutions
    • Istio API Gateway Integration
    • Kubernetes API Management
    • API Gateway: Build vs Buy
    • Kong vs Postman
    • Kong vs MuleSoft
    • Kong vs Apigee
    • Documentation
    • Kong Konnect Docs
    • Kong Gateway Docs
    • Kong Mesh Docs
    • Kong AI Gateway
    • Kong Insomnia Docs
    • Kong Plugin Hub
    • Open Source
    • Kong Gateway
    • Kuma
    • Insomnia
    • Kong Community
    • Company
    • About Kong
    • Customers
    • Careers
    • Press
    • Events
    • Contact
    • Pricing
  • Terms
  • Privacy
  • Trust and Compliance
  • © Kong Inc. 2026