Blog
  • AI Gateway
  • AI Security
  • AIOps
  • API Security
  • API Gateway
    • API Management
    • API Development
    • API Design
    • Automation
    • Service Mesh
    • Insomnia
    • View All Blogs
  1. Home
  2. Blog
  3. Engineering
  4. Troubleshooting an Intermittent Failure in CI Tests on ARM64
Engineering
October 4, 2023
6 min read

Troubleshooting an Intermittent Failure in CI Tests on ARM64

Zhongwei Yao
Topics
API GatewayAPI TestingAPI Development
Share on Social

More on this topic

eBooks

Maturity Model for API Management

eBooks

API Infrastructure: ESB versus API Gateway

See Kong in action

Accelerate deployments, reduce vulnerabilities, and gain real-time visibility. 

Get a Demo

The Kong Gateway CI was failing intermittently (about once every 100 runs) on the ARM64 platform with a strange error: “attempt to perform arithmetic on local 'i' (a function value)”. The variable i in the context is an integer but at runtime, it was sometimes a function value.

This is caused by an error in the LuaJIT ARM64 JIT compiler. We’ve investigated and found the issue and the fix is merged in the LuaJIT upstream. This document describes how we fixed the error.

Narrow down the scope: Is the error in Lua code or elsewhere?

The error when encountered was on line 3 in Kong's db init.lua code file.

Looking at the above code, the value i should be an integer, but the error says i is a function value. This error was only ever on ARM64 runners used in Kong CI. The first troubleshooting step was to check whether this error was caused by Lua code or other parts of Gateway — like the Nginx, LuaJIT, etc. This can be determined by disabling the JIT compilation for the function that contains the above code with: jit.off(new). After disabling JIT for the above function, the error was no longer reproducible. That means the error was caused by the underlying LuaJIT machine code generation.

Get deterministic results from random failure

One hurdle to debugging the failure was the random nature of this error. It was seen in about one out of every 100 CI runs. A minimal reproducible test case was needed to help speed up fixing the issue. The tool we used is called reverse debugger: Mozilla rr. If we use the analogy of a core dump being akin to taking a picture of the program when it fails, the record file generated by rr is like taking a video of the program when it fails. Like a real video file, you can play it back and forth, with the record file from rr, you can run the failure program back and forth and always get the same result. This helps tremendously when the failure happens randomly.The CI runs our unit tests with the busted framework and is invoked normally like busted spec/02-integration/03-db/01-db_spec.lua

The above commands run the test file again and again until it fails. And at each run, it is recorded by rr (by the –rr switch in resty) and the LuaJIT dump (by the -j dump switch and the output is redirected to a file) logging is also turned on to help identify which JIT trace caused the failure. Other switches like -c , --http-conf and environment variables are needed to run the test itself.

Debug the error with rr replay

Once the above command fails, we will have a file generated by rr. It usually is saved at ~/.local/share/rr/. And to replay this error, just run:

It will run in gdb like interface:

And we can run the “continue” command in gdb to run to the error:

In addition to running the “continue” command, we can also run the “reverse-continue” command to play back forwards, which will stop at any breakpoint. In this case, after searching for the error message in the LuaJIT source, we set a breakpoint at lj_err_optype and will give us more information about the error details:

And at this point, we can get information by printing tname and o and o’s value in this lj_err_optype function.

This confirms the value under pointer o is a function value already. It is an error to perform arithmetic operations on a function value in Lua language. 

The question now became where this value was overridden from a number value to a function value. This question could be answered in various ways. One approach is to set up a watch point in the gdb to the memory address that o is pointing to and do the reverse-continue. 

In our case, there is a bug in the rr tool itself, the reverse-continue does not work with the watch point. The original rr paper did not support ARM. The porting to ARM64 was only done recently after the ARM64 LSE platform became available on the market, which rr depends on. We were able to use the tool anyway by running the “next” command again and again to check when the value is overridden. We set up several breakpoints to speed up this process: 

  1. lj_trace_exit, this function does the transformation from JITed code to interpreter in LuaJIT. It is an important point that marks whether the error happens in JITed code or in the interpreter.
  2. TRACE_939, this the JITed code trace that contains the line init.lua:1441 in the error message. We can get this information from the JIT dump log file.

After setting up these breakpoints, we can run “continue” or “reverse-continue” in the gdb and examine the memory address that contains the wrong value regularly to search the code that writes the wrong value into that memory address.

Using this technique we found the location and it is in the following JITed code:

The stp x30, x0, [x19, #488] is the line of ARM64 assembly code that corrupts the memory. To fix this error, we need to deep dive into the LuaJIT JIT compiler and see what happens during the code generation.

Fix the error in the code generation

After searching the LuaJIT source code, there is only one place that can generate the “stp” instruction, the emit_lso function.

We can set a breakpoint on this function and check how the “stp” instruction is generated.

Basically, this function is doing two things:

  1. Generate regular load/store instruction for the load/store Intermediate Representation in the LuaJIT compiler.
  2. Try to merge consecutive load/store instructions when some conditions meet, like if the 2 stores have the same index register and the offset is adjacent, which is also called peephole optimization in compiler optimization.

In our failure case, this function has merged the following 2 store instructions:

This is not correct! 

Because the offset 488 and -16 are not adjacent. The next store position of 488 is:


The code treats real offset -16 the same as 496, which is wrong. The following explains why this happens.

And according to ARM64 instruction definition, the stur instruction is encoded like:

The offset (like -16) is stored in field imm9, which is short for immediate integer with 9 bits long. Imm9 can be either signed or unsigned depending how you interpret it. And 496 in the imm9 unsigned integer format is 1 1111 000. -16 in the imm9 signed integer format is also 1 1111 000. They are the same! This leads to passing the check at this line in above emit_lso function:

This causes the generation of the wrong instruction: stp x30, x0, [x19, #488]. To fix this error, we can add additional checks for the offset range before merging as we did in the submitted PR to LuaJIT upstream.

In retrospect, peephole optimization errors should always be triggered if there are such pairs of instructions. So why does it happen intermittently in our CI, which always has the same input in every run? 

The reason is Lua language itself has non-deterministic behaviors like the order of table iteration is undefined. And LuaJIT itself also has non-deterministic behaviors in the implementation, like the hot loop count, side trace penalty counter, etc. All of them will cause different JIT compilation behaviors even with the same input, especially if the test is large with thousands of JIT traces (in the above case, there are about 1500 JIT traces). These non-deterministic behaviors lead to intermittent failure.

Topics
API GatewayAPI TestingAPI Development
Share on Social
Zhongwei Yao

Recommended posts

Unlocking API Analytics for Product Managers

Kong Logo
EngineeringSeptember 9, 2025

Meet Emily. She’s an API product manager at ACME, Inc., an ecommerce company that runs on dozens of APIs. One morning, her team lead asks a simple question: “Who’s our top API consumer, and which of your APIs are causing the most issues right now?”

Christian Heidenreich

How to Build a Multi-LLM AI Agent with Kong AI Gateway and LangGraph

Kong Logo
EngineeringJuly 31, 2025

In the last two parts of this series, we discussed How to Strengthen a ReAct AI Agent with Kong AI Gateway and How to Build a Single-LLM AI Agent with Kong AI Gateway and LangGraph . In this third and final part, we're going to evolve the AI Agen

Claudio Acquaviva

How to Build a Single LLM AI Agent with Kong AI Gateway and LangGraph

Kong Logo
EngineeringJuly 24, 2025

In my previous post, we discussed how we can implement a basic AI Agent with Kong AI Gateway. In part two of this series, we're going to review LangGraph fundamentals, rewrite the AI Agent and explore how Kong AI Gateway can be used to protect an LLM

Claudio Acquaviva

How to Strengthen a ReAct AI Agent with Kong AI Gateway

Kong Logo
EngineeringJuly 15, 2025

This is part one of a series exploring how Kong AI Gateway can be used in an AI Agent development with LangGraph. The series comprises three parts: Basic ReAct AI Agent with Kong AI Gateway Single LLM ReAct AI Agent with Kong AI Gateway and LangGr

Claudio Acquaviva

Build Your Own Internal RAG Agent with Kong AI Gateway

Kong Logo
EngineeringJuly 9, 2025

What Is RAG, and Why Should You Use It? RAG (Retrieval-Augmented Generation) is not a new concept in AI, and unsurprisingly, when talking to companies, everyone seems to have their own interpretation of how to implement it. So, let’s start with a r

Antoine Jacquemin

AI Gateway Benchmark: Kong AI Gateway, Portkey, and LiteLLM

Kong Logo
EngineeringJuly 7, 2025

In February 2024, Kong became the first API platform to launch a dedicated AI gateway, designed to bring production-grade performance, observability, and policy enforcement to GenAI workloads. At its core, Kong’s AI Gateway provides a universal API

Claudio Acquaviva

Scalable Architectures with Vue Micro Frontends: A Developer-Centric Approach

Kong Logo
EngineeringJanuary 9, 2024

In this article, which is based on my talk at VueConf Toronto 2023, we'll explore how to harness the power of Vue.js and micro frontends to create scalable, modular architectures that prioritize the developer experience. We'll unveil practical strate

Adam DeHaven

Ready to see Kong in action?

Get a personalized walkthrough of Kong's platform tailored to your architecture, use cases, and scale requirements.

Get a Demo
Powering the API world

Increase developer productivity, security, and performance at scale with the unified platform for API management, AI gateways, service mesh, and ingress controller.

Sign up for Kong newsletter

Platform
Kong KonnectKong GatewayKong AI GatewayKong InsomniaDeveloper PortalGateway ManagerCloud GatewayGet a Demo
Explore More
Open Banking API SolutionsAPI Governance SolutionsIstio API Gateway IntegrationKubernetes API ManagementAPI Gateway: Build vs BuyKong vs PostmanKong vs MuleSoftKong vs Apigee
Documentation
Kong Konnect DocsKong Gateway DocsKong Mesh DocsKong AI GatewayKong Insomnia DocsKong Plugin Hub
Open Source
Kong GatewayKumaInsomniaKong Community
Company
About KongCustomersCareersPressEventsContactPricing
  • Terms•
  • Privacy•
  • Trust and Compliance•
  • © Kong Inc. 2025