When seemingly random test failures happen, it can be tempting to simply re-run and move on. But as the following story shows, it pays to investigate the cause of a symptom rather than dismiss it. In our case, digging a little deeper let us catch a minor issue that could have become a major production problem for customers.
Issue Spotted
One day, I was working on migrating the Kong Manager CI from Jenkins to GitHub Actions. During the migration, I came across something outside the testing framework itself that prevented the tests from executing correctly.
The log emitted by the GitHub Actions runner showed the testing framework failing with errors like "connect ECONNREFUSED 127.0.0.1:8001," meaning Kong Manager could not communicate with the Kong instance while running the tests. Every test case in that particular job hit the same error, which suggested that something was preventing the Kong instance from starting correctly.
The first suspect was the database, which might not have been ready when the Kong instance was brought up. This is especially likely with Cassandra, which usually takes a long time to start up. To test this theory, I had the test runner sleep for about 30 seconds before starting the Kong container. However, the Kong container still exited after a short delay.
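The wait-and-retry idea can be sketched as a small readiness loop instead of a fixed sleep. The following Bash sketch assumes Cassandra's default CQL port 9042; the host, port, retry count, and container name are all illustrative:

```shell
#!/usr/bin/env bash
# Poll a TCP port until it accepts connections, up to a retry limit.
# Uses Bash's /dev/tcp pseudo-device, so no extra tools are required.
wait_for_port() {
  local host=$1 port=$2 retries=${3:-10}
  local i
  for ((i = 1; i <= retries; i++)); do
    # Opening the pseudo-device succeeds only once something is listening.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      echo "${host}:${port} is ready (attempt ${i}/${retries})"
      return 0
    fi
    echo "Waiting for ${host}:${port}... (${i}/${retries})"
    sleep 3
  done
  return 1
}

# Illustrative usage: block until Cassandra is reachable, then start Kong.
# wait_for_port 127.0.0.1 9042 20 && docker start kong
```

A loop like this also fails fast with a clear message when the database never comes up, instead of letting Kong crash on its own timeout.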
To understand why the Kong process was exiting, the docker inspect command was used to check the state of the container. As shown below, the container exited with exit code 132. Some quick Googling showed that this unusual exit code means the process was terminated upon receiving the SIGILL (illegal instruction) signal.
Waiting for Kong to start... (8/10)
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":132,"Error":"", ...}
Waiting for Kong to start... (9/10)
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":132,"Error":"", ...}
Waiting for Kong to start... (10/10)
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":132,"Error":"", ...}
Failed waiting for Kong to start
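Exit codes above 128 follow the common shell convention of 128 plus the signal number, so 132 decodes to signal 4, which is SIGILL. A quick sanity check (the docker inspect line in the comment assumes a container named kong, which is illustrative):

```shell
#!/usr/bin/env bash
# The container state above came from something like:
#   docker inspect --format '{{.State.Status}} / {{.State.ExitCode}}' kong
# Decode an exit code of the form 128+N into the signal that killed the process.
exit_code=132
signal=$((exit_code - 128))
echo "signal number: ${signal}"
kill -l "${signal}"   # prints the signal name for that number
```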
Into the Core
This uncommon signal means the CPU encountered invalid instructions while executing the program. To understand which exact instruction had raised this signal, the kernel message buffer was printed with the dmesg command:
traps: nginx[2502] trap invalid opcode ip:7f65b2d8d2a0 sp:7ffd2104d880 error:0 in libgmp.so.10.4.1[7f65b2d5f000+275000]
From the above message, we can tell that there were invalid instructions in the copy of libgmp.so shipped with this Kong internal nightly Docker image. GMP is a free library for arbitrary-precision arithmetic; it powers the JWT and OpenID Connect features in Kong. By subtracting the library's base address 0x7f65b2d5f000 from the address held by the instruction pointer (IP), 0x7f65b2d8d2a0, we can tell that the bad instruction resides at offset 0x2e2a0 in libgmp.so.10.4.1.
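That subtraction is plain pointer arithmetic, and the shell can check it directly:

```shell
# Faulting offset inside the library = instruction pointer - mapping base address
printf 'offset: %#x\n' $((0x7f65b2d8d2a0 - 0x7f65b2d5f000))
# prints "offset: 0x2e2a0"
```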
With the relative offset known, the shared library was disassembled with the objdump -D libgmp.so.10.4.1 command, and the following instruction showed up at that offset:
2e2a0:  66 4d 0f 38 f6 d2       adcx   r10, r10
The first thing that caught my eye was the unusual ADCX instruction at this location. Some quick Googling showed that ADCX belongs to ADX, Intel's arbitrary-precision arithmetic extension to the x86 instruction set, and that Broadwell was the first microarchitecture to support it. Could it be that the CPU used by the GitHub Actions runner was older and lacked support for ADX? To confirm which CPU model the runner used, the content of /proc/cpuinfo was printed out:
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
According to Intel's specifications, the runner assigned by GitHub was using an Intel Xeon E5-2673 v3, a processor launched in 2013 on the Haswell microarchitecture.
As previously mentioned, Broadwell was the first microarchitecture to support the ADX extension, so the processor used by the runner indeed could not understand the newer ADCX instruction; the reason the program crashed on it was now apparent. The same crash could just as easily hit our customers running Kong in production, since we cannot know ahead of time which CPUs they will use. Shipping binaries that depend on newer instruction set extensions is therefore unsuitable for Kong's use cases.
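On Linux there is a more direct check than looking up the microarchitecture: the kernel exposes the CPU's feature flags, and on x86 the adx flag appears only when the extension is supported. A small sketch:

```shell
#!/usr/bin/env bash
# /proc/cpuinfo lists one "flags" line per logical CPU on x86 Linux.
if grep -qw adx /proc/cpuinfo; then
  echo "CPU supports ADX"
else
  echo "CPU lacks ADX; ADCX/ADOX would raise SIGILL"
fi
```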
Now that we knew what the issue was, the next question was: how long had it existed in our codebase?
After inspecting the libgmp binaries shipped in the Docker images we had built over the previous months, it turned out that none of them contained instructions from the ADX extension; only the latest internal nightly preview did. It seemed likely that the build options or build environment for GMP had changed, causing ADX instructions to be introduced during compilation.
Indeed, my coworker mentioned that we had recently moved our build pipeline to a larger Amazon AWS instance type with a newer CPU model.
The libgmp build script tries to be smart: it detects the build machine's processor type to determine which processor extensions are available. This is usually desirable, as it lets the library use newer, faster CPU instructions to accelerate numerical computation, but it has the side effect of emitting instructions like ADCX that older CPUs do not support.
To address the issue, we need to explicitly tell the libgmp build script not to make optimization decisions based on the CPU model of the build machine. Luckily, the libgmp documentation contains a "Build Options" page that explains exactly how to do that.
In the end, we fixed the issue by adding a "--build=$(uname -m)-linux-gnu" option to specify a generic processor type, so that the compiled binary contains no instructions that older processors cannot execute. After this change, we disassembled the binary again and confirmed that the compiled GMP library no longer includes ADX instructions. The issue was fixed before the upcoming customer-facing release shipped.
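Putting the fix and its verification together, the build step looks roughly like this. The source-tree path and library filename are illustrative, and the grep simply counts ADX mnemonics in the disassembly:

```shell
#!/usr/bin/env bash
# Inside the GMP source tree (illustrative; not run here):
#   ./configure --build="$(uname -m)-linux-gnu"
#   make
# Verify the rebuilt library carries no ADX instructions:
#   objdump -d .libs/libgmp.so | grep -ciE 'adcx|adox'   # expect 0
# The --build triplet itself is just the machine name plus "-linux-gnu":
echo "$(uname -m)-linux-gnu"
```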
Conclusion
Sometimes, seeing should not mean believing. When random test failures like these happen, it is often tempting to just "hit the re-run button" and call it a day.
But as this example shows, it is worth digging deeper to understand the cause of the symptom. By doing so, we caught a subtle issue introduced by a seemingly innocent build machine instance type bump and stopped it from becoming a production issue for our customers before the next release.
This is just one example of the efforts we make at Kong to continuously monitor and improve the quality of the software we ship. The end result is what we always strive for: a more stable foundation that our customers can rely on.