By Makito Yu on October 31, 2022

Troubleshooting: A Journey Into the Core

When seemingly random test failures happen, it can be tempting to simply re-run the tests and move on. But, as the following example shows, it is important not to dismiss a failure without investigating what caused it. In our case, digging a little deeper let us catch a minor issue that could have become a major one for customers in production.

Issue Spotted

One day, I was working on migrating the Kong Manager CI from Jenkins to GitHub Actions. During the migration, I came across something outside the testing framework that prevented the tests from being executed correctly:

GET http://localhost:8001/admins connect ECONNREFUSED 127.0.0.1:8001

GET http://localhost:8001/workspaces connect ECONNREFUSED 127.0.0.1:8001

POST http://localhost:8001/rbac/roles connect ECONNREFUSED 127.0.0.1:8001

POST http://localhost:8001/default/admins connect ECONNREFUSED 127.0.0.1:8001

A Brief Inspection

The log emitted by the GitHub Actions runner showed that the testing framework was failing with errors like “connect ECONNREFUSED 127.0.0.1:8001.” This means that Kong Manager could not communicate with the Kong instance while the tests were running. Every test case in that particular job hit the same error, which suggested that something was preventing the Kong instance from starting correctly.

The first suspect was the database, which might not have been ready when the Kong instance was brought up. This is especially noticeable with Cassandra, which usually takes a long time to start up. To test this theory, I let the test runner sleep for about 30 seconds before starting the Kong container. However, the Kong container still exited after a short delay.

To understand why the Kong process was exiting, I used the docker inspect command to check the state of the container. As shown below, the container exited with exit code 132. Some quick Googling showed that this unusual exit code means the process was terminated upon receiving the SIGILL (illegal instruction) signal: by convention, a process killed by a signal is reported with exit code 128 plus the signal number, and SIGILL is signal 4.

Waiting for Kong to start... (8/10)
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":132,"Error":"", ...}

Waiting for Kong to start... (9/10)
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":132,"Error":"", ...}

Waiting for Kong to start... (10/10)
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":132,"Error":"", ...}

Failed waiting for Kong to start
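The mapping from exit code 132 to SIGILL can be checked with a few lines of Python. This is only a minimal sketch of the "128 plus signal number" convention. The helper name describe_exit_code is made up for illustration and is not part of Docker or Kong.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Interpret an exit code using the common 128 + signal-number convention.

    Hypothetical helper for illustration only.
    """
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a signal known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name} (signal {signum})"
    return f"exited normally with status {exit_code}"


print(describe_exit_code(132))  # terminated by SIGILL (signal 4)
```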

Into the Core

This uncommon signal means the CPU encountered an invalid instruction while executing the program. To find out exactly which instruction had raised the signal, I printed the kernel message buffer with the dmesg command:

traps: nginx[2502] trap invalid opcode ip:7f65b2d8d2a0 sp:7ffd2104d880 error:0 in libgmp.so.10.4.1[7f65b2d5f000+275000]

From the above message, we can tell that the invalid instruction was in the libgmp.so that shipped with this Kong internal nightly Docker image. GMP is a free library for arbitrary precision arithmetic. It powers the JWT and OpenID Connect features in Kong. Subtracting the base address 0x7f65b2d5f000 from the address held by the instruction pointer (IP), 0x7f65b2d8d2a0, shows that the bad instruction resides at offset 0x2e2a0 in libgmp.so.10.4.1.
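The subtraction is easy to get wrong by hand, so here is a small sketch that pulls the addresses out of the trap line and computes the offset. The regular expression and the function name faulting_offset are assumptions written for this one message format. They are not a general dmesg parser.

```python
import re
from typing import Tuple

# Matches the "ip:<addr> ... in <library>[<base>+<size>]" part of a trap line.
TRAP_PATTERN = re.compile(
    r"ip:(?P<ip>[0-9a-f]+).* in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def faulting_offset(trap_line: str) -> Tuple[str, int]:
    """Return (library name, offset of the faulting instruction inside it)."""
    match = TRAP_PATTERN.search(trap_line)
    if match is None:
        raise ValueError("not a recognizable trap line")
    ip = int(match["ip"], 16)      # absolute address of the faulting instruction
    base = int(match["base"], 16)  # address where the library was mapped
    return match["lib"], ip - base


line = (
    "traps: nginx[2502] trap invalid opcode ip:7f65b2d8d2a0 "
    "sp:7ffd2104d880 error:0 in libgmp.so.10.4.1[7f65b2d5f000+275000]"
)
lib, offset = faulting_offset(line)
print(lib, hex(offset))  # libgmp.so.10.4.1 0x2e2a0
```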

With the relative offset known, the shared library file can be disassembled with the objdump -D libgmp.so.10.4.1 command. The following instruction showed up at that offset:

2e2a0: 66 4d 0f 38 f6 d2 adcx r10, r10

The first thing that caught my eye was the unusual ADCX instruction at this location. Some quick Googling showed that ADCX belongs to ADX, Intel’s arbitrary-precision arithmetic extension to the x86 instruction set, and that Broadwell was the first microarchitecture to support it. Could it be that the CPU used by the GitHub Actions runner was older and lacked support for ADX? To confirm the CPU model the GitHub Actions runner used, I printed out the content of /proc/cpuinfo:

Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz

According to the Intel website, the runner assigned by GitHub was using an Intel Xeon E5-2673 v3 processor, which is based on the Haswell microarchitecture that Intel introduced in 2013.

As previously mentioned, Broadwell was the first microarchitecture to support the ADX instruction set extension, which means that the Haswell processor used by the runner indeed could not understand the newer ADCX instruction. That explained why the program crashed on this instruction. It could also be an issue for customers running Kong in production, since we cannot know ahead of time which CPUs they will use. Therefore, shipping binaries that depend on a newer instruction set extension is unsuitable for Kong’s use cases.
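There is also a more direct check than looking up the CPU model. On Linux, the flags line in /proc/cpuinfo lists the feature flag adx when the CPU supports the extension. The sketch below assumes that format. The sample text is a shortened, illustrative stand-in, not the runner's actual output, and the function names are made up for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supports_adx(cpuinfo_text: str) -> bool:
    """True if the CPU advertises the ADX extension (ADCX/ADOX instructions)."""
    return "adx" in cpu_flags(cpuinfo_text)


# Illustrative, heavily truncated sample; a real flags line is much longer.
sample = (
    "model name\t: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz\n"
    "flags\t\t: fpu vme sse2 avx2 bmi2\n"
)
print(supports_adx(sample))  # False
```

On a real machine, the same function can be fed the output of open("/proc/cpuinfo").read().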

Now that we know what the issue was, the next question is: how long has this issue existed in our codebase?

After inspecting the libgmp binaries shipped in the Docker images we had built in previous months, we found that none of them contained instructions from the ADX extension; only the latest internal nightly preview at the time of debugging did. That suggested the build options or environment for GMP had changed, leading to ADX instructions being introduced during the build process.
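This kind of inspection can be automated by searching the disassembly for the two ADX mnemonics, ADCX and ADOX. The following is a rough sketch rather than the exact procedure we used. The function names are hypothetical, and it assumes objdump is installed and prints one instruction per line.

```python
import re
import subprocess

# ADX consists of exactly two instructions: ADCX and ADOX.
ADX_PATTERN = re.compile(r"\b(adcx|adox)\b")


def find_adx_instructions(disassembly: str) -> list:
    """Return the disassembly lines that contain an ADX instruction."""
    return [ln.strip() for ln in disassembly.splitlines() if ADX_PATTERN.search(ln)]


def disassemble(path: str) -> str:
    """Disassemble the executable sections of a binary with objdump (must be installed)."""
    result = subprocess.run(
        ["objdump", "-d", path], capture_output=True, text=True, check=True
    )
    return result.stdout


# The first line is the one from this article; the second is a made-up non-ADX line.
sample = (
    "2e2a0: 66 4d 0f 38 f6 d2 adcx r10, r10\n"
    "2e2a6: 4c 89 d0 mov rax, r10\n"
)
print(find_adx_instructions(sample))
```

An empty list for find_adx_instructions(disassemble("libgmp.so.10.4.1")) would suggest the library is free of ADX instructions.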

When I raised this, a coworker mentioned that we had recently changed the AWS instance type used to run our build pipeline to a larger instance with a newer CPU model.

In this case, the libgmp build script tries to be smart and uses the processor type of the build machine to determine which processor extensions are available. This is usually desirable, as it lets the library use newer and faster CPU instructions to accelerate numerical computations, but it also had the side effect of introducing instructions from extensions like ADX, which older CPUs do not support.

To address the issue, we needed to explicitly tell the libgmp build script not to make optimization decisions based on the CPU model of the build machine. Luckily, the libgmp documentation contains a “Build Options” page that explains exactly how to do that.

In the end, we added a “--build=$(uname -m)-linux-gnu” option, which specifies a less specific processor type, so that the compiled binary does not contain instructions that older processors do not support. After this change, we disassembled the binary file again and confirmed that the compiled GMP library no longer included ADX instructions. The issue was fixed before the upcoming customer-facing release shipped.
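For build tooling written in Python, the same option could be assembled as sketched below. This is not our actual build script. The function name gmp_configure_args is hypothetical, and the sketch assumes that platform.machine() reports the same architecture string as uname -m, which is the case on Linux.

```python
import platform


def gmp_configure_args(machine: str = "") -> list:
    """Build a configure command line that names only a generic CPU type.

    Hypothetical helper: 'machine' defaults to the build host's architecture.
    """
    machine = machine or platform.machine()
    return ["./configure", f"--build={machine}-linux-gnu"]


print(gmp_configure_args("x86_64"))  # ['./configure', '--build=x86_64-linux-gnu']
```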

Conclusion

Sometimes, seeing is not entirely believing. When random test failures like these happen, it is often tempting to just “hit the re-run button” and call it a day.

But as this example shows, it is worth digging deeper to understand the cause of the symptom. By doing so, we caught a subtle issue caused by a seemingly innocent change of build machine instance type, and kept it from becoming a production issue for our customers in the next release.

This is just one example of the efforts we at Kong make to continuously monitor and improve the quality of the software we ship. As we always strive to achieve, the end result is a more stable foundation that our customers can rely on.
