# How We Used Agentic AI to Fix Kong Gateway's Flakiest Tests

June 10, 2026

Datong Sun

Each change to Kong Gateway's codebase triggers a comprehensive test suite that runs more than 17,000 test cases on each of the two primary architectures we support (x86 and ARM), or more than 34,000 in total. This process takes about 23.5 hours on a single machine. But we don't wait that long. A large fleet of machines runs the suite in parallel, and we shard the work aggressively so each commit finishes in a fraction of that time. That setup works well, right up until flaky tests get involved.

On the Kong engineering team, we've been exploring how an agentic AI workflow can make developing Kong Gateway more efficient. Fixing flaky tests turned out to be the ideal place to start.

## The problem

Quality is at the core of every product we build. To move fast without compromising on quality, we rely on an extensive set of test suites that tell us quickly whether a change has caused a regression. Those suites are large enough that a single full run takes about 23.5 hours on a single machine, which is why we spread the work across a large fleet.

But as we ship more features, the suite keeps growing, and more flaky tests start showing up.

For the uninitiated, flaky tests are tests that fail sometimes but usually pass on a rerun. They're unavoidable on large projects and a constant drain on engineers: after a flaky failure, you have to rerun the job, which costs both engineering time and machine time. Worse, as the flaky rate climbs, engineers trying to merge PRs start stepping on each other during reruns, creating a feedback loop that can consume the entire CI fleet’s capacity. Once that happens, queueing time spikes, and engineers sometimes wait anywhere from 30 minutes to an hour before a test even starts.
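
A back-of-the-envelope model shows why a rising flaky rate hurts so quickly. It assumes, purely for illustration, that each flaky test fails independently with a fixed probability and that a failure means rerunning the whole job. The rates below are made up, not measurements from Kong's CI.

```python
import math


def pass_probability(failure_rates):
    """Probability that one full run is green when every flaky test
    fails independently with its own rate."""
    return math.prod(1.0 - p for p in failure_rates)


def expected_runs_until_green(failure_rates):
    """Mean number of full runs needed to see one green run
    (geometric distribution with success probability pass_probability)."""
    return 1.0 / pass_probability(failure_rates)


# Hypothetical numbers: 20 flaky tests, each failing 1% of the time.
rates = [0.01] * 20
p_green = pass_probability(rates)        # roughly 0.82
runs = expected_runs_until_green(rates)  # roughly 1.22 runs per merge
```

Even with every individual test passing 99% of the time, about one run in five goes red under this model, and each red run puts another job into the queue.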

## Flaky tests are hard to fix

The obvious question is: **why not just ask engineers to fix the flaky tests?**

We did. It never worked well. And we realized there are two main reasons behind this:

- First, **the reason a test is flaky is not always obvious.** Finding it often takes a deep understanding of both the feature under test and its implementation. Even the test's original author usually can't spot the real cause; if they could, they wouldn't have written a flaky test in the first place.
- Second, even once an engineer identifies the cause, **fixing it properly takes extensive validation**. Because the failure is intermittent, running the fixed test once proves nothing. Engineers end up babysitting a PR for hours, rerunning it many times to confirm the flakiness is actually gone.
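
To see why a single green run proves so little, consider a sketch that assumes reruns are independent and that the test's failure rate before the fix is known (for example, from a dashboard). Under that assumption, the chance that an unfixed test passes n times in a row is (1 - p)^n, which gives a rough lower bound on how many reruns a validation needs. The function name and the 5% threshold are illustrative choices.

```python
import math


def reruns_needed(failure_rate, max_false_confidence=0.05):
    """Smallest n such that an unfixed test with this failure rate would
    pass n consecutive independent reruns with probability at most
    max_false_confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_confidence < 1.0:
        raise ValueError("max_false_confidence must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_confidence) / math.log(1.0 - failure_rate))


# A test that fails 10% of the time needs 29 clean reruns before a lucky
# streak becomes less than 5% likely; at a 2% failure rate it needs 149.
streak_for_10_percent = reruns_needed(0.10)
streak_for_2_percent = reruns_needed(0.02)
```

The rarer the failure, the longer the streak has to be, which is why validating a fix by hand turns into hours of babysitting.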

All of this makes fixing flaky tests an expensive investment. It often takes one experienced engineer a full day to fix two of them. It's also a battle we can't win outright. As we build more features, we have to keep pouring engineering time into keeping CI flakiness at an acceptable level, or risk blocking new development entirely and rendering CI unusable.

## Can we use AI to fix flaky tests?

By now you can probably see why flaky tests are expensive for any large engineering project, and expensive for humans to fix. When agentic coding went mainstream, this was one of the first use cases we considered, because:

- Agents can scan far more of the codebase and logs than a human can, and quickly, which gives them a better shot at finding a flaky test's root cause.
- Agents can babysit the validation of a fix for hours without intervention, which humans are terrible at.

So we ran an experiment attempting to use Claude Code to fix the flaky tests in Kong Gateway.

## The identify-fix-verify loop

The first thing we needed was a way to identify which tests were flaky and how often they failed. Luckily, the team had already built a dashboard on top of Datadog's CI Visibility feature that gives us a clear picture of the flakiest tests and their failure rates.

We used that dashboard as the benchmark for our fixes, and used its list to seed the agents.

Here's the agentic workflow we settled on after a few rounds:

At the top level is the **fix-flakes** skill, the orchestrating agent, which runs on Opus. It takes a seed file — a flaky test's Datadog CSV export — and reads it to download all the CI logs from that test's recent failures. This step matters: failure logs usually contain vital clues about what's going wrong (as an example, a test failure after running for exactly 20 seconds could signal agents to look closely at timeout conditions), and they keep the agent grounded instead of guessing and hallucinating.
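
The 20-second example can be turned into a simple heuristic. The sketch below is only an illustration of the kind of clue an agent might pick up from failure logs. The function name, the tolerance, and the idea of checking against multiples of 5 seconds are all assumptions, not part of the fix-flakes skill.

```python
from statistics import mean, pstdev


def looks_like_timeout(durations_s, tolerance_s=0.5, round_to_s=5):
    """Heuristic: failures that all take almost the same time, close to a
    round number of seconds, hint that a timeout is being hit."""
    if len(durations_s) < 2:
        return False  # one sample cannot show a pattern
    avg = mean(durations_s)
    tightly_clustered = pstdev(durations_s) <= tolerance_s
    nearest_round = round(avg / round_to_s) * round_to_s
    return (
        tightly_clustered
        and nearest_round > 0
        and abs(avg - nearest_round) <= tolerance_s
    )


# Hypothetical failure durations extracted from CI logs, in seconds.
suspicious = looks_like_timeout([20.1, 20.0, 20.2])
unremarkable = looks_like_timeout([3.2, 11.7, 27.9])
```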

Once the failing test is identified and its logs are downloaded, **fix-flakes** spawns an Opus **flake-fixer** subagent to investigate and produce a fix. This is usually the most expensive step in token terms, because the **flake-fixer** often has to sift through multiple log files and dozens of source and test files to form a hypothesis and write a fix. Even so, it generally runs in under 10 minutes.

When **flake-fixer** is satisfied, it creates a branch, commits the fix, and pushes it to the repository. It then reports a summary of its theory back to the orchestrator and exits.

The orchestrator then launches a **flake-verifier** subagent to check the fix. **flake-verifier** is a smaller Haiku agent. We use an agent for this step because each rerun produces more than one failure (there are other flaky tests in the suite), and the verifier has to decide whether the test in question is fixed while filtering out the rest as noise. The CI output is scattered across many unstructured log files, and we found Haiku to be the easiest and most reliable way to judge whether a rerun succeeded. **flake-verifier** polls the run periodically until it can call success or failure, then reports back.

The orchestrator keeps spawning **flake-verifier** subagents to match the "rerun streak" configured when the skill is invoked. If no failure shows up across the full streak, **fix-flakes** opens a PR for review. If a **flake-verifier** reports a recurrence at any point, the orchestrator downloads the new failure logs, combines them with the previous **flake-fixer** summaries, spawns a fresh **flake-fixer**, and repeats until the test is fixed and verified or a maximum attempt count is reached, at which point it reports the failure.

Finally, the verified PR goes through normal human code review, same as any PR a person opens, and gets merged.
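
The control flow described in this section can be sketched in a few lines. This is a minimal sketch, not the actual skill: the real orchestrator is an agent driven by prompts, and every name here (`fix_flake`, `run_fixer`, `run_verifier`, `fetch_new_logs`, and the default streak and attempt counts) is a hypothetical stand-in for a subagent or a piece of CI plumbing.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FixAttempt:
    branch: str   # branch the fixer pushed its commit to
    summary: str  # the fixer's short theory of the flakiness


@dataclass
class Outcome:
    fixed: bool
    attempts: List[FixAttempt] = field(default_factory=list)


def fix_flake(
    seed_logs: List[str],
    run_fixer: Callable[[List[str], List[str]], FixAttempt],
    run_verifier: Callable[[str], bool],
    fetch_new_logs: Callable[[str], List[str]],
    rerun_streak: int = 5,   # assumed default, configured per invocation
    max_attempts: int = 3,   # assumed default
) -> Outcome:
    """Identify-fix-verify loop with injected stand-ins for the subagents."""
    logs = list(seed_logs)
    attempts: List[FixAttempt] = []
    for _ in range(max_attempts):
        # A fresh fixer each round: it sees the logs plus prior summaries only.
        attempt = run_fixer(logs, [a.summary for a in attempts])
        attempts.append(attempt)
        # One verifier per rerun; all() stops at the first recurrence.
        if all(run_verifier(attempt.branch) for _ in range(rerun_streak)):
            return Outcome(fixed=True, attempts=attempts)  # ready for a PR
        # Recurrence: add the new failure logs and try another theory.
        logs += fetch_new_logs(attempt.branch)
    return Outcome(fixed=False, attempts=attempts)
```

Passing the subagents in as callables keeps the loop itself trivial to test with fakes, which is one way to check the retry and streak logic without touching CI.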

## The need to preserve context

Why design the loop this way? Mostly for context management. Keeping a model's context concise helps it hold attention, and it sharply reduces the cost of running a long agentic task like this.

**flake-fixer** is the most context-heavy agent in the system: it has to read through large log files and source code to build a solid theory of the flakiness. Carrying every previous fix along wouldn't help the next investigation much, but it would fill the context window fast. So we ask each **flake-fixer** to produce exactly one analysis and one fix, and to hand a summary back to the orchestrator — that way the next **flake-fixer** knows what's already been tried and can reach for new theories.
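
One way to picture this handoff is as a small brief assembled by the orchestrator. The sketch below is an assumption about shape only: the function name, the wording of the brief, and the character cap are hypothetical, and the real skill passes this information through agent prompts.

```python
def build_fixer_brief(test_name, log_paths, prior_summaries, max_summary_chars=500):
    """Compose the only context a fresh fixer receives: the target test,
    where its failure logs live, and a trimmed note per earlier attempt."""
    lines = [f"Target test: {test_name}", "Failure logs:"]
    lines += [f"  - {path}" for path in log_paths]
    if prior_summaries:
        lines.append("Already tried (did not pass verification):")
        for i, summary in enumerate(prior_summaries, start=1):
            trimmed = summary.strip()
            if len(trimmed) > max_summary_chars:
                # Cap each note so old theories cannot crowd out new evidence.
                trimmed = trimmed[:max_summary_chars].rstrip() + " [truncated]"
            lines.append(f"  {i}. {trimmed}")
    return "\n".join(lines)


# Hypothetical inputs for a second attempt.
brief = build_fixer_brief(
    "spec/example_spec.lua",
    ["logs/run-1.txt", "logs/run-2.txt"],
    ["Suspected a mutex timeout; fix did not survive the rerun streak."],
)
```

The previous investigation's logs, diffs, and reasoning stay behind; only a few lines per attempt cross over to the next fixer.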

Subagents also let us use smaller, cheaper models like Haiku for steps that don't need heavy reasoning, like verifying a result, which cuts costs further.

With this workflow, one flaky test can generally be fixed and verified in 4–5 hours — most of that spent in the verification loop, since CI itself takes a while — using an average of 100–200k tokens start to finish. The process is highly automated and needs no human intervention while the agents run. Fixing several tests at once is just a matter of running multiple instances of Claude Code with different seed files, all running concurrently and not interfering with each other.

## How well does this work?

Over a week and a half, we ran this harness against the **15** flakiest tests on our dashboard. It fixed **12** of them and gave up on **3**. That result is striking, given that the same work would typically take an experienced engineer multiple weeks, often with less success.

It also did something we didn't ask for: while investigating the failures, the workflow surfaced two genuine bugs in our codebase.

### The `conf_loader` sorting bug

One test compares the output of our `conf_loader` against an expected table, and it had been failing at random for years. Engineers had tried to fix it before, but it never stayed fixed. The failure looked like this (trimmed for length):

    FAIL spec/01-unit/04-prefix_handler_spec.lua:994: NGINX conf compiler prepare_prefix() dumps Kong conf
    spec/01-unit/04-prefix_handler_spec.lua:997: Expected objects to be the same.
    Passed in:
    (table: 0x7fdcaafdf140) {
      *[nginx_http_directives] = {
         [7] = {
           [name] = 'lua_shared_dict'
          *[value] = 'prometheus_metrics 5m' }
         [8] = {
           [name] = 'lua_shared_dict'
           [value] = 'otel_metrics 5m' } }
    Expected:
    (table: 0x7fdcad507ba0) {
      *[nginx_http_directives] = {
         [7] = {
           [name] = 'lua_shared_dict'
          *[value] = 'otel_metrics 5m' }
         [8] = {
           [name] = 'lua_shared_dict'
           [value] = 'prometheus_metrics 5m' } } }

Rather than just reorder the output, the harness investigated the codebase and the `conf_loader` source and realized the loader wasn't sorting duplicate directives by their value. Combined with the nondeterministic ordering of hash-table enumeration, that left the ordering of duplicate `lua_shared_dict` directives unstable. The fix it produced addressed that instability at the source, without touching the test.

Our reviewer agreed this was the root cause, and our monitoring dashboard confirmed it: after the change, this five-year-old flaky test never recurred. All of it came out of the agentic loop — no human ever gave Claude any guidance.
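
The class of bug is easy to reproduce in miniature. The sketch below is not Kong's patch and is not written in Kong's language. It is a Python illustration that assumes the input order of entries is arbitrary (standing in for hash-table enumeration) and shows why sorting on the directive name alone leaves duplicates in an unstable order, while adding the value as a tie-breaker does not. The directive entries are example data.

```python
from itertools import permutations


def sort_by_name_only(entries):
    # Keyed on name alone: duplicate names keep whatever order they came in.
    return sorted(entries, key=lambda e: e["name"])


def sort_by_name_and_value(entries):
    # Value as tie-breaker: the result no longer depends on incoming order.
    return sorted(entries, key=lambda e: (e["name"], e["value"]))


entries = [
    {"name": "lua_shared_dict", "value": "prometheus_metrics 5m"},
    {"name": "lua_shared_dict", "value": "otel_metrics 5m"},
    {"name": "client_body_buffer_size", "value": "8k"},
]

# Try every possible enumeration order and record the distinct outputs.
name_only_results = {
    tuple(e["value"] for e in sort_by_name_only(list(order)))
    for order in permutations(entries)
}
with_value_results = {
    tuple(e["value"] for e in sort_by_name_and_value(list(order)))
    for order in permutations(entries)
}
# name_only_results holds two different orderings; with_value_results holds one.
```

A test that compares against one fixed expected table passes or fails depending on which of the two orderings the first variant happens to produce, which matches the intermittent failure shown in the log.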

### The concurrent auth race condition

The workflow turned up another genuine bug — two, actually. The test in question spawns 10 clients that hit our Admin auth endpoint at once. Occasionally, one of the requests fails, returning no session when it should.

Claude first suspected our `cluster_mutex` module, and it did find a small bug there: the mutex didn't return quickly when it was already held by a local request. But that fix did **not** pass the verification streak, so **flake-fixer** was spawned again to dig deeper. This time it identified a subtle data race in the session plugin the test relies on, and produced a fix that, once merged, made the flakiness disappear entirely.

In both cases, the agent fixed the flaky test without touching the test files at all. Through log and code analysis, it concluded the tests were written correctly, and the problems lay elsewhere. That's something humans tend to miss — we assume flakiness comes from "bad tests" and stop investigating the code path under test.

The whole Gateway team is feeling the difference. For the first time, CI stability is improving fast enough to be obvious. Merging a PR has gotten noticeably easier because far fewer reruns are needed — which saves a lot of engineering time and meaningfully cuts both the utilization and the queueing of our CI fleet.

## Lessons learned

**Try it on your flaky tests.** If you're still unsure whether agentic coding is a good idea, point it at your flaky test suite and see for yourself. You can't really make things worse — however badly it goes, it won't touch the product. It's about as risk-free as an experiment gets.

**Build a good feedback loop for your agent.** The only reason our flake fixer worked so well is that we told it exactly how to gather failure logs, understand the code path, and run verifications. Without those steps, even the most advanced model will hallucinate — convinced it fixed the issue while the test stays flaky. Agents are smart and will do their best with what you give them. What they're not good at is asking for the things you should be providing but aren't. That, more than anything, is why our flake fixer is effective enough to surface bugs nobody had noticed.

**Help the agent keep its context small.** When there's no reason to carry full context between tasks — like between **flake-fixer** spawns — start fresh and pass along only what the next step needs. It saves substantial time and tokens, and it lets the agent focus on the new investigation and produce a better analysis and fix.

**Use the right agent for the job.** Advanced reasoning models like Opus are worth it for complex code and flakiness analysis; for verification, a smaller model like Haiku does the job just as well.

**Don't over-engineer your harness.** It's tempting to invest heavily up front in a comprehensive harness: a library of polished, general-purpose scripts and skills meant to cover every scenario the agent might encounter. We found the opposite works better. Let the agent write throwaway scripts and one-off tools for the task in front of it; the models are more than capable of building the right thing on the spot, and a tool shaped for one specific job usually beats a generic one trying to handle them all. It also lifts the success rate: an agent running a script it just wrote for the task at hand needs far less troubleshooting than one trying to understand and run a general-purpose script built ahead of time, and it often uses fewer tokens overall. Spend your effort where it actually pays off: on the methodology for identifying and verifying flaky tests, and on where to get the data that backs those decisions. That's a far better investment than prescribing the exact script an agent must use for a task like rerunning a workflow or parsing the job output.

Overall, we've been pleasantly surprised by this experiment, and it has already become part of our development workflow. It has meaningfully improved our productivity and our confidence in the test suite, and it has significantly cut our engineering and CI costs.
