June 17, 2026

Hugo Guerrero

Principal Tech PMM, Kong

*This is part of a three-part series. For the full story, see [Your AI Agent Knows What, It Doesn't Know Why](https://konghq.com/blog/enterprise/durable-commit-log-ai-observability) and [Building the Agentic Commit Log](https://konghq.com/blog/engineering/agentic-commit-log-kafka-kong).*

Once you accept that [agentic AI systems need a durable commit log](https://konghq.com/blog/enterprise/durable-commit-log-ai-observability) (an ordered, immutable record of every tool call, decision, and context shift), the next question is the one architects actually argue about: which log?

Vector databases aren't logs. Key-value stores aren't logs. Your existing observability stack isn't a log in the sense that matters here. The agent's memory backbone needs specific properties, and not every event streaming system delivers all of them at the level of confidence that production agentic systems require. This post makes the case for Apache Kafka specifically — not because it's the most familiar name in the space, but because it was built around precisely the properties this workload depends on.

## The properties that matter

Before making the Kafka argument, it's worth being precise about what the memory log actually needs to do. A durable commit log for agentic AI isn't just a message bus. It's the substrate that makes replay, auditability, and governance possible. That places specific demands on the protocol.

**It needs to retain data for months or years, not minutes or hours.** Agent action logs aren't transient operational events. They're evidence. When a regulated workflow runs in June and an auditor asks about it in December, the log needs to have the answer. When a model improvement team wants to replay interactions from three months ago to test a new configuration, the history needs to be there. The retention model has to be designed for long-horizon storage, not optimized for throughput with archival bolted on as an afterthought.

**It needs strict, verifiable ordering.** Replay is only meaningful if you can reconstruct the exact sequence of events as they originally occurred. That requires per-partition ordering guarantees that are strong and simple to reason about — not probabilistic, not "usually ordered," not "ordered within a time window." The agent's reasoning chain is causal: event A caused event B caused event C. The log has to preserve that causality with certainty, not inference.

**It needs schema governance that compounds over time.** An action log full of inconsistently structured or malformed events is useless for downstream consumers, replay, or audit. Schemas need to be versioned, compatibility needs to be enforced at write time, and the governance infrastructure needs to be mature enough to handle the schema evolution that comes with any long-lived system. This isn't something you can retrofit. It has to be built in from the start.

**It needs an ecosystem that connects to everything downstream.** The log isn't the end state. It feeds vector databases, compliance stores, analytics warehouses, SIEM (Security Information and Event Management) platforms, quality evaluation loops, and model training pipelines. The data movement layer between the log and these downstream systems needs to be reliable, governed, and — ideally — something you configure rather than build.

## Why Kafka, specifically

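Kafka was designed around the log as a first-class abstraction. Retention is a per-topic setting rather than a fixed property of the system: brokers default to seven days (`log.retention.hours=168`) with the `delete` cleanup policy, but setting `retention.ms=-1` on a topic keeps its data indefinitely, and compaction only happens if you opt in. That makes long retention a deliberate, supported configuration, which is what an agent memory layer needs. The log accumulates. You keep it. You make intentional decisions about what to expire and when.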

Tiered storage (early access in Apache Kafka 3.6, production-ready as of 3.9, and offered in various forms by Confluent and by Kafka-compatible platforms such as Redpanda and WarpStream) makes years of retention economically practical by moving older log segments to cheap object storage while keeping them readable through the normal consumer API. For an agentic system that might need to replay interactions from a year ago to support an audit or a training run, tiered storage means the economics of retention are a configuration decision, not an infrastructure crisis.
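As a rough illustration of what "a configuration decision" looks like, the sketch below builds topic-level settings for a long-retention agent log. The keys (`cleanup.policy`, `retention.ms`, `remote.storage.enable`, `local.retention.ms`) are standard Apache Kafka topic configs; the two-year and seven-day periods and the function names are assumptions chosen for the example, and tiered storage also has to be enabled on the brokers before the remote settings take effect.

```python
from datetime import timedelta


def to_ms(delta: timedelta) -> int:
    """Kafka expresses retention periods in milliseconds."""
    return int(delta.total_seconds() * 1000)


def agent_log_topic_config(total_retention: timedelta,
                           local_retention: timedelta) -> dict[str, str]:
    """Build topic-level configs for a long-retention agent action log.

    Values are strings because that is how Kafka admin tooling accepts them.
    """
    if local_retention > total_retention:
        raise ValueError("local retention cannot exceed total retention")
    return {
        "cleanup.policy": "delete",
        "retention.ms": str(to_ms(total_retention)),
        "remote.storage.enable": "true",
        "local.retention.ms": str(to_ms(local_retention)),
    }


config = agent_log_topic_config(
    total_retention=timedelta(days=730),  # assumed: two years for audit needs
    local_retention=timedelta(days=7),    # assumed: one week on broker disks
)
```

Setting `retention.ms` to `-1` instead removes the time limit altogether.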

Kafka's ordering model is the cleanest implementation of what an event log needs to be. Within a partition, events are strictly ordered by offset, a monotonically increasing integer assigned at write time. An agent action at offset 4,283,019 was appended after the action at offset 4,283,018 in the same partition. Given a topic and partition, you can point to a specific moment in the agent's history with a single number. You can replay from that exact moment. You can fork from that exact moment. You can give an auditor a precise, verifiable reference to the agent's state at a specific point in a session. This sounds simple. It's not a given. Several streaming systems make ordering guarantees that are weaker, more conditional, or harder to operationalize. Kafka makes the strong guarantee, and the surrounding tooling assumes it.
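The offset idea is easy to see in a toy model. The sketch below is an in-memory stand-in for a single partition, not a Kafka client: `PartitionLog` and the `agent.decision` event type are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy append-only log for a single partition (not a Kafka client)."""
    events: list[dict] = field(default_factory=list)

    def append(self, event: dict) -> int:
        """Append an event and return the offset it was assigned."""
        self.events.append(event)
        return len(self.events) - 1

    def read_from(self, offset: int) -> list[tuple[int, dict]]:
        """Return (offset, event) pairs starting at the given offset."""
        return list(enumerate(self.events[offset:], start=offset))


log = PartitionLog()
log.append({"type": "agent.tool.invoked", "tool": "search"})
decision_offset = log.append({"type": "agent.decision", "choice": "escalate"})
log.append({"type": "agent.tool.invoked", "tool": "create_ticket"})

# Replay everything from the decision onward, in the original order.
replayed = log.read_from(decision_offset)
```

The single integer returned by `append` is all a consumer or auditor needs to find that moment again within the partition.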

The `session_id`-to-partition mapping is the practical expression of this guarantee in an agentic context. When you partition by `hash(session_id)`, every event from a single agent session — every tool call, every reasoning step, every decision, every model response — lands in the same partition in the order it occurred. The causal chain of an agent's reasoning is preserved at the storage layer, before any consumer touches it, without any special logic in the agent itself.
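A minimal sketch of that mapping, under one loud assumption: Kafka's default partitioner hashes the key bytes with murmur2, while this example uses SHA-256 purely to get a deterministic value. The function name, session IDs, and partition count are all made up for illustration.

```python
import hashlib


def partition_for(session_id: str, num_partitions: int) -> int:
    """Map a session ID to a partition with a stable hash.

    Illustrative only: Kafka's default partitioner uses murmur2 on the
    key bytes; SHA-256 stands in here as a deterministic hash.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


events = [
    ("session-a", "tool call"),
    ("session-b", "tool call"),
    ("session-a", "model response"),
    ("session-a", "decision"),
]

# Route each event to its partition, appending in arrival order.
partitions: dict[int, list[tuple[str, str]]] = {}
for session_id, step in events:
    partitions.setdefault(partition_for(session_id, 12), []).append((session_id, step))

# All of session-a's steps sit in one partition, in their original order.
session_a_steps = [
    step
    for partition in partitions.values()
    for sid, step in partition
    if sid == "session-a"
]
```

One design consequence worth noting: the mapping depends on the partition count, so adding partitions to a topic later sends new events for an existing key to a different partition.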

Schema governance in the Kafka ecosystem is mature in a way that matters. A decade of production experience with schema registries (Confluent Schema Registry, Apicurio), compatibility rules (`BACKWARD`, `FORWARD`, `FULL`), and governance workflows has produced tooling and operational practices that work reliably at scale. AsyncAPI 3.1 (the specification for event-driven APIs, roughly the equivalent of OpenAPI for async communication) has strong Kafka support and is the right contract layer for agentic event topics. The compatibility modes map directly onto what a long-lived log needs. `BACKWARD` compatibility means a consumer upgraded to the new schema can still read every event written under the previous one, which is what replay over old history depends on. `FORWARD` compatibility covers the other direction: events written with a new version of the `agent.tool.invoked` schema can still be read by the compliance consumer or SIEM connector built against the previous version. `FULL` enforces both. Schema drift silently destroys the value of a long-lived action log. Kafka's governance ecosystem exists specifically to prevent it.
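To make the compatibility directions concrete, here is a deliberately simplified check in the spirit of Avro-style schema resolution. It models a schema as field names plus whether each field has a default, and ignores type changes entirely; a real schema registry does considerably more. The field names other than the schema's subject are hypothetical.

```python
Schema = dict[str, bool]  # field name -> True if the field has a default


def can_read(reader: Schema, writer: Schema) -> bool:
    """A reader can decode writer data if every reader field the writer
    lacks has a default (simplified; type changes are ignored)."""
    return all(has_default for name, has_default in reader.items()
               if name not in writer)


def is_backward(new: Schema, old: Schema) -> bool:
    """BACKWARD: consumers on the new schema can read data written with the old."""
    return can_read(reader=new, writer=old)


def is_forward(new: Schema, old: Schema) -> bool:
    """FORWARD: consumers on the old schema can read data written with the new."""
    return can_read(reader=old, writer=new)


v1 = {"session_id": False, "tool": False}
v2_optional = {"session_id": False, "tool": False, "latency_ms": True}
v2_required = {"session_id": False, "tool": False, "latency_ms": False}
v2_removed = {"session_id": False}
```

In this model, adding an optional field passes both checks, adding a required field passes only the forward check, and removing a required field passes only the backward check.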

The downstream ecosystem argument is the one that's easiest to underestimate. When you choose Kafka as your memory backbone, you're not just choosing a protocol for storing events. You're choosing access to Kafka Connect with hundreds of pre-built source and sink connectors, Flink for stateful stream processing, ksqlDB (a streaming SQL engine) for continuous evaluation and aggregation, and Spark Structured Streaming for large-scale analytics. Every downstream system your agentic stack needs to feed — Pinecone for vector search, Snowflake for analytics, Splunk for security monitoring, S3 for compliance archival — already has a battle-tested, production-grade Kafka connector. You configure these integrations. You don't build them. That is a meaningful difference in the pace at which you can stand up a complete agentic memory architecture.
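"Configure, don't build" usually means posting a JSON document to the Kafka Connect REST API. The sketch below assembles such a body for an S3 archival sink using configuration keys from the Confluent S3 sink connector; the connector name, bucket, region, and flush size are placeholder assumptions, and a real deployment would need credentials and likely more settings.

```python
import json


def s3_sink_connector(name: str, topic: str, bucket: str, region: str) -> dict:
    """Build the JSON body for Kafka Connect's POST /connectors endpoint."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "s3.bucket.name": bucket,
            "s3.region": region,
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "1000",  # assumed: records per object written
        },
    }


payload = s3_sink_connector(
    name="agent-log-compliance-archive",  # hypothetical connector name
    topic="agent.tool.invoked",
    bucket="example-compliance-archive",  # hypothetical bucket
    region="us-east-1",
)
body = json.dumps(payload, indent=2)
```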

## Replay at scale is a solved problem in Kafka environments

Replay deserves its own emphasis because it's the capability that makes everything else worthwhile — and it's where alternatives to Kafka often fall short.

Replaying a high-throughput Kafka topic from an earlier offset is a routine operational procedure. Teams do it constantly — to recover from bugs, to backfill new consumers, to regenerate derived state after a schema change. The tooling, the operational practices, the runbooks — all of it exists. It's not an exotic operation. It doesn't require heroic engineering. For an agentic system, this means the most powerful capability in the memory architecture (running yesterday's production interactions against a new model or a modified prompt) is available from Day 1, backed by infrastructure that's been running reliably in production environments for a decade.

The alternative — reconstructing replay after the fact from snapshot state, application logs, and distributed traces — is both brittle and expensive. It rarely gives you full fidelity. It doesn't scale to the volume of interactions a production agentic system generates. And it requires custom engineering for every new replay scenario. Kafka's offset model gives you replay as a first-class, infrastructure-supported operation, not a custom engineering project you undertake every time someone wants to test a change.
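Here is a small sketch of what a replay comparison can look like once the history is available by offset. It runs recorded events through the current logic and a candidate change and reports where they diverge. The event fields and both policy functions are invented for the example; in a real system the list would come from a consumer that seeks to the chosen offset.

```python
from typing import Callable

Event = dict[str, str]
Handler = Callable[[Event], str]


def replay(events: list[Event], start_offset: int,
           baseline: Handler, candidate: Handler) -> list[tuple[int, str, str]]:
    """Re-run recorded events from start_offset through two handlers and
    return (offset, baseline_output, candidate_output) where they differ."""
    diffs = []
    for offset, event in enumerate(events[start_offset:], start=start_offset):
        old, new = baseline(event), candidate(event)
        if old != new:
            diffs.append((offset, old, new))
    return diffs


recorded = [
    {"intent": "refund", "amount": "20"},
    {"intent": "refund", "amount": "900"},
    {"intent": "status", "amount": "0"},
]


def current_policy(event: Event) -> str:
    return "approve" if event["intent"] == "refund" else "answer"


def candidate_policy(event: Event) -> str:
    # Hypothetical change under test: escalate large refunds.
    if event["intent"] == "refund" and int(event["amount"]) > 500:
        return "escalate"
    return current_policy(event)


diffs = replay(recorded, 0, current_policy, candidate_policy)
```

Each entry in `diffs` carries the offset, so a reviewer can go back to the exact recorded event that the candidate would have handled differently.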

## Where governance enters

A Kafka cluster that anyone can write to and read from is not an enterprise memory layer. It's a liability. The governance question — who can publish to which topic, what schemas are enforced, what PII gets masked before it hits the broker, what retention policies apply, how audit lineage is maintained — is not a Kafka configuration question. It's a control plane question that needs to be answered at the gateway layer.

This is where Kong Event Gateway plays its role. It sits in front of the Kafka cluster and enforces topic access control, schema validation against the AsyncAPI registry, PII redaction before events reach the broker, retention and compaction policy, dead-letter routing, and audit lineage, consistently across every agent and every framework writing to the log. Kong AI Gateway handles the same governance concerns on the synchronous traffic side: every LLM call, every MCP tool invocation, every agent-to-agent communication. The combination gives you a single governed data path that spans both the synchronous and asynchronous layers of your agentic stack.

The agents don't change. The frameworks don't change. The governance is in the infrastructure, and it applies uniformly to everything that flows through it.
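The kinds of checks described above can be pictured as a function that every event passes through before it is published. The sketch below is a conceptual illustration only and is not Kong Event Gateway's API or configuration format: the ACL table, required fields, principal name, and email pattern are all hypothetical.

```python
import re

# Hypothetical policy: which principal may publish to which topics.
TOPIC_ACL = {
    "support-agent": {"agent.tool.invoked"},
}
REQUIRED_FIELDS = {"session_id", "tool"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class PolicyViolation(Exception):
    """Raised when an event is rejected before it reaches the broker."""


def enforce(principal: str, topic: str, event: dict) -> dict:
    """Apply access control, a required-field check, and email redaction."""
    if topic not in TOPIC_ACL.get(principal, set()):
        raise PolicyViolation(f"{principal} may not publish to {topic}")
    missing = REQUIRED_FIELDS - set(event)
    if missing:
        raise PolicyViolation(f"missing fields: {sorted(missing)}")
    # Redact email addresses in string values; leave other values untouched.
    return {
        key: EMAIL.sub("[REDACTED]", value) if isinstance(value, str) else value
        for key, value in event.items()
    }


clean = enforce(
    "support-agent",
    "agent.tool.invoked",
    {"session_id": "s-1", "tool": "lookup", "input": "find jane@example.com"},
)
```

Because the checks live in one place in front of the log, the agent code that produced the event needs no knowledge of them.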

## The practical implication

Choosing Kafka as your agentic memory backbone is a technology selection decision with long-term consequences. The log will accumulate events for as long as your agents run. The consumers you build against it will evolve as your stack evolves. The compliance evidence stored in it will need to be available years from now.

The right infrastructure for that commitment is one built around the log as a first-class abstraction, with the retention model, the ordering guarantees, the schema governance, and the downstream ecosystem to support everything the log has to do. Kafka has been doing exactly that work in production environments for a decade. The agentic AI use case isn't pushing the boundaries of what Kafka can do. It's squarely within them.

That's not Kafka maximalism. It's matching the protocol to the workload.

*Want to go deeper on the implementation? Our [technical blueprint](https://konghq.com/blog/engineering/agentic-commit-log-kafka-kong) covers the full topic catalog, schema design, and how Kong AI Gateway and Kong Event Gateway connect as the governance layer over this stack. Or explore [Kong AI Gateway](https://konghq.com/products/kong-ai-gateway) and [Kong Event Gateway](https://konghq.com/products/kong-event-gateway) directly.*

**Topics**

- [AI Gateway](/blog/tag/ai-gateway)
- [Event Gateway](/blog/tag/event-gateway)
- [Kafka](/blog/tag/kafka)
- [Agentic AI](/blog/tag/agentic-ai)


# Kafka Was Built for This: The Case for Kafka as the Agent's Memory Layer
