Kong AI Gateway 3.11: Reduce Token Spend, Unlock Multimodal Innovation
New Multimodal Capabilities, New AI Prompt Compression, Integration with AWS Bedrock Guardrails, and More
Today, I'm excited to announce one of our largest Kong AI Gateway releases (3.11), which ships with several new features that are critical for building modern, reliable AI agents in production. We strongly recommend updating to this version to get access to the latest and greatest that AI infrastructure has to offer.
The full change log can be found here.
Introducing 10+ GenAI capabilities, including multimodal endpoints
This release of Kong AI Gateway is significant for the sheer breadth of new GenAI capabilities we now support out of the box.

Batch, Assistants, and Files:
- Batch enables efficient parallel execution of multiple LLM calls, reducing latency and cost at scale.
- Assistants simplify orchestration of multistep AI workflows, enabling developers to build stateful, tool-augmented agents with memory.
- Files provide persistent storage for documents and context, allowing richer, more informed interactions with LLMs across sessions.
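To make this concrete, here's a minimal sketch of exercising the Files and Batch endpoints through the gateway with the OpenAI Python SDK. The base URL, API key, and file name are placeholders for whatever routes and credentials your gateway exposes:

```python
from openai import OpenAI

# Point the standard OpenAI client at a Kong AI Gateway route
# (placeholder URL and key; substitute your own route and credentials).
client = OpenAI(base_url="http://localhost:8000/openai", api_key="<gateway-key>")

# Upload a JSONL file of chat requests, then run them as a batch.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.status)
```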
Audio Transcription, Translation, and Speech API:
- Speech-to-text: Transcribe audio input to text for call summarization, voice agents, and meeting analysis.
- Real-time translation: Convert spoken input across languages, enabling multilingual voice interfaces.
- Text-to-speech: Synthesize natural-sounding audio from LLM responses to power voice-based agents.
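As a quick sketch (placeholder route, key, and file names again), a voice round trip might transcribe a recording and then synthesize a spoken reply:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/openai", api_key="<gateway-key>")

# Speech-to-text: transcribe a recorded call.
transcript = client.audio.transcriptions.create(
    model="whisper-1", file=open("call.wav", "rb")
)

# Text-to-speech: turn a short reply into audio.
speech = client.audio.speech.create(
    model="tts-1", voice="alloy", input=f"Here's a summary: {transcript.text[:200]}"
)
with open("reply.mp3", "wb") as f:
    f.write(speech.read())
```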
Image Generation and Edits API:
- Image generation: Generate images from text prompts for creative, marketing, and design applications.
- Image editing: Modify existing images using instructions and masks, useful for dynamic content workflows.
- Multimodal agents: Equip agents with visual input/output capabilities to enhance UX and task range.
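A minimal image-generation sketch through the gateway; the route and key are placeholders, and the underlying model is whatever your provider route is configured for:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/openai", api_key="<gateway-key>")

# Generate an image from a text prompt; the gateway route decides which
# provider/model actually serves the request.
result = client.images.generate(
    prompt="A minimalist hero banner with a hexagonal mesh motif",
    size="1024x1024",
    response_format="b64_json",
)
with open("banner.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```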
Realtime API:
- Streaming completions: Stream token-by-token output for fast, interactive user experiences.
- Low latency: Reduce time-to-first-token and improve perceived responsiveness in chat UIs.
- Analytics: Monitor streaming behavior and performance metrics.
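Streaming through the gateway looks just like streaming against the provider directly. A sketch, with placeholder route, key, and model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/openai", api_key="<gateway-key>")

# Stream tokens as they are produced to cut time-to-first-token in chat UIs.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain semantic caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```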
Responses API: Enhanced response introspection
- Response metadata: Access logprobs, function calls, and tool usage for each LLM output.
- Debugging and evaluation: Enable advanced observability and response-level quality checks.
- Control and tuning: Use metadata to build reranking, retries, or hybrid generation strategies.
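A minimal sketch of the Responses shape through a placeholder gateway route (this assumes a reasonably recent openai client):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/openai", api_key="<gateway-key>")

# The Responses API returns structured output items (text, tool calls, etc.)
# alongside the final text, which is handy for evaluation, retries, and reranking.
resp = client.responses.create(
    model="gpt-4o-mini",
    input="List three risks of unguarded LLM outputs.",
)
print(resp.output_text)      # convenience accessor for the final text
for item in resp.output:     # inspect individual output items
    print(item.type)
```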
Rerank API:
- Contextual reranking: Improve relevance of retrieved documents and results in RAG pipelines.
- Flexible inputs: Send any list of candidates to be re-ordered based on prompt context.
- Improved accuracy: Boost final LLM response quality through better grounding.
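A sketch only: the route path and payload below are placeholders modeled on common rerank APIs (a query plus candidate documents), so check the docs for the exact schema of your configured provider:

```python
import requests

# Re-order candidate documents by relevance to the query before sending them
# to the LLM as grounding context. Route path and payload are placeholders.
resp = requests.post(
    "http://localhost:8000/rerank",
    headers={"Authorization": "Bearer <gateway-key>"},
    json={
        "query": "How do I rotate API keys?",
        "documents": [
            "Keys can be rotated from the admin console.",
            "Our office is open Monday to Friday.",
            "Rotation invalidates the old key after 24 hours.",
        ],
    },
)
print(resp.json())
```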
AWS Bedrock Agent APIs:
- Converse / ConverseStream: Execute step-by-step agent plans with or without streaming for advanced orchestration.
- RetrieveAndGenerate: Combine retrieval with generation in one API call for simplified RAG.
- RetrieveAndGenerateStream: Stream RAG results for real-time agent experiences.
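A hedged sketch of a Converse-style call through the gateway; the route path is a placeholder, and the body mirrors the role-plus-content-blocks shape of Bedrock's Converse request:

```python
import requests

# Single-turn Converse request proxied through a placeholder gateway route.
resp = requests.post(
    "http://localhost:8000/bedrock/converse",
    headers={"Authorization": "Bearer <gateway-key>"},
    json={
        "messages": [
            {"role": "user", "content": [{"text": "Summarize our return policy."}]}
        ]
    },
)
print(resp.json())
```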
Generate and Generate_Stream API:
- Generate: Use open-source models for text generation across tasks and industries.
- Generate Stream: Stream text outputs in real-time for chat and live inference use cases.
- Open model ecosystem: Leverage the flexibility of Hugging Face’s vast library of models.
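A sketch against a placeholder route fronting a Hugging Face text-generation model, using the familiar inputs-plus-parameters payload:

```python
import requests

# Text generation via an open model behind the gateway (placeholder route).
resp = requests.post(
    "http://localhost:8000/huggingface/generate",
    headers={"Authorization": "Bearer <gateway-key>"},
    json={
        "inputs": "Write a one-line product tagline for an API gateway.",
        "parameters": {"max_new_tokens": 40},
    },
)
print(resp.json())
```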
Embeddings API:
- Text-to-embedding conversion: Transform text into vector representations for semantic search, clustering, recommendations, and RAG.
- Multivendor support: Use OpenAI, Azure, Cohere, Mistral, Gemini, and Bedrock embeddings with a unified interface, including all OpenAI-compatible models.
- Analytics: Track token usage, similarity scoring, and latency metrics for observability.
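And a minimal embeddings sketch, again with placeholder route, key, and model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/openai", api_key="<gateway-key>")

# Convert text into vectors for semantic search, clustering, or RAG retrieval.
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=["reset my password", "How do I recover account access?"],
)
print(len(emb.data), len(emb.data[0].embedding))
```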
Introducing a new prompt compression plugin
With generative AI applications becoming more pervasive, the volume of requests to LLMs increases, and costs rise in proportion. As with any cost to our business, we must look for efficiency savings. LLM costs are typically based on token usage — the longer the prompt, the more tokens are consumed per request. Prompts will often contain padding or redundant words or phrases that can be removed or shortened while retaining the semantic intent of the request.
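As a rough illustration (our own toy example, not output from the plugin), compare a padded prompt with a compressed version that keeps the instruction intact:

```python
# Illustrative only: word counts stand in for real tokenizer counts.
verbose = (
    "Could you please, if possible and when you have a moment, write a short "
    "and concise summary of the following customer review, focusing only on "
    "the most important points that the reviewer makes?"
)
compressed = (
    "Write a concise summary of the following customer review, "
    "focusing on the most important points."
)

print(len(verbose.split()), "->", len(compressed.split()))  # roughly a 2x reduction
```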

In an example like this, the token count is roughly halved, and you can control the level of compression or set a target token count. Our testing has shown that this approach can achieve up to a 5x cost reduction while keeping 80% of the intended semantic meaning of the original prompt.
Take a look at the docs for more examples.
In real-world usage, prompts are much larger, and automatic context injection (system prompts, or injected Retrieval-Augmented Generation (RAG) context) makes them larger still. This additional context can also be compressed. In fact, our testing has shown that compressing the context while retaining the original prompt's fidelity can provide an optimal balance between cost reduction and intent retention.
This complements other cost-saving measures already available in Kong, such as Semantic Caching, which avoids hitting the LLM service when a similar request has already been answered, and AI Rate Limiting, which can set time-based token or cost limits per application, team, or user.
Introducing AWS Bedrock Guardrails support
It is well understood that generative AI applications can sometimes produce unpredictable outputs, and confidence in an application can quickly be eroded by a few missteps. You need to be able to keep your AI-driven applications "on topic", block profanity and other undesirable language, redact personally identifiable information, and reduce hallucinations. You need guardrails.
Today, with Kong AI Gateway, you can already redact PII data with our built-in PII Sanitizer plugin and keep prompts on topic with Semantic Prompt Guard. We also support policies that reach out to Azure AI Content Safety, Azure's managed guardrails service.
Now we're announcing support for AWS Bedrock Guardrails as well, to help protect your AI applications against a wide range of both malicious and unintended behavior. You can find more examples in the docs.
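As a sketch of what enabling this could look like via the Admin API (the plugin name and config fields below are illustrative placeholders; the docs have the exact schema shipped with 3.11):

```python
import requests

# Attach a Bedrock guardrails policy to an existing LLM service via the
# Kong Admin API. Plugin name and config keys are illustrative placeholders.
requests.post(
    "http://localhost:8001/services/llm-service/plugins",
    json={
        "name": "ai-aws-guardrails",              # placeholder plugin name
        "config": {
            "guardrail_id": "<bedrock-guardrail-id>",
            "guardrail_version": "1",
            "aws_region": "us-east-1",
        },
    },
)
```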
As a product owner, you can continue to monitor applications, make incremental improvements in quality, and react immediately by adjusting policies, all without any changes to your application code. Kong AI Gateway helps you keep risks in check and increase confidence in the rollout of AI-driven applications and innovation.
Visualize your AI traffic with the new AI Manager
We also recently introduced a new AI Manager in Konnect, which lets you easily expose LLMs for consumption by your AI agents, and govern, secure, and observe LLM traffic through a brand-new user interface, straight from your browser.
With AI Manager you can:
- Manage AI policies via Konnect: Govern, secure, accelerate, and observe AI traffic in a self-managed — or fully managed — AI infrastructure that's easy to deploy.
- Curate your LLM catalog: See what LLMs are available for consumption by AI agents and applications, with custom tiers of access and governance controls.
- Visualize the agentic map: Observe at any given time what agents are consuming the LLMs you've decided to expose to the organization.
- Observe LLM analytics: Measure token, cost, and request consumption with custom dashboards and insights for fine-grained understanding of your AI traffic.

Read more about the new AI Manager here.
Get started with Kong AI Gateway today
Ready to try out the new release of Kong AI Gateway? You can get started for FREE with Konnect Plus. If you already have a Konnect account, visit the official product page or dive straight into the demos and tutorials.
Want to learn more about moving past the AI experimentation phase and into production-ready AI systems? Check out this webinar on how to drive real AI value with state-of-the-art AI infrastructure.