Context‑Aware LLM Traffic Management with RAG and AI Gateway
Orchestrate RAG on Kubernetes with KAITO and Kong AI Gateway to enable semantic routing, cost‑aware load balancing, observability, and in‑cluster control.
Learn how to route context-aware LLM traffic on Kubernetes using Retrieval-Augmented Generation (RAG) with KAITO and Kong AI Gateway. We cover semantic routing, cost/latency-aware load balancing, in-cluster control, and observability for production GenAI.
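To make "semantic routing" concrete before you watch: the core idea is to embed each incoming prompt and forward it to the upstream whose description it most resembles. Below is a minimal, illustrative sketch of that pattern; the embedding model name, route labels, and route descriptions are assumptions for the example, not part of KAITO's or Kong's APIs.

```python
# Minimal sketch of semantic routing: embed the prompt, then pick the
# upstream whose description vector it is most similar to.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical routes and descriptions; a real deployment would tune these.
ROUTES = {
    "code-model": "programming, debugging, code review, software APIs",
    "general-model": "general knowledge, writing, summarization, chat",
}

_encoder = SentenceTransformer("all-MiniLM-L6-v2")
_route_vecs = {name: _encoder.encode(desc) for name, desc in ROUTES.items()}


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def route_prompt(prompt: str) -> str:
    """Return the route whose description is most similar to the prompt."""
    q = _encoder.encode(prompt)
    return max(_route_vecs, key=lambda name: _cosine(q, _route_vecs[name]))


print(route_prompt("Why does my Go program deadlock on this channel?"))
# -> "code-model", since the prompt sits closest to that route's description
```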
What you’ll learn:
- Why RAG reduces hallucinations vs. fine-tuning (pattern sketched after this list)
- KAITO RAG Engine CRDs: indexes, nodes, embeddings, vector DB
- In-cluster model hosting and OpenAI-compatible endpoints (client sketch below)
- Kong AI Gateway: rate limiting, weighted/semantic load balancing, fallbacks (weighted-fallback idea illustrated below)
- Observability and governance across LLM endpoints
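
The hallucination claim rests on a simple mechanism: retrieve relevant passages first, then instruct the model to answer only from them, rather than from fine-tuned parametric memory. Here is a minimal sketch of that retrieve-then-generate loop; `retrieve` and `generate` are hypothetical stand-ins for the vector-DB query and the model call, and the corpus snippets are invented for illustration.

```python
# Minimal sketch of the RAG pattern behind "fewer hallucinations":
# ground the answer in retrieved passages instead of model memory.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Stand-in for a vector-DB similarity search; a real engine embeds the
    # query and returns the k nearest indexed chunks.
    corpus = [
        "Internal doc: the gateway enforces per-consumer token rate limits.",
        "Internal doc: indexes are rebuilt nightly from the docs bucket.",
    ]
    return corpus[:k]


def generate(prompt: str) -> str:
    # Stand-in for the model call (e.g. the OpenAI-compatible client below).
    return f"[model answer grounded in a {len(prompt)}-char prompt]"


def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    # Constraining the model to the retrieved context is what limits
    # hallucination, versus relying on fine-tuning alone.
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)


print(answer("What are our rate limits?"))
```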
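Because in-cluster model servers expose OpenAI-compatible endpoints, any OpenAI SDK client can talk to them by overriding the base URL. A minimal sketch follows; the Service hostname, model name, and the assumption that the endpoint ignores the API key are all illustrative, so substitute whatever your own workspace actually exposes.

```python
# Minimal sketch: calling an in-cluster, OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    # Hypothetical cluster-internal Service URL; replace with your own.
    base_url="http://workspace-llm.default.svc.cluster.local/v1",
    api_key="unused-in-cluster",  # assumption: this endpoint ignores the key
)

resp = client.chat.completions.create(
    model="phi-3-mini-128k-instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize our RAG architecture."}],
)
print(resp.choices[0].message.content)
```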
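Weighted load balancing with fallbacks is easiest to grasp as client logic before seeing it as gateway configuration. The sketch below shows the underlying idea only; it is not Kong's API, and the upstream names, weights, and `send` callable are assumptions for the example.

```python
# Conceptual sketch of weighted selection with a fallback, the pattern an
# AI gateway applies when balancing cheap vs. expensive model upstreams.
import random
from typing import Callable

# Hypothetical upstreams; weights encode a cost/latency preference.
UPSTREAMS = [
    ("cheap-small-model", 0.7),
    ("expensive-large-model", 0.3),
]
FALLBACK = "cheap-small-model"


def pick_upstream() -> str:
    """Weighted random choice across the configured upstreams."""
    names, weights = zip(*UPSTREAMS)
    return random.choices(names, weights=weights, k=1)[0]


def call_with_fallback(prompt: str, send: Callable[[str, str], str]) -> str:
    """Try the weighted pick first; on failure, retry against the fallback."""
    primary = pick_upstream()
    try:
        return send(primary, prompt)
    except Exception:
        return send(FALLBACK, prompt)


# Usage: `send` would wrap a real request, e.g. the OpenAI-compatible
# client above pointed at the chosen upstream.
```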