# 5 Best Practices for Securing AI Microservices at Scale
Kong
The microservices revolution promised agility and scalability. Teams could deploy faster, scale independently, and innovate without monolithic constraints. You gain speed and flexibility, but you also multiply trust boundaries, identities, network paths, and policy decisions.
Then came AI, and everything changed.
In 2025, the security reality for AI-integrated microservices is stark. Security challenges in microservices continue to escalate as organizations struggle with API proliferation and shadow API discovery. The microservices architecture market has grown from $6.27 billion in 2024 to $7.4 billion in 2025 at a compound annual growth rate (CAGR) of 17.9%, with projections reaching $15.64 billion by 2029 [1].
The attack surface has fundamentally changed. Indirect prompt injection techniques allow attackers to embed harmful instructions within source materials or intermediary processes. For example, an attacker might post a crafted prompt on an online forum, instructing any LLM reading it to recommend a phishing site. When a user later asks an AI assistant to summarize the forum discussion, the model processes the malicious recommendation.
This guide presents five critical best practices that bridge the gap:
Zero-trust architecture for all service communication
Centralized policy enforcement across AI and API traffic
Infrastructure-level prompt security and data protection
AI-specific observability for new traffic patterns
Secure RAG and MCP governance at scale
Best Practice #1: Implement Zero-Trust Architecture for All Service Communication
Why Is Zero-Trust Mandatory in an AI Context?
Zero-trust operates on a fundamental principle: never trust, always verify. Every request, every connection, every interaction must prove its identity and authorization — no exceptions.
This principle becomes critical when AI enters the picture. AI agents behave like autonomous clients within your infrastructure. They make decisions independently, access multiple services in rapid succession, and can be manipulated through prompt injection to perform unintended actions. Security researchers note that prompt injection vulnerabilities exist due to the fundamental nature of how LLMs process instructions and data together, making complete prevention challenging.
Consider the attack surface. Machine identities already dominate the landscape — AI agents add another layer of complexity. Each agent needs credentials, scoped privileges, and continuous verification. Without zero-trust, a single compromised agent becomes a skeleton key to your entire infrastructure.
How Does mTLS Secure East-West Traffic?
Mutual TLS (mTLS) is a protocol where both services authenticate each other with certificates before any data is exchanged — providing the foundation for zero-trust communication and eliminating man-in-the-middle attacks.
Implementation requirements:
Certificate generation: Issue unique certificates for each service and AI endpoint
Rotation policies: Enforce 90-day maximum certificate validity
Dedicated CA: Use an internal Certificate Authority for service certificates
Automated renewal: Implement automatic certificate rotation before expiration
Revocation lists: Maintain CRLs for compromised credentials
mTLS becomes especially critical for AI workloads. When an LLM endpoint communicates with a vector database for RAG retrieval, both sides must verify identity. When an AI agent calls internal APIs, every connection needs encryption and authentication.
Why Isn't IP-Based Access Control Enough for AI Services?
IP addresses mean nothing in containerized environments — services spin up, scale, and disappear constantly, making cryptographic identity the only reliable foundation for access control.
Each service, including AI agents and LLM endpoints, needs its own cryptographic identity. Apply the principle of least privilege rigorously:
AI-specific requirements:
Service accounts: Issue unique identities for AI agents with time-bound credentials
Dynamic authorization: Adapt permissions based on context, not just identity
Short-lived tokens: Limit AI agent credentials to 15–30 minute lifespans
Scope limitations: Restrict read/write access based on actual requirements
The importance of identity-based controls is underscored by real-world incidents where services trusted network location over verified identity, leading to exploits.
How Does a Service Mesh Help Enforce Zero-Trust at Scale?
A service mesh automates zero-trust enforcement across all microservices by injecting sidecar proxies alongside each workload — eliminating the inconsistency that comes from managing these controls manually.
These proxies handle:
Automatic mTLS: Encryption and authentication without code changes
Certificate management: Automated rotation and distribution
Policy enforcement: Consistent access control across all services
Traffic management: Load balancing and circuit breaking
For hybrid and multi-cloud deployments common in APAC, a service mesh ensures consistency. The same zero-trust policies apply whether services run in Kubernetes, VMs, or legacy infrastructure.
Kong Mesh provides turnkey zero-trust capabilities:
Apply once. Every new microservice and AI pod inherits zero-trust automatically.
Key Insight: "If zero-trust secures the connection, centralized policy secures the behavior."
Best Practice #2: Centralize Security Policy Enforcement Across AI and API Traffic
Why Do Fragmented Security Controls Create Blind Spots?
When different traffic types are managed by different tools, the result is inconsistent enforcement, multiple dashboards with no shared visibility, and exploitable gaps between them.
Organizations typically manage traffic in silos. User-facing APIs go through an API gateway. LLM endpoints hide behind custom authentication. Internal gRPC calls bypass both. Add AI endpoints to this, and the complexity becomes exponential.
Different teams implement different controls:
Platform teams secure user APIs with OAuth and rate limiting
ML teams build custom authentication for LLM endpoints
Data teams use another approach for internal microservices
This fragmentation creates exploitable weaknesses. Attackers find the least protected endpoint and pivot from there.
Should AI Traffic Be Treated Differently from API Traffic?
No — AI traffic is API communication that requires the same rigorous controls. Whether the caller is a human, microservice, or AI agent, consistent policies must apply.
Essential controls for all traffic:
Authentication & Authorization
OAuth 2.0/OIDC for external clients
Service-to-service authentication via mTLS or JWT
API keys with rotation for AI services
Role-based access control with fine-grained permissions
Rate Limiting & Governance
Request limits for traditional APIs (100 requests/minute)
Token budgets for LLM endpoints (100,000 tokens/hour)
Concurrency caps for internal services
Cost controls preventing runaway AI spending
Audit & Compliance
Comprehensive logging for all requests
Correlation IDs across service calls
Regulatory compliance tracking
Anomaly detection and alerting
What Does a Unified Control Plane Give You?
A unified control plane is a single enforcement layer that applies the same policies, logging, and rules engine across all traffic types — eliminating the inconsistency of tool-per-team approaches.
Managing policies across diverse services and protocols demands centralization. A unified control plane provides:
Single source of truth: One dashboard for all policies
Consistent enforcement: Same rules engine for all traffic types
Unified logging: Centralized audit trails
Cross-cutting concerns: Authentication, rate limiting, and monitoring in one place
Organizations investing in unified security platforms prevent vulnerabilities more effectively than those using fragmented tools.
[Kong AI Gateway](https://konghq.com/products/kong-ai-gateway)Kong AI Gateway extends existing API gateway capabilities to AI traffic seamlessly. Your existing security policies automatically apply to AI endpoints, and AI-specific controls like prompt guards and token management layer on top — without creating new silos.
Best Practice #3: Enforce Prompt Security and Data Protection at the Infrastructure Layer
What New Attack Paths Does AI Introduce?
Unlike traditional API attacks, AI-specific threats operate at the semantic level — manipulating the meaning of inputs rather than exploiting authentication or protocol weaknesses, which makes them invisible to standard security tooling.
Traditional API security validates authentication, checks authorization, and logs requests — but it cannot inspect the semantic content of prompts. As of mid-2025, OWASP's Top 10 for LLM Applications ranks prompt injection as LLM01, identifying it as a critical vulnerability capable of causing sensitive data disclosure, unauthorized function access, and arbitrary command execution. Because prompt injection exploits the stochastic nature of how models process input, fool-proof prevention remains an open problem.
Modern attacks include:
Prompt injection: Override system instructions with malicious commands
Context poisoning: Inject false data into RAG retrievals
Data exfiltration: Trick models into revealing training data or context
Semantic manipulation: Alter model behavior through carefully crafted inputs
How Should Prompt Security Be Enforced Before Inputs Reach the Model?
Infrastructure-level prompt security acts as a first line of defense by inspecting and sanitizing prompts before they reach the LLM — stopping attacks at the boundary rather than relying on the model to reject them.
Pattern Detection
Block known injection patterns ("ignore previous instructions")
Identify system prompt extraction attempts
Detect role-playing manipulations
Flag suspicious instruction sequences
Semantic Analysis
Use secondary models to classify prompt intent
Detect output format manipulation attempts
Identify unauthorized information requests
Score prompts for malicious probability
Context Isolation
Separate user input from system instructions
Use structured prompt templates
Implement clear data source delimiters
Prevent external data injection
Security research has identified over 30 distinct prompt injection techniques across ecosystems. Infrastructure-level guards must evolve continuously to counter these threats.
How Can Sensitive Data Be Protected Automatically in AI Pipelines?
Automatic data protection applies PII detection and redaction at both the input and output layer — preventing sensitive information from reaching the model and blocking it from appearing in generated responses.
Without these controls, sensitive information leaks through model outputs, often without any visible signal that a violation has occurred.
Pre-processing Protection
Detect and mask credit cards, SSNs, and passport numbers
Redact emails and phone numbers
Tokenize personally identifiable information
Replace names with placeholders
Post-processing Validation
Scan generated text for leaked PII
Block responses containing sensitive patterns
Alert on potential exfiltration attempts
Maintain comprehensive audit logs
For APAC organizations, automatic PII protection is also a compliance requirement. Regulations such as Australia's Privacy Act 1988 and Japan's APPI mandate appropriate technical and organizational safeguards for personal information — with specific controls determined by each organization's risk assessment.Centralize These Controls
Best Practice #4: Extend Observability to Capture AI-Specific Traffic Patterns
Why Does Traditional APM Miss AI Security Signals?
Standard monitoring tools track latency, errors, and throughput — but AI workloads produce a different class of signal that these tools are not designed to capture.
Critical AI-specific signals that standard APM misses include:
Security professionals increasingly recognize that agentic AI and autonomous systems present significant attack vectors. You can't defend against what you can't see.
What Metrics Should You Track for AI Observability?
AI observability requires a distinct set of metrics covering both security signals and operational health — standard infrastructure metrics alone are insufficient.
How Does Observability Serve Both Security and Cost Governance?
Enhanced AI visibility simultaneously detects security threats and prevents runaway costs — the same signals that indicate an attack often indicate a budget problem.
Security Detection
Token spikes reveal data exfiltration attempts
Repeated prompts indicate attack reconnaissance
Unusual agent patterns suggest compromise
Service call patterns expose lateral movement
Cost Control
Identify expensive prompts for optimization
Detect runaway agents before budget drain
Track feature ROI with precision
Allocate costs to teams accurately
According to IBM's 2025 Cost of a Data Breach Report, the global average cost of a data breach fell 9% to $4.44 million — but US-specific costs rose 9% to a record $10.22 million, driven by increased regulatory fines. Faster identification and containment, enabled by better observability, directly impacts where your organization lands in that range.
Why Is End-to-End Tracing Critical for AI Request Flows?
Without end-to-end tracing, a single AI request that spans multiple services becomes impossible to debug — and during a breach, impossible to investigate before damage is done.
A typical AI request flow touches multiple layers:
User submits prompt → API gateway
Gateway validates and enriches the request
Request routes to the appropriate LLM
LLM triggers the RAG pipeline
Multiple services provide context
Response flows back through the gateway
Post-processing and filtering applied
Final response delivered
Without correlation IDs linking each step, tracing a single session across dozens of services becomes extremely difficult — potentially allowing attackers to complete data exfiltration before patterns emerge.
Best Practice #5: Secure RAG Pipelines and MCP Servers Centrally
How Do RAG and MCP Expand the AI Attack Surface?
RAG pipelines and MCP servers dramatically increase internal service communication — each introducing new endpoints, retrieval paths, and autonomous agent actions that fall outside traditional API security coverage.
RAG (Retrieval-Augmented Generation)
Each query triggers vector database searches
Multiple knowledge bases provide context
Embeddings retrieve sensitive documents
Results aggregate before LLM processing
MCP (Model Context Protocol)
Agents autonomously discover internal tools
Each tool represents an API endpoint
Agents chain multiple tools together
Actions execute without human oversight
Security research highlights the risk of malicious actors impersonating trusted tools or injecting poisoned prompt templates that silently alter agent behavior — a supply chain vulnerability that remains a significant challenge for many organizations.
How Should RAG Retrieval Paths Be Secured?
Every RAG endpoint requires the same authentication and authorization controls applied to external APIs — unauthenticated retrieval paths are an exploitable boundary between user input and sensitive document stores.
Authentication Requirements
Service accounts for vector database access
Short-lived OAuth2 tokens
Certificate-based authentication for internal calls
API key rotation policies
Authorization Controls
Row-level security on documents
User context propagation through the pipeline
Attribute-based access control (ABAC)
Cross-tenant isolation
Data Protection
Encryption at rest for embeddings
TLS 1.3 for retrieval communications
Comprehensive audit logging
Data lineage tracking
Why Do MCP Servers Need the Same Governance as APIs?
MCP tools are API endpoints — they expose internal capabilities to autonomous agents and must be governed with the same authentication, rate limiting, and audit controls applied to any external-facing API.
"The more powerful your AI agent, the more disciplined your control plane needs to be."
How Does Kong Centralize RAG and MCP Security?
Kong AI Gateway secures both RAG pipelines and MCP servers through a single control plane — converting existing APIs into governed MCP tools and applying retrieval policies once across all AI applications.
Conclusion: Extend, Don't Replace, Your Security Fundamentals
The AI revolution doesn't require abandoning proven security principles. It demands extending them to address new challenges.
As of mid-2025, these five practices represent the current best-practice consensus for AI-era microservices security. The threats are real and growing, with security research continuing to identify new prompt injection patterns and attack techniques that bypass existing defenses.
But so are the defenses. By implementing zero-trust architecture, centralizing policy enforcement, protecting against prompt injection, extending observability, and securing advanced AI patterns, organizations build resilient systems that capture AI's benefits while managing its risks.
Remember: AI spotlights where existing security is incomplete, not invalid. Organizations that recognize this distinction and act on it will thrive in the AI era.
Next Steps
Ready to secure your AI-augmented microservices at scale? Kong's unified API platform provides the comprehensive foundation you need.
What is zero-trust architecture and why does it matter for AI microservices? Zero-trust is a security model that requires every request, connection, and interaction to verify its identity and authorization — no implicit trust is granted based on network location. For AI microservices, this is critical because AI agents act as autonomous clients that make independent decisions, access multiple services rapidly, and are vulnerable to prompt injection. Without zero-trust, a single compromised agent can move laterally across your entire infrastructure.
What is prompt injection and why can't traditional security tools stop it? Prompt injection is an attack where malicious input manipulates an LLM into overriding its instructions, disclosing sensitive data, or executing unintended actions. OWASP ranks it as the top critical vulnerability for LLM applications in 2025. Traditional security tools operate at the protocol layer — they validate authentication and log requests, but cannot inspect the semantic content of a prompt, making these attacks invisible to standard tooling.
How is AI traffic different from regular API traffic? Functionally, it isn't — and that's the point. AI traffic is API communication and should be governed with the same controls: authentication, rate limiting, audit logging, and access control. The difference is that AI traffic introduces additional signals to monitor, such as token consumption, prompt patterns, and model routing behavior, and additional attack vectors like context poisoning and semantic manipulation.
What is mTLS and when should it be used in AI infrastructure? Mutual TLS (mTLS) is a protocol where both the client and server authenticate each other with certificates before exchanging data. Unlike standard TLS, which only authenticates the server, mTLS provides bidirectional identity verification. It should be used for all east-west (service-to-service) traffic in AI infrastructure — particularly between LLM endpoints, vector databases, and internal APIs.
What is a RAG pipeline and what are its security risks? Retrieval-Augmented Generation (RAG) is a pattern where an LLM queries external knowledge sources — typically vector databases — to supplement its responses with relevant context. Each retrieval introduces an unauthenticated boundary between user input and sensitive documents. Security risks include context poisoning (injecting false data into retrievals), unauthorized document access through insufficient authorization controls, and PII exposure through unredacted embeddings.
What is MCP and why does it need API-level governance? Model Context Protocol (MCP) is a standard that allows AI agents to autonomously discover and invoke internal tools. Each tool is an API endpoint exposed to an agent that can chain multiple tools together and execute actions without human oversight. Without governance, MCP tools become uncontrolled access points into internal systems — requiring the same authentication, rate limiting, and audit controls as any external-facing API.
Why is traditional APM insufficient for AI workloads? Standard application performance monitoring tracks latency, errors, and throughput. AI workloads generate a different class of signal: token consumption anomalies, abnormal prompt sequences, agent concurrency patterns, and model routing behavior. These signals indicate both security threats and runaway costs, but are invisible to tools not designed to capture them.
How should organizations handle PII in AI pipelines? PII protection must be applied at both the input and output layer. Pre-processing should detect and mask sensitive data — credit card numbers, SSNs, emails — before prompts reach the model. Post-processing should scan generated responses for leaked PII and block any output containing sensitive patterns. For organizations in APAC, this is also a regulatory requirement under frameworks such as Australia's Privacy Act 1988 and Japan's APPI.
What credentials should AI agents use, and how long should they be valid? Each AI agent should have a unique service account with a cryptographic identity and scoped permissions limited to its actual operational requirements. Credentials should be short-lived — 15 to 30 minutes maximum — and automatically rotate. This limits the blast radius of a compromised agent identity.
What does a unified control plane provide that separate security tools don't? A unified control plane enforces the same policies, logging, and rules engine across all traffic types from a single location. Separate tools create inconsistent enforcement, multiple dashboards with no shared visibility, and exploitable gaps between teams. A unified approach means the same authentication, rate limiting, and audit controls apply whether traffic is coming from a human, a microservice, or an AI agent.
The Anatomy of Architectural Complexity
Modern architectures now juggle three distinct traffic patterns. Each brings unique demands. Traditional approaches treat them separately. This separation creates unnecessary complexity.
North-South API Traf
Claude Code is Anthropic's agentic coding and agent harness tool. Unlike traditional code-completion assistants that suggest the next line in an editor, Claude Code operates as an autonomous agent that reads entire codebases, edits files across mult
Why Risk Management Will Separate Agentic AI Winners from Agentic AI Casualties
Let's be honest about what's happening inside most enterprises right now. Development teams are under intense pressure to ship AI features. The mandate from leadership
MCP servers expose all tools by default. There are two problems with this: security (agents get capabilities they shouldn't have) and performance (too many tools degrade LLM tool selection). The solution? Put a gateway between agents and MCP server
Executive Summary
AI adoption has moved past the "honeymoon phase" and into the "operational chaos" phase. As enterprises juggle multiple LLM providers, skyrocketing token costs, and "Shadow AI" usage, the need for a centralized control plane has be
Introduction to OWASP Top 10 for LLM Applications 2025 The OWASP Top 10 for LLM Applications 2025 represents a significant evolution in AI security guidance, reflecting the rapid maturation of enterprise AI deployments over the past year. The key up
Imagine you have a single Service, order-api . You want to apply a strict rate limit to most traffic, but you want to bypass that limit—or apply a different one—if the request contains a specific X-App-Priority: High header. Previously, you had t