• Explore the unified API Platform
        • BUILD APIs
        • Kong Insomnia
        • API Design
        • API Mocking
        • API Testing & Debugging
        • MCP Client
        • RUN APIs
        • API Gateway
        • Context Mesh
        • AI Gateway
        • Event Gateway
        • Kubernetes Operator
        • Service Mesh
        • Ingress Controller
        • Runtime Management
        • DISCOVER APIs
        • Developer Portal
        • Service Catalog
        • MCP Registry
        • GOVERN APIs
        • Metering & Billing
        • Analytics
        • APIOps & Automation
        • API Observability
        • Why Kong?
      • CLOUD
      • Cloud API Gateways
      • Need a self-hosted or hybrid option?
      • COMPARE
      • Considering AI Gateway alternatives?
      • Kong vs. Postman
      • Kong vs. MuleSoft
      • Kong vs. Apigee
      • Kong vs. IBM
      • GET STARTED
      • Sign Up for Kong Konnect
      • Documentation
  • Agents
      • FOR PLATFORM TEAMS
      • Developer Platform
      • Kubernetes & Microservices
      • Observability
      • Service Mesh Connectivity
      • Kafka Event Streaming
      • FOR EXECUTIVES
      • AI Connectivity
      • Open Banking
      • Legacy Migration
      • Platform Cost Reduction
      • Kafka Cost Optimization
      • API Monetization
      • AI Monetization
      • AI FinOps
      • FOR AI TEAMS
      • AI Cost Control
      • AI Governance
      • AI Integration
      • AI Security
      • Agentic Infrastructure
      • MCP Production
      • MCP Traffic Gateway
      • FOR DEVELOPERS
      • Mobile App API Development
      • GenAI App Development
      • API Gateway for Istio
      • Decentralized Load Balancing
      • BY INDUSTRY
      • Financial Services
      • Healthcare
      • Higher Education
      • Insurance
      • Manufacturing
      • Retail
      • Software & Technology
      • Transportation
      • See all Solutions
      • DOCUMENTATION
      • Kong Konnect
      • Kong Gateway
      • Kong Mesh
      • Kong AI Gateway
      • Kong Insomnia
      • Plugin Hub
      • EXPLORE
      • Blog
      • Learning Center
      • eBooks
      • Reports
      • Demos
      • Customer Stories
      • Videos
      • EVENTS
      • AI + API Summit
      • Webinars
      • User Calls
      • Workshops
      • Meetups
      • See All Events
      • FOR DEVELOPERS
      • Get Started
      • Community
      • Certification
      • Training
      • COMPANY
      • About Us
      • Why Kong?
      • We're Hiring!
      • Press Room
      • Investors
      • Contact Us
      • PARTNER
      • Kong Partner Program
      • SECURITY
      • Trust and Compliance
      • SUPPORT
      • Enterprise Support Portal
      • Professional Services
      • Documentation
      • Press Releases

        Kong Names Bruce Felt as Chief Financial Officer

        Read More
  • Pricing
  • Login
  • Get a Demo
  • Start for Free
Blog
  • AI Gateway
  • AI Security
  • AIOps
  • API Security
  • API Gateway
|
    • API Management
    • API Development
    • API Design
    • Automation
    • Service Mesh
    • Insomnia
    • View All Blogs
  1. Home
  2. Blog
  3. Learning Center
  4. What is AI Observability? Monitoring and Troubleshooting Your LLM Infrastructure
Learning Center
February 27, 2026
23 min read

What is AI Observability? Monitoring and Troubleshooting Your LLM Infrastructure

Kong

Your traditional monitoring dashboards may show perfect infrastructure health, but that doesn't guarantee your large language model (LLM) application is running correctly. AI observability is the crucial fourth pillar that adds behavioral telemetry to detect and troubleshoot unique issues like hallucinations, policy violations, and runaway costs in real time.

Key takeaways

  • AI observability extends traditional monitoring by adding behavioral telemetry for quality, safety, and cost metrics alongside standard logs, metrics, and traces
  • Time-to-First-Token (TTFT) and token usage metrics are critical performance indicators for LLM systems
  • OpenTelemetry GenAI conventions provide standardized telemetry for portable AI observability
  • Retrieval-Augmented Generation (RAG) systems require specialized monitoring for retrieval effectiveness and grounding accuracy
  • Security and compliance frameworks like OWASP Top 10 for LLMs and NIST AI RMF guide responsible AI deployment
what is AI Observability

When "success" doesn't actually guarantee success

Your dashboards scream 99.9% uptime. Latency stays well under 200ms. Error rates dwell at zero. These operational metrics look perfectly all right.

Yet customer complaints continue to hit all-time highs. Your AI assistant prescribes incorrect medical dosages. Your cloud bill tripled overnight. While traditional monitoring looks perfect, reality paints a different picture.

That’s the paradox of LLM operations. Your system runs flawlessly, going by every classic metric. But it hallucinates facts, violates safety policies, and burns through budgets at an alarming rate.

Between traditional monitoring and reality sits AI observability. Traditional monitoring tracks infrastructure health. It can't detect hallucinations, runaway costs, or policy violations. AI observability adds behavioral telemetry on top of standard logs, metrics, and traces.

Though traditional monitoring tells you the patient has a pulse, AI observability tells you if the diagnosis is correct, the treatment is affordable, and the hospital isn't leaking data.

In this post, you'll learn:

  • What AI observability means and how it extends traditional practices
  • Essential metrics like Time-to-First-Token (TTFT) and groundedness scores
  • Practical troubleshooting for latency spikes, cost explosions, and quality issues
  • Implementation strategies using OpenTelemetry GenAI conventions
  • Security and compliance monitoring for AI systems

Defining AI observability: Beyond traditional monitoring

AI observability is the disciplined collection and analysis of telemetry across all AI system components to fully comprehend performance, cost, quality, and safety in real time.

This practice enables you to answer critical questions, such as:

  • Is the response factually accurate?
  • How much did this conversation cost?
  • Does the output comply with safety policies?
  • Why did the model behave this way?

Modern LLM applications face unique challenges. They are non-deterministic, resource-hungry, and challenging to test and control. The standard monitoring approaches fall short in monitoring them.

How different is AI observability from traditional observability?

Traditional observability works well for deterministic systems — environments where the same input reliably produces the same output. AI systems challenge these assumptions in several important ways.

Non-deterministic outputs
With LLMs, the same prompt can generate different responses, especially when temperature settings are above zero. This variability makes simple pass/fail checks ineffective. Instead, teams need continuous quality assessment and probabilistic evaluation frameworks to understand whether outputs meet expectations.

Datadog notes these applications present "challenges due to the complexity of LLM chains, their non-deterministic nature and the security risks they pose".[1] Every response requires individual quality scoring. In practice, this means every response must be evaluated on its own merits rather than assumed to be correct.

Continuous evaluation over binary checks
Traditional systems typically either work or fail. LLMs operate on a spectrum of performance. A response may be technically successful yet factually incorrect, or safe to deliver but unnecessarily expensive. Success is no longer a single metric — it spans quality, safety, cost, and user experience.

Emergent behaviors
LLMs can develop unexpected capabilities and failure modes once deployed. As Neptune AI observes: "When an LLM application returns an unexpected response or fails with an error, we're often left in the dark".[2] Models drift, behaviors evolve, and subtle performance shifts often go undetected by classic dashboards.

Evolution from three to four pillars

Traditional observability rests on three pillars:

  1. Logs: Timestamped event records
  2. Metrics: Quantifiable time-series data
  3. Traces: End-to-end request journeys

AI systems demand a fourth pillar: behavioral signals.

Behavioral signals capture dimensions that infrastructure metrics alone cannot, such as quality, safety, and cost. Datadog’s LLM observability, for example, "provides end-to-end tracing across AI agents with visibility into inputs, outputs, latency, token usage, and errors at each step".

The fourth pillar encompasses the following.

  • Quality metrics, such as accuracy, coherence, and relevance
  • Safety signals, including policy violations, toxicity, and injection attempts
  • Cost intelligence, such as token usage patterns and budget tracking
  • Behavioral patterns, including drift detection and anomaly monitoring

The building blocks of LLM telemetry

While traditional observability signals remain crucial, AI systems generate unique behaviors that require domain-specific measurement. These AI-specific signals move beyond infrastructure health to provide deep visibility into response quality, cost, and user impact, ensuring your LLMs perform reliably and effectively in production.

Traditional signals

Foundational observability signals remain essential for AI systems. While the technology is evolving rapidly, these core signals continue to provide the visibility teams need to operate reliably at scale.

Logs/events
Logs tell the story of how your AI system behaves in production. Capture key details such as model identifiers, timestamps, and request-response pairs to create a clear operational record. Before storing any data, make sure Personally Identifiable Information (PII) is properly redacted to reduce privacy and compliance risk. It’s equally important to retain error messages and system state changes, as these signals are often critical during troubleshooting.

Developers "gain insight by recording prompts and user feedback, tracing user requests through the components, monitoring latency and API usage". Structuring logs for searchability and analysis makes it much easier to investigate incidents, identify patterns, and continuously improve system performance.

Metrics
Metrics provide quantifiable indicators of system health and help teams understand how AI workloads perform under real-world conditions. Track resource utilization across:

  • CPU/GPU utilization
  • Memory footprint
  • Request rates and concurrency
  • Cache hit ratios
  • Queue depths
  • Infrastructure costs

Traces
Tracing maps the complete journey of every request, offering end-to-end visibility into how your AI system operates. Modern LLM applications rely on complex, multi-step workflows in which requests move through embedding generation, vector retrieval, model inference, and post-processing layers. Distributed tracing helps teams pinpoint bottlenecks, diagnose failures faster, and better understand how each component contributes to overall latency and reliability.

AI-specific signals

AI systems generate behaviors that traditional telemetry cannot fully capture. Domain-specific signals provide deeper visibility into how LLMs perform in production, helping teams move beyond infrastructure health to understand response quality, reliability, and user impact.

Performance metrics

Focus on metrics that directly shape the user experience. Monitoring these indicators helps teams detect slowdowns early, maintain responsiveness, and ensure interactions feel smooth and reliable as workloads scale.

  • Time-to-First-Token (TTFT) — This measures initial response latency. Anyscale defines it as "Initial latency before the first token appears". While a chatbot might require TTFT under 500 milliseconds to feel responsive, a code completion tool may need TTFT below 100 milliseconds for a seamless developer experience, according to Key metrics for LLM inference | LLM Inference Handbook.
  • Inter-token latency — Latency defines the flow of streamed tokens. When it rises, responses feel fragmented rather than fluid. Industry guidance typically points to sub-50ms latency for seamless experiences, but the optimal threshold depends on the use case.
  • End-to-end latency — This measures the total time from request to final response. As Anyscale explains, it "reflects the complete waiting time experienced by the user".
  • Queue times — This is the amount of time requests wait before processing begins. Elevated queue times often signal capacity constraints or concurrency limits.

Cost metrics

Token accounting underpins cost management. Most LLM providers charge 3-5X more for output tokens than for input tokens. This ratio reflects the computational difference between processing and generating text.

  • Token usage breakdown — Costs are split between input tokens from prompts and output tokens from completions. For example, Claude Sonnet 3.5 charges $3 per million input tokens and $15 per million output tokens, meaning a 5:1 ratio.
  • Cost attribution — Maintaining control over AI spend requires granular visibility across multiple dimensions, including cost per request, endpoint-level spending, user segment analysis, and feature-level cost attribution. These insights help teams identify inefficiencies, optimize usage, and prevent unexpected budget spikes.
  • Cache effectiveness — Evaluating cache performance is essential for optimizing both cost and system efficiency. Focus on metrics such as hit rates for semantic caching, percentage of computation reused, cost savings attributable to caching. Strong cache performance reduces unnecessary compute, accelerates responses, and improves overall operational economics.

Quality metrics

Evaluate how accurate and relevant your model’s responses are.

  • Groundedness and factual accuracy — This metric checks whether reliable source data support responses. It’s especially important for retrieval-augmented generation (RAG) systems, where answers must stay anchored to trusted information sources.
  • Coherence scores — This measures how logical and consistent the outputs are. Many teams rely on LLM-as-judge approaches or statistical methods to assess coherence.
  • Task completion — This metric indicates whether the AI successfully solved the user’s problem. Tracking success rates by use case helps teams understand where the system performs well and where improvements are needed.
  • Citation accuracy — For RAG systems, this verifies that references are both correct and relevant, helping reduce the risk of misleading or unsupported answers.

Safety metrics

Safety metrics help teams monitor policy compliance and protect systems from misuse.

  • Policy violation rates — This metric tracks how often content filters are triggered across different categories. Monitoring violations such as hate speech, violence, and other restricted content helps teams identify emerging risks.
  • Severity distributions — This measures the overall risk profile of incidents by separating high-severity issues from lower-impact ones. Understanding this distribution helps teams prioritize response efforts.
  • False positive analysis — This analysis identifies when filtering is overly aggressive. The goal is to maintain strong safety controls while preserving a positive user experience.
  • Prompt injection detection — This focuses on identifying attempts to manipulate the model. Monitoring suspicious prompt patterns helps teams respond quickly and strengthen system defenses.

OpenTelemetry GenAI conventions

The OpenTelemetry (OTel) community developed standards for AI observability. These conventions "establish a standard schema for tracking prompts, model responses, token usage, tool/agent calls, and provider metadata".[8]

The key attributes include:

  • gen_ai.system: LLM vendor (such as OpenAI, Anthropic)
  • gen_ai.request.model: Specific model requested
  • gen_ai.usage.input_tokens: Input token count
  • gen_ai.usage.output_tokens: Output token count
  • gen_ai.response.finish_reasons: Why the generation stopped

Example: Python implementation

1from opentelemetry import trace
2from opentelemetry.semconv.gen_ai import GenAIAttributes
3
4tracer = trace.get_tracer(__name__)
5
6with tracer.start_as_current_span("chat.completion") as span:
7    # Set GenAI attributes
8    span.set_attribute(GenAIAttributes.GEN_AI_SYSTEM, "openai")
9    span.set_attribute(GenAIAttributes.GEN_AI_REQUEST_MODEL, "gpt-4")
10    span.set_attribute(GenAIAttributes.GEN_AI_REQUEST_TEMPERATURE, 0.7)
11    
12    # Execute LLM call
13    response = llm.complete(prompt)
14    
15    # Record usage metrics
16    span.set_attribute(GenAIAttributes.GEN_AI_USAGE_INPUT_TOKENS, 
17                      response.usage.prompt_tokens)
18    span.set_attribute(GenAIAttributes.GEN_AI_USAGE_OUTPUT_TOKENS, 
19                      response.usage.completion_tokens)

This standardized approach ensures portability across observability platforms.

Essential metrics & SLIs for LLM systems

Maintaining financial sustainability requires granular cost-visibility and defining Service Level Objectives (SLOs) to keep spending predictable and aligned with business goals.

Performance metrics

Performance metrics help define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for your system. These targets should be adjusted to reflect the needs and expectations of your specific use case.

Example TTFT Targets

  • Interactive chat: <800ms P95 (illustrative target)
  • Voice assistants: <300ms P95 (illustrative target)
  • Code completion: <200ms P95 (illustrative target)

Research from Glean shows "for every additional input token, the P95 TTFT increases by ~0.24ms" 

Inter-token latency
Industry benchmarks provide useful guidance, but your targets should ultimately align with your application’s needs. Consider the following illustrative thresholds:

  • Streaming applications: <50ms average
  • Real-time translation: <40ms average

These targets directly influence perceived fluency, as lower latency makes interactions feel smoother and more responsive.

Throughput metrics
Throughput metrics help you understand how much workload your system can handle and how efficiently it processes requests. The key indicators include:

  • Tokens per second: Measures overall system capacity for generating and processing tokens.
  • Requests per second: Shows how well the system handles concurrent traffic.
  • Queue depth trends: Reveal whether requests are backing up, which may signal capacity constraints or traffic spikes.

Cost metrics

Maintaining financial sustainability requires granular cost-visibility. Organizations often define Service Level Objectives (SLOs) to keep spending predictable and aligned with business goals.

Example token accounting SLOs
The following examples illustrate how teams may set cost targets:

  • Average cost per request: <$0.05 (illustrative)
  • Daily spend: Within 10% of the allocated budget
  • Monthly infrastructure cost: <$10,000 (varies by scale)

Budget controls
Establishing guardrails helps prevent unexpected cost spikes. The best practices include:

  • Alerting at 50%, 75%, and 90% of budget thresholds
  • Implementing circuit breakers to limit runaway usage
  • Monitoring cost velocity to detect rapid spending increases

Optimization metrics
Tracking optimization metrics helps teams improve efficiency without sacrificing performance. Typical targets include:

  • Cache hit rate: >30% (varies by use case)
  • Prompt compression effectiveness: Measures how well prompts reduce token usage without degrading quality
  • Model routing efficiency: Evaluates whether requests are directed to the most cost-effective models

Quality metrics

Assessing quality remains essential and often challenging for AI systems. Clear Service Level Objectives (SLOs) help teams define acceptable performance and maintain user trust.

Example factual accuracy SLOs
These aspirational targets can vary widely depending on the model and use case:

  • Groundedness score: >0.9 for RAG responses
  • Hallucination rate: <5% for knowledge-based tasks
  • Citation accuracy: >95%

Recent benchmarks show meaningful variation in hallucination rates across leading models. As of December 2024, many state-of-the-art systems report rates below 5%. GPT-4o, for example, demonstrates a hallucination rate of approximately 1.5%, while Llama-3.1-405B-Instruct and Anthropic Claude 3.5 Sonnet report rates of 3.9% and 4.6%, respectively. However, even advanced models can exceed 15% hallucination rates when analyzing provided statements, underscoring the importance of evaluating performance within the context of your specific application, according to AI Hallucination: Compare top LLMs like GPT-5.2 in 2026.

User Experience Metrics
User experience metrics help teams understand how effectively AI meets user needs. While targets should reflect your specific application, the following examples provide useful reference points:

  • Relevance ratings: >4.0/5.0
  • Task success rate: >85%
  • Conversation completion: >70%

These indicators offer insight into overall satisfaction, usability, and the system’s ability to guide users toward successful outcomes.

Safety metrics

Safety metrics are essential for protecting users, maintaining compliance, and reducing organizational risk. Elastic highlights the importance of monitoring policy-based interventions and contextual grounding, with built-in support for tools such as Amazon Bedrock Guardrails, Azure AI Foundry, and Azure OpenAI content filters. Read more here.

Example policy compliance SLOs
The following targets illustrate how organizations often define safety expectations:

  • Zero high-severity violations in production
  • False positive rate: <2%
  • Prompt injection block rate: 100%

Security monitoring
In addition to policy compliance, teams should continuously watch for emerging threats, including:

  • Detection of unusual usage patterns
  • Data exfiltration attempts
  • Attack frequency trends

Proactive monitoring helps teams respond quickly, strengthen defenses, and maintain trust in AI systems.

Troubleshooting guide: Common LLM issues and how to diagnose

Effective LLM observability requires a focused strategy to monitor and troubleshoot critical areas like latency, cost, quality, and safety. Here are some common LLM issues and how to diagnose them.

Latency troubleshooting

Symptoms:
Latency problems often become apparent when time-to-first-token (TTFT) jumps from milliseconds to several seconds, making responses feel frozen to users.

Check these metrics:

Start by reviewing the following. 

  • Request queue times
  • GPU utilization patterns
  • Prompt length distributions
  • Concurrency levels

Together, these signals can help pinpoint where delays are taking place. 

Root causes:

  • Long prompts — Overly long prompts are a frequent source of latency. Research confirms "a linear relationship between prompt tokens and TTFT", with each additional token adding roughly 0.20–0.24 milliseconds of latency depending on the model and infrastructure. Solution: Implement prompt compression and set token limits by request type to reduce unnecessary delay. 
  • Traffic bursts — Sudden spikes in traffic can overwhelm available capacity and quickly increase response times. Solution: Using autoscaling with predictive policies helps systems scale ahead of demand, while request batching can improve efficiency during peak load.
  • Misconfigured concurrency — Inefficient GPU memory usage is often the result of poorly tuned concurrency settings. Solution: Adjust batch sizes and monitor memory fragmentation to ensure resources are being used effectively.

Cost management

Symptoms:
Cost issues often become apparent when daily token usage suddenly doubles, and budget alerts trigger earlier than expected. These signs usually indicate that consumption is outpacing projections and deserves immediate investigation.

Check these metrics:

Start by reviewing the following.

  • Input/output token ratios
  • Cache hit rates
  • Conversation lengths
  • Per-user token consumption

These metrics provide a clear picture of where tokens are being used inefficiently and can help pinpoint the source of unexpected spending.

Root causes:

Verbose system prompts — When system prompts are overly detailed, every request carries the full prompt cost, which can quickly inflate overall usage. Solution: Minimizing system prompts, enabling prompt caching, and applying compression techniques can significantly reduce token consumption, although results will vary by implementation.

Infinite loops — Failed requests can sometimes trigger retry storms, causing usage — and costs — to spike without delivering additional value. Solution: Implement exponential backoff, add circuit breakers, and monitor retry patterns to prevent runaway consumption.

Missing output limits — If output limits are not properly configured, models may generate the maximum number of tokens even when shorter responses would suffice. Solution: Set appropriate max_tokens for each use case and monitor actual versus requested usage to ensure tokens are being used efficiently.

Quality monitoring

Symptoms:
Quality issues often arise when groundedness scores start declining and users report receiving nonsensical or unreliable answers. These signals typically suggest that the model is drifting away from trusted data or that supporting systems are not performing as expected.

Check these metrics:

Start by reviewing the following.

  • Factual consistency trends
  • Model version changes
  • Retrieval accuracy (RAG)
  • Context window stats

Examining these indicators together can help teams determine whether the issue arises from the model itself, the retrieval layer, or the quality of the supplied context.

Root causes:

  • Model drift — Provider updates can alter model behavior in subtle or significant ways, sometimes affecting accuracy without immediate visibility. Solution: A/B testing model updates and maintaining regression test suites can help teams catch performance changes early and reduce the risk of unexpected quality degradation.
  • Retrieval breakdown — When a vector database returns irrelevant or low-quality content, the model is more likely to generate incorrect responses. Solution: Monitor retrieval relevance scores and track embedding performance to ensure the system continues to provide the most useful information.
  • Context contamination — Conflicting or low-quality information within the prompt can confuse the model and lead to inconsistent outputs. Solution: Implement context filtering and monitor diversity metrics to maintain cleaner, more reliable inputs.

Safety monitoring

Symptoms:
Safety concerns often become visible when policy violations spike and users begin reporting inappropriate or harmful content. These patterns typically indicate gaps in filtering, shifting user behavior, or emerging threats that require closer attention.

Check these metrics:

Start by reviewing the following.

  • Block reason distributions
  • Severity level trends
  • False positive rates
  • Time-of-day patterns

Root Causes:

  • Overly sensitive filters — Filters that are too aggressive may block legitimate content, creating unnecessary friction for users. Solution: Tune thresholds based on the specific use case and implement domain-specific allowlists to strike the right balance between safety and usability.
  • New attack vectors — As threat tactics evolve, new approaches may bypass existing safeguards. Solution: Regularly update filter rules and monitor threat patterns so that defenses remain effective against emerging risks.

Rate-limit optimization

Symptoms:
Rate-limit issues typically arise when 429 errors occur and response times degrade under heavier load. These signs often indicate that demand is exceeding system capacity or that requests aren’t being managed efficiently.

Check these metrics:

Start by reviewing the following.

  • Request concurrency versus capacity
  • Rate limit header usage
  • Memory and network saturation

These metrics can reveal whether constraints are infrastructure-related or due to traffic patterns that should be better controlled.

Root Causes:

  • Concurrency bottlenecks — When parallel processing capacity is insufficient, requests can queue up quickly and trigger rate limits. Solution: Scaling horizontally and optimizing request routing can distribute traffic more effectively and improve overall throughput.
  • Unoptimized batching — Poor batch formation can reduce processing efficiency and limit the benefits of parallel execution. Solution: Implement dynamic batching and group requests of similar lengths to maximize resource utilization and stabilize performance under load.

Deep diving into RAG observability

Monitoring retrieval effectiveness

Retrieval-augmented generation (RAG) systems require specialized observability to ensure the retrieval pipeline delivers relevant, high-quality context to the model. Without strong visibility into retrieval performance, even advanced models can produce weak or inaccurate responses.

Core metrics:

Several established metrics help teams evaluate how effectively their systems provide relevant information.

  • Recall@k: Percentage of relevant documents in top-k results
  • MRR@k: Mean reciprocal rank of first relevant document
  • nDCG@k: Normalized discounted cumulative gain

Performance targets:
In addition to relevance metrics, monitoring pipeline performance is critical for maintaining responsive user experiences. Teams often track embedding generation time, with targets typically below 50ms, and vector search latency, which commonly aims for under 100ms. It’s also important to watch reranking overhead, as excessive processing at this stage can offset the benefits of fast retrieval.

As EdenAI notes, "Retrieval performance in LLM observability focuses on evaluating the effectiveness of the retrieval component in RAG systems". Together, they help organizations ensure their retrieval layer supports accurate, timely, and contextually grounded AI responses.

Context quality assessment

Maintaining high-quality context is essential for reliable AI outputs, particularly in retrieval-augmented generation (RAG) systems. When the context supplied to a model is incomplete, redundant, or unsupported, response quality can quickly decline. Ongoing assessment helps ensure that retrieved information is relevant, trustworthy, and structured in a way the model can use effectively.

Grounding and citation tracking
Monitor whether retrieved documents genuinely support the claims generated in responses. Tracking citation accuracy and coverage helps confirm that answers are rooted in verifiable sources rather than inferred or fabricated information.

Document redundancy
Duplicate content can crowd the context window and reduce the visibility of unique, high-value information. Identify repeated documents, monitor source diversity, and track information density to ensure the model receives a balanced and comprehensive set of inputs.

Context window utilization
Measure how much of the available context window is actually being used and look for patterns where important information is ignored. Optimizing document ordering based on relevance can improve comprehension and help the model prioritize the most useful material.

Answer grounding and factuality

Hallucination detection
Compare generated statements against source documents and flag unsupported claims in real time. Early detection reduces the risk of misleading responses reaching users.

Citation accuracy
Verify that cited sources truly contain the information being referenced and monitor formatting for consistency. Strong citation practices reinforce trust and make responses easier to validate.

Security and compliance observability

OWASP Top 10 for LLMs

The OWASP Top 10 for LLM Applications highlights some of the most critical security risks organizations face when deploying large language models in production. Understanding these threats — and actively monitoring for them — helps teams strengthen defenses and reduce exposure to emerging attack patterns.

Monitor the following:

  • Prompt injection (LLM01) involves suspicious instruction patterns designed to manipulate model behavior or override safeguards. 
  • Model denial-of-service attacks (LLM04) attempt to exhaust system resources, potentially degrading performance or disrupting availability. 
  • Data leakage (LLM06) centers on the unintended exposure of sensitive information, which can create serious legal, financial, and reputational risks if left unchecked.

Fiddler AI emphasizes "LLMOps tackles the unique risks of deploying large language models in production—such as hallucinations, prompt injections, jailbreaks". Proactive monitoring allows organizations to detect these threats early and maintain safer, more resilient AI systems.

Governance frameworks

Strong governance frameworks help organizations manage risk, maintain compliance, and build trust in AI systems. Adopting established standards provides structure for responsible deployment while ensuring that oversight keeps pace with rapid technological change.

NIST AI risk management framework

The NIST AI Risk Management Framework encourages organizations to track fairness metrics, maintain transparency logs, and document risk assessments as part of a disciplined approach to AI oversight. These practices promote accountability and help teams identify potential issues before they escalate. The NIST AI RMF offers comprehensive guidance for implementing these controls.

ISO/IEC 42001:2023

Recognized as the world's first AI management system standard, ISO/IEC 42001:2023 establishes formal requirements for governing AI responsibly. Organizations are expected to maintain complete audit trails, document risks, validate system performance, and keep detailed stakeholder communication logs. These measures support operational transparency and reinforce regulatory readiness.

Data protection

Protecting sensitive data is a foundational requirement for responsible AI operations. Organizations should take a proactive approach to safeguarding information throughout its lifecycle, ensuring that security practices keep pace with growing data usage.

PII redaction
Automatically detect and mask personally identifiable information (PII) to reduce the risk of exposure. Tokenization can add another layer of protection by replacing sensitive references with secure placeholders while preserving analytical value.

Retention policies
Clear retention policies help prevent unnecessary data storage and reduce compliance risk. Define data lifecycles, implement automatic expiration, and continuously monitor adherence to ensure information is retained only as long as needed.

Prompt logging
Prompt logs support observability and troubleshooting, but they must be managed carefully to protect user privacy. Balance visibility with discretion by using sampling strategies and applying anonymization techniques wherever possible.

Implementation roadmap

Phase 1: Basics and quick wins (weeks 1 - 4)

The first phase focuses on establishing foundational telemetry so teams can quickly gain visibility into system behavior. Prioritizing these early improvements creates a strong operational baseline and makes it easier to scale observability as adoption grows.

Instrument LLM calls
Start by adding OpenTelemetry to all endpoints to ensure consistent tracing across requests. Capture key attributes such as the model being used, prompt length, and response length to build a reliable dataset for performance and cost analysis.

1from opentelemetry.instrumentation.openai import OpenAIInstrumentor
2
3# Auto-instrument OpenAI calls
4OpenAIInstrumentor().instrument()
5
6# Create custom metrics
7meter = metrics.get_meter(__name__)
8token_counter = meter.create_counter(
9    "llm.tokens.total",
10    unit="tokens"
11)

Core metrics setup
Implement essential monitoring capabilities early. Configure time to first token (TTFT) tracking, enable token counting, and create basic dashboards so teams can quickly identify trends and detect anomalies.

Log aggregation
Standardize the logging format to simplify analysis and improve cross-system visibility. Implement PII detection to reduce the risk of sensitive data exposure, and set up centralized log collection so operational insights are accessible from a single location.

Phase 2: Comprehensive coverage (weeks 5 - 12)

In this phase, the focus shifts from foundational visibility to deeper operational insight. Expanding your observability capabilities helps teams better understand system behavior, improve reliability, and manage risk as AI usage scales.

Quality monitoring

Strengthen quality oversight by implementing groundedness scoring to evaluate how well responses align with trusted sources. Add user feedback loops to capture real-world performance signals, and deploy A/B testing to compare model behavior and guide optimization decisions. As Datadog notes, their platform "now natively supports OpenTelemetry GenAI Semantic Conventions, allowing you to instrument your LLM applications once with OTel".

Safety and compliance

Enhance governance by deploying violation tracking to identify policy breaches early. Implement audit logging to support accountability and regulatory readiness, and add prompt injection detection to defend against manipulation attempts.

Dashboards and alerts

Translate insights into action by building SLO-driven dashboards that highlight system health and performance. Implement anomaly detection to catch unusual patterns promptly, and create team-specific dashboards so stakeholders can focus on the metrics most relevant to their roles.

Phase 3: Advanced automation (months 3 - 6)

By this stage, the goal is to focus on observability maturity through automation and continuous optimization. Advanced capabilities help teams respond faster to issues, improve system resilience, and support long-term scalability.

RAG pipeline instrumentation

Deepen visibility into the retrieval pipeline by adding retrieval-specific metrics, tracking how effectively the context window is utilized, and monitoring reranking performance. These insights help ensure the system consistently delivers relevant, high-quality information to the model.

Anomaly detection

Deploy machine learning–based detection to identify unusual patterns before they escalate into larger problems. Pair this with root cause analysis to accelerate troubleshooting, and build self-healing capabilities so certain issues can be resolved automatically without manual intervention.

Continuous evaluation

Establish a culture of ongoing improvement by creating regression tests that catch performance drift early. Deploy canary releases to validate changes in controlled environments, and build feedback loops that continuously inform tuning and optimization efforts.

Vendor-neutral focus

With a vendor-neutral approach, organizations can stay flexible as technologies evolve. By avoiding tight dependencies on a single provider, teams can adapt more easily to changing requirements, control costs, and adopt new capabilities without major rework.

  • Use OpenTelemetry standards: Adopting OpenTelemetry provides a consistent framework for collecting telemetry across tools and platforms. Standardization simplifies integration and makes it easier to shift between vendors when needed.
  • Implement standard exporters: Standard exporters ensure that telemetry data can flow seamlessly into different observability backends. This portability reduces friction during migrations and supports a more adaptable architecture.
  • Design for multi-vendor support: Building systems with multi-vendor compatibility in mind prevents lock-in and expands deployment options. It also helps choose the best tools for specific workloads rather than committing to a single ecosystem.
  • Plan for emerging architectures: AI infrastructure continues to evolve rapidly, so it’s important to design with future architectures in mind. Preparing for new patterns and technologies helps organizations remain resilient and ready to scale.

2026 outlook: Emerging trends in AI observability

As organizations continue to operationalize AI in 2026, observability is evolving from a technical capability into a strategic requirement. Several emerging trends are shaping how teams monitor performance, control risk, and scale AI responsibly.

  • Multi-modal observability: With the growing adoption of vision and audio models, observability must expand beyond text-based workloads. These systems introduce new telemetry requirements, making it important to track image token equivalents, monitor audio processing latency, and understand how different modalities impact overall performance.
  • Automated quality baselines: Teams are increasingly moving toward self-adjusting quality thresholds that adapt based on historical performance. Instead of relying solely on static benchmarks, modern systems can automatically detect deviations, flag degradations, and trigger investigations before users are affected.
  • Cost-aware orchestration: As AI spending rises, organizations are prioritizing smarter workload routing. Real-time orchestration enables dynamic decisions based on cost-performance trade-offs, allowing teams to shift between providers, model sizes, or inference strategies to maintain both efficiency and service quality.
  • Regulatory compliance evolution: New and expanding AI regulations in the United States and the European Union are driving the need for more standardized observability practices. For high-risk applications, detailed audit trails are increasingly becoming mandatory, pushing organizations to strengthen governance and documentation from the outset.

These trends signal a broader shift: AI observability is no longer optional infrastructure; it’s quickly becoming a core pillar of production-ready AI.

Real-world application: Writer’s journey with AI observability

To illustrate the transformative power of AI Observability, consider the experience of Writer, an enterprise-focused generative AI platform. When Writer set out to launch their AI Studio—a suite designed to streamline the creation of generative AI applications—they faced the daunting task of revamping their API infrastructure to accommodate public access.

Their challenge was multifaceted: they needed to implement a public Gate API capable of handling rate limits, security requirements, and seamless integration with existing internal platforms. This was where Kong Konnect came into play, offering a robust solution that not only met these requirements but also enhanced their API management strategy.

Jack Viers, a Backend Developer at Writer, shared insights into how Kong Konnect's fully managed services streamlined their deployment process. The integration enabled Writer to maintain public availability while ensuring security and flexibility through customizable plugins like KeyAuth and ACL, which catered to their fine-grained capability needs. The ability to abstract and simplify their underlying platforms through Kong's intuitive interface proved invaluable, allowing Writer to focus on delivering a top-tier user experience.

The results were significant. Writer achieved a seamless deployment with minimal operational costs and enhanced security, crucial for the success of AI Studio. Viers emphasized the importance of Kong's comprehensive support, which enabled Writer to swiftly address integration challenges, ensuring a smooth rollout of their new product.

This customer story underscores the critical role of AI Observability and specialized tools like AI Gateways in managing complex LLM deployments. It exemplifies how leveraging these technologies can transform potential obstacles into opportunities, driving innovation while maintaining reliability and security. Writer's story is a testament to the strategic advantage of embracing AI Observability, not just to meet current demands, but to future-proof AI initiatives in an ever-evolving technological landscape.

Conclusion: Why AI observability is non-negotiable

Traditional monitoring tells you whether your system is running. AI observability tells you whether it is running correctly. As organizations move from experimentation to production, this distinction becomes critical.

The complexity of large language models (LLMs) demands a new operational mindset. Non-deterministic outputs, fluctuating response quality, and evolving safety risks require specialized monitoring that goes far beyond infrastructure health. Without comprehensive observability, teams lack the visibility needed to make informed decisions and operating AI without that visibility is essentially flying blind.

Business impact

  • Cost control: Strong observability helps teams detect token-burning loops before they drain budgets, monitor spending patterns, and optimize resource usage. With clearer financial insight, organizations can scale AI responsibly without sacrificing predictability.
  • Trust preservation: Unchecked hallucinations and inconsistent outputs can quickly erode user confidence. Observability enables teams to maintain reliable performance, deliver consistent quality, and protect the brand reputation they have worked hard to build.
  • Regulatory readiness: As AI regulations continue to emerge, organizations must be prepared to demonstrate accountability. Observability supports compliance by enabling audit trails, improving transparency, and reinforcing responsible AI practices.

The path forward

Organizations deploying LLMs need observability strategies now, not later. In many cases, the difference between simply having AI capabilities and operating production-ready AI comes down to observability maturity.

Take the following steps now:

  1. Instrument your LLM endpoints with OpenTelemetry.
  2. Set up cost and quality dashboards.
  3. Define initial SLOs for performance and safety (adjusted for your use case)
  4. Begin collecting baseline measurements.

Scale systematically

Follow a phased implementation roadmap, invest in team training, and build alignment across engineering, operations, security, and leadership. Adopting emerging standards will further strengthen your foundation and help future-proof your architecture.

Hidden complexity has a way of becoming tomorrow’s crisis if left unaddressed. Implementing robust AI observability today gives organizations the confidence to innovate at scale while maintaining safety, quality, and cost control.

Ready to transform your LLM observability strategy? Explore how Kong’s AI Gateway delivers rate limiting, analytics, and centralized observability to support AI workloads at scale.

References

  1. Datadog. (2024). "Datadog LLM Observability Is Now Generally Available." Datadog Press Release. https://datadog.gcs-web.com/news-releases/news-release-details/datadog-llm-observability-now-generally-available-help/
  2. Neptune AI. (2024). "LLM Observability: Fundamentals, Practices, and Tools." https://neptune.ai/blog/llm-observability
  3. Datadog. (2024). "LLM Observability Platform." https://www.datadoghq.com/product/llm-observability/
  4. Anyscale. (2024). "Understand LLM latency and throughput metrics." Anyscale Documentation. https://docs.anyscale.com/llm/serving/benchmarking/metrics
  5. BentoML. (2024). "Key metrics for LLM inference." LLM Inference Handbook. https://bentoml.com/llm/inference-optimization/llm-inference-metrics
  6. TGT. (2024). "Observations About LLM Inference Pricing." https://techgov.intelligence.org/blog/observations-about-llm-inference-pricing
  7. MobiSoft Infotech. (2024). "LLM API Pricing Guide: Costs, Token Rates & Models." https://mobisoftinfotech.com/resources/blog/ai-development/llm-api-pricing-guide
  8. Datadog. (2024). "Datadog LLM Observability natively supports OpenTelemetry GenAI Semantic Conventions." Datadog Blog. https://www.datadoghq.com/blog/llm-otel-semantic-convention/
  9. Glean. (2024). "How input token count impacts the latency of AI chat tools." Glean Blog. https://www.glean.com/blog/glean-input-token-llm-latency
  10. HalluLens. (2024). "HalluLens: LLM Hallucination Benchmark." arXiv. https://arxiv.org/html/2504.17550v1
  11. Research AI Multiple. (2024). "AI Hallucination: Compare top LLMs." https://research.aimultiple.com/ai-hallucination/
  12. Elastic. (2024). "LLM Observability - Monitor AI Safety & Performance." https://www.elastic.co/observability/llm-monitoring
  13. EdenAI. (2024). "Top 9 Observability Platforms for LLMs: Unlocking Advanced Monitoring for AI Systems." https://www.edenai.co/post/top-5-paid-observability-platforms-for-llms-unlocking-advanced-monitoring-for-ai-systems
  14. OWASP. (2023). "OWASP Top 10 for Large Language Model Applications v1.1." https://genai.owasp.org/2023/10/18/llm-to-10-v1-1/
  15. Fiddler AI. (2024). "LLM Observability Platform." https://www.fiddler.ai/llmops
  16. NIST. (2023). "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10
  17. ISO. (2023). "ISO/IEC 42001:2023 Artificial intelligence — Management system." https://www.iso.org/standard/42001
  18. OpenTelemetry. (2024). "Semantic Conventions for GenAI operations." https://opentelemetry.io/docs/specs/semconv/gen-ai/
ObservabilityAIAI GatewayAI SecurityAIOps

Table of Contents

  • Key takeaways
  • Defining AI observability: Beyond traditional monitoring
  • The building blocks of LLM telemetry
  • Essential metrics & SLIs for LLM systems
  • Troubleshooting guide: Common LLM issues and how to diagnose
  • Deep diving into RAG observability
  • Security and compliance observability
  • Implementation roadmap
  • 2026 outlook: Emerging trends in AI observability
  • Real-world application: Writer’s journey with AI observability
  • Conclusion: Why AI observability is non-negotiable
  • References

More on this topic

eBooks

The AI Connectivity Playbook: How to Build, Govern & Scale AI

Videos

From APIs to AI Agents: Building Real AI Workflows with Kong

See Kong in action

Accelerate deployments, reduce vulnerabilities, and gain real-time visibility. 

Get a Demo
Topics
ObservabilityAIAI GatewayAI SecurityAIOps
Kong

Recommended posts

API Gateway vs. AI Gateway

Learning CenterNovember 3, 2025

The Gateway Evolution An unoptimized AI inference endpoint can burn through thousands of dollars in minutes. This isn't hyperbole. It's the new reality of artificial intelligence operations. When GPT-4 processes thousands of tokens per request, tradi

Kong

What is AI Governance? 2026 Framework Guide

Learning CenterJanuary 2, 2026

AI governance establishes the principles, roles, processes, and controls for responsible AI deployment. It transforms abstract ethics into concrete practices. Think of ​​AI governance as a rulebook for how to use AI in a secure, ethical, observable,

Kong

AI Voice Agents with Kong AI Gateway and Cerebras

EngineeringNovember 24, 2025

Kong Gateway is an API gateway and a core component of the Kong Konnect platform . Built on a plugin-based extensibility model, it centralizes essential functions such as proxying, routing, load balancing, and health checking, efficiently manag

Claudio Acquaviva

AI Guardrails: Ensure Safe, Responsible, Cost-Effective AI Integration

EngineeringAugust 25, 2025

Why AI guardrails matter It's natural to consider the necessity of guardrails for your sophisticated AI implementations. The truth is, much like any powerful technology, AI requires a set of protective measures to ensure its reliability and integrit

Jason Matis

Securing Enterprise AI: OWASP Top 10 LLM Vulnerabilities Guide

EngineeringJuly 31, 2025

Introduction to OWASP Top 10 for LLM Applications 2025 The OWASP Top 10 for LLM Applications 2025 represents a significant evolution in AI security guidance, reflecting the rapid maturation of enterprise AI deployments over the past year. The key up

Michael Field

How to Master AI/LLM Traffic Management with Intelligent Gateways

EnterpriseMay 26, 2025

As businesses increasingly harness the power of artificial intelligence (AI) and large language models (LLMs), a new challenge emerges: managing the deluge of AI requests flooding systems. This exponential growth in AI traffic creates what could be

Kong

Securing, Observing, and Governing MCP Servers with Kong AI Gateway

Product ReleasesApril 24, 2025

The explosion of AI-native applications is upon us. With each new week, massive innovations are being made in how AI-centric applications are being built. There are a variety of tools developers need to consider, be it supplying live contextual data

Greg Peranich

Ready to see Kong in action?

Get a personalized walkthrough of Kong's platform tailored to your architecture, use cases, and scale requirements.

Get a Demo
Powering the API world

Increase developer productivity, security, and performance at scale with the unified platform for API management, AI gateways, service mesh, and ingress controller.

Sign up for Kong newsletter

    • Platform
    • Kong Konnect
    • Kong Gateway
    • Kong AI Gateway
    • Kong Insomnia
    • Developer Portal
    • Gateway Manager
    • Cloud Gateway
    • Get a Demo
    • Explore More
    • Open Banking API Solutions
    • API Governance Solutions
    • Istio API Gateway Integration
    • Kubernetes API Management
    • API Gateway: Build vs Buy
    • Kong vs Postman
    • Kong vs MuleSoft
    • Kong vs Apigee
    • Documentation
    • Kong Konnect Docs
    • Kong Gateway Docs
    • Kong Mesh Docs
    • Kong AI Gateway
    • Kong Insomnia Docs
    • Kong Plugin Hub
    • Open Source
    • Kong Gateway
    • Kuma
    • Insomnia
    • Kong Community
    • Company
    • About Kong
    • Customers
    • Careers
    • Press
    • Events
    • Contact
    • Pricing
  • Terms
  • Privacy
  • Trust and Compliance
  • © Kong Inc. 2026