Maersk Deploys AI First Responder to Boost Global API Reliability With Kong


A global logistics leader operating across 130+ countries built Stargate, an AI-driven incident triage engine that cuts time-to-detection and first-response from 20 minutes to seconds

  • 3,500+ developers supported
  • 1,500+ APIs managed
  • 4,000+ downstream consumers reliant on 24X7 uptime

Managing incidents in seconds, not minutes

A.P. Moller - Maersk is one of the world’s largest shipping and logistics companies, serving more than 100,000 customers and digitally connecting global trade across 130+ countries.

Company

www.maersk.com/
Industry
  • Logistics
Region
  • Europe
Use Case
  • Performance
  • Security
  • Development Productivity
Customer Since 2019
Background

Maersk runs one of the largest API ecosystems in the world

When most people imagine Maersk, they see ships, containers, and ports. But behind the physical movement of goods sits a deeply complex digital logistics fabric. Every booking, customs clearance, yard movement, inventory update, and customer status check happens through APIs.

“We are a technology-driven logistics company. Global trade today is connected through data platforms, telemetry, and APIs, and those APIs need to be fast, available, and extremely reliable.”

Manish S V Kumar
Senior Engineering Manager, A.P. Moller - Maersk

To power this digital backbone, Maersk has built a distributed infrastructure that handles traffic from thousands of customers across the world. Each incoming request travels a sophisticated multi-layer path:

  • Edge DNS (Akamai) for bot protection, WAF, and routing
  • Kong data planes deployed across regions
  • QMA service mesh sidecars inside gateways for service discovery
  • Backend applications that process business logic — from booking to customs to container tracking

This mesh-enabled, multi-region setup gives Maersk the scale and redundancy required for global operations. But it also magnifies the operational challenge: dozens of systems must work in perfect synchronization for a single API call to succeed.

To manage this complexity, Maersk built a developer platform that abstracts away operational burden. Developers can onboard APIs, configure Akamai, provision gateway workspaces, register services into the mesh, and attach monitoring — all automatically.

The platform accelerated software delivery dramatically. But running a system at Maersk’s scale introduced new operational risks. And those risks started showing up during incidents.

Challenge

When APIs run global trade, every minute of detection delay hurts

Maersk’s developer platform supports:

  • 3,500 developers
  • 1,500+ APIs
  • 4,000+ downstream consumers running critical supply chain flows 24X7

At this scale, failures aren’t hypothetical; they’re inevitable.

“The question is not if something will fail. What happens when something fails? How fast can you detect, triage, and route it?”

Manish S V Kumar
Senior Engineering Manager, A.P. Moller - Maersk

Like many enterprises with large distributed systems, Maersk saw three interconnected operational pain points.

1. Slow MTTD due to noisy, fragmented signals

When an incident occurred, such as a sudden spike in 503 responses, the first 15–20 minutes were typically spent just understanding what had happened.

  • Command centers pulled in teams from all layers
  • Gateway engineers joined bridges even when the gateway wasn’t at fault
  • Teams scrambled to extract logs
  • No one knew whether failure was in the gateway, service mesh, DNS, or downstream service

Logs contained 80+ attributes, some custom, some standard. They weren’t immediately interpretable by engineers outside the core platform team. Extracting relevant information during an incident was slow manual work.

“Teams often didn’t know whether the problem was in the gateway or somewhere downstream. Just identifying the proxy ID took time,” Kumar said. 

As a result, MTTD ballooned, especially during high-pressure scenarios.

2. High cognitive load on SREs and platform engineers

Senior engineers found themselves pulled into nearly every incident call.

“We get pulled into every bridge no matter where the fault is,” said Abhishek Das, Senior Software Engineer at A.P. Moller - Maersk. “We realized many incidents had nothing to do with the gateway. But by the time you figure that out, 20 minutes are gone.”

This constant involvement created:

  • Developer fatigue
  • Burnout from late-night on-call
  • Lost engineering productivity
  • A “hero culture” dependence on experts who knew how to interpret the logs

The organization needed a way to automate early triage, reduce noise, and let humans focus on higher-value remediation.

3. Lack of automated blast-radius and ownership analytics

During an outage, business teams care about questions like:

  • How many customers are impacted?
  • What revenue or SLAs are at risk?
  • Who owns the failing service?

But Maersk’s system didn't automatically reveal:

  • Which consumers were affected
  • Which API was behind an endpoint path
  • Who the responsible backend owners were

“We needed to automatically identify the blast radius, which consumers or flows were impacted, because that determines the business response,” Das said.

The friction was felt everywhere

This friction wasn’t just technical but cultural:

  • Engineers lacked visibility and felt “blind” during incidents
  • On-call rotations became draining
  • Stakeholders lacked real-time insight into business impact
  • Platform teams were overwhelmed by repetitive triage tasks

The team asked themselves if they could build an agent that becomes a first responder, something that does the first-level triage automatically. That question led to Stargate.

Solution

Stargate, an AI-powered first responder for API incidents

Maersk created Stargate, an operational AI agent that automates the full first-level triage process: detection, RCA, blast-radius analysis, and routing. It is not a chatbot. It's an orchestrated, event-driven, multi-agent system integrated deeply into Maersk’s developer platform and observability stack. And it runs in seconds.

1. Autonomous detection powered by observability

Stargate begins with event-driven detection. Whenever an API is onboarded, Maersk’s platform automatically:

  • Configures gateway routes
  • Registers the service in the mesh
  • Sets up Akamai configs
  • Creates alerting rules in Grafana and Prometheus

So when an API misbehaves, Grafana sends an alert to a webhook that the agent listens on. This webhook call triggers the entire automated investigation.
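As a rough illustration of the trigger step, the sketch below filters an incoming alert payload down to the alerts that need investigation. It assumes the payload resembles Grafana's alerting webhook format (an `alerts` list whose items carry `status`, `labels`, and `startsAt`). The `api_path` label and the `firing_alerts` function are hypothetical names, not details of Stargate.

```python
from typing import Any


def firing_alerts(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Return one triage request per firing alert in a webhook payload."""
    requests = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no new investigation
        labels = alert.get("labels", {})
        requests.append({
            "alert_name": labels.get("alertname", "unknown"),
            "api_path": labels.get("api_path", "unknown"),  # hypothetical label
            "started_at": alert.get("startsAt", ""),
        })
    return requests


# Hand-written sample payload for illustration only.
sample = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "High503Rate", "api_path": "/bookings"},
            "startsAt": "2025-01-01T00:00:00Z",
        },
        {"status": "resolved", "labels": {"alertname": "HighLatency"}},
    ],
}
print(firing_alerts(sample))
```

Each returned item would then be handed to the orchestrator as the starting point of an investigation.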

2. Orchestration engine that gathers the ground truth

Once triggered, the Stargate orchestrator:

  • Fetches proxy metadata (proxy ID, API name, mapping)
  • Queries service mesh health
  • Retrieves backend application status
  • Looks up ownership and on-call teams
  • Identifies consumers of the failing API

This data comes from Maersk’s internal metadata APIs, which in turn talk to:

  • Kong Admin APIs
  • QMA service mesh
  • Maersk’s developer platform
  • Hedwig (on-call management platform)
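One way to picture the orchestration step is a function that fans out to several lookups and merges the answers into a single context object. The sketch below is an assumption-heavy illustration: every fetcher is a hypothetical stub returning canned data, and the source does not say whether Stargate runs these lookups concurrently.

```python
import asyncio


# Hypothetical stubs. Real versions would call internal metadata APIs.
async def fetch_proxy_metadata(api_path: str) -> dict:
    return {"proxy_id": "proxy-123", "api_name": "bookings"}


async def fetch_mesh_health(api_path: str) -> dict:
    return {"mesh_healthy": False}


async def fetch_backend_status(api_path: str) -> dict:
    return {"backend_up": True}


async def fetch_ownership(api_path: str) -> dict:
    return {"owner_team": "team-bookings"}


async def fetch_consumers(api_path: str) -> dict:
    return {"consumers": ["consumer-a", "consumer-b"]}


async def gather_context(api_path: str) -> dict:
    """Run all lookups concurrently and merge them into one context dict."""
    parts = await asyncio.gather(
        fetch_proxy_metadata(api_path),
        fetch_mesh_health(api_path),
        fetch_backend_status(api_path),
        fetch_ownership(api_path),
        fetch_consumers(api_path),
    )
    context = {"api_path": api_path}
    for part in parts:
        context.update(part)
    return context


context = asyncio.run(gather_context("/bookings"))
print(context)
```

Running the lookups concurrently keeps total latency close to that of the slowest call, which matters when the goal is a summary within seconds.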

“An agent is only as good as its integrations,” Das said. This integration layer is what gives the agent enterprise-grade context.

3. Two-agent RCA model: Log Analyzer + RCA Synthesizer

Stargate uses a two-agent architecture.

Agent 1: Log Analyzer

Logs contain more than 80 attributes — too many to send directly to the LLM. So before analysis:

  • Sensitive data is stripped out (keys, IPs, etc.)
  • Only relevant attributes are extracted, such as timestamp, trace IDs, status codes, API path, total request time, total proxy time, total target time
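A minimal sketch of this pre-processing step, assuming an allowlist approach: only named attributes survive, so anything sensitive is dropped by default, and IPv4 addresses inside free-text messages are masked. The attribute spellings other than `total_target_time`, and the `compact_log` function itself, are illustrative assumptions.

```python
import re

# Hypothetical attribute names; only an allowlisted subset is kept.
ALLOWED_ATTRIBUTES = (
    "timestamp", "trace_id", "status_code", "api_path",
    "total_request_time", "total_proxy_time", "total_target_time",
    "error_message",
)
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def compact_log(entry: dict) -> dict:
    """Keep allowlisted attributes and mask IPv4 addresses in messages."""
    compact = {key: entry[key] for key in ALLOWED_ATTRIBUTES if key in entry}
    message = compact.get("error_message")
    if isinstance(message, str):
        compact["error_message"] = IP_PATTERN.sub("[ip]", message)
    return compact


# Hand-written sample entry for illustration only.
raw = {
    "timestamp": "2025-01-01T00:00:00Z",
    "status_code": 503,
    "api_path": "/bookings",
    "total_target_time": 0,
    "client_ip": "203.0.113.7",
    "api_key": "secret",
    "error_message": "upstream 10.0.0.5 unreachable",
}
print(compact_log(raw))
```

An allowlist is the safer default here: a newly added log attribute stays out of the model's input until someone explicitly decides it is both relevant and non-sensitive.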

These custom metrics are crucial:

  • total_target_time = 0 means request never reached the backend
  • proxy time vs. target time reveals where latency is introduced
  • error messages indicate DNS or mesh failures
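The timing rules above can be expressed as a small decision function. This is a simplified sketch, not Stargate's logic: it assumes both values are in milliseconds, that proxy time means time spent inside the gateway, and that target time means time spent waiting on the backend. The function name and return labels are hypothetical.

```python
def locate_fault(total_proxy_time_ms: float, total_target_time_ms: float) -> str:
    """Guess where a slow or failed request spent its time."""
    if total_target_time_ms == 0:
        # The request never reached the backend, so look at the
        # gateway, DNS, or service mesh first.
        return "before_backend"
    if total_target_time_ms >= total_proxy_time_ms:
        return "backend"  # most of the time was spent waiting on the target
    return "gateway"  # most of the time was spent inside the proxy


print(locate_fault(12, 0))     # before_backend
print(locate_fault(15, 900))   # backend
print(locate_fault(800, 40))   # gateway
```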

The team explicitly instructs the model on what each attribute means. Providing these definitions together with worked examples in the prompt, a technique called few-shot learning, improved agent accuracy from ~50% to ~80%.
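To make the idea concrete, the sketch below assembles a prompt that contains attribute definitions followed by a couple of labeled examples. The wording of the definitions, the example logs, and the `build_prompt` function are all assumptions made for illustration. The actual prompts Maersk uses are not described in this story.

```python
import json

# Illustrative definitions; the first one paraphrases the rule stated above.
ATTRIBUTE_DEFINITIONS = {
    "total_target_time": "Time spent waiting on the backend. 0 means the request never reached it.",
    "total_proxy_time": "Time spent inside the gateway.",
    "status_code": "HTTP status returned to the consumer.",
}

# Hand-written few-shot examples: (log, expected diagnosis).
EXAMPLES = [
    ({"status_code": 503, "total_proxy_time": 12, "total_target_time": 0},
     "Request never reached the backend; check DNS or mesh."),
    ({"status_code": 504, "total_proxy_time": 15, "total_target_time": 30000},
     "Backend is slow or unresponsive."),
]


def build_prompt(log: dict) -> str:
    """Build a few-shot prompt: definitions, examples, then the new log."""
    lines = ["You analyze API gateway access logs.", "", "Attribute definitions:"]
    for name, meaning in ATTRIBUTE_DEFINITIONS.items():
        lines.append(f"- {name}: {meaning}")
    lines += ["", "Examples:"]
    for example_log, diagnosis in EXAMPLES:
        lines.append(f"Log: {json.dumps(example_log, sort_keys=True)}")
        lines.append(f"Diagnosis: {diagnosis}")
    lines += ["", "Now analyze:", f"Log: {json.dumps(log, sort_keys=True)}", "Diagnosis:"]
    return "\n".join(lines)


prompt = build_prompt({"status_code": 503, "total_proxy_time": 9, "total_target_time": 0})
print(prompt)
```

Ending the prompt on an open `Diagnosis:` line nudges the model to answer in the same shape as the examples.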

Agent 2: RCA Agent

With the structured log analysis and metadata, the RCA agent performs:

  • Pattern recognition across errors
  • Correlation with mesh health
  • Hypothesis generation (e.g., DNS failure, backend down, policy failure)

Its goal isn’t to provide a final fix; it’s to point engineers in the correct direction within seconds.

4. Automated blast-radius analysis

One of Stargate’s most business-critical capabilities is blast-radius detection. Using metadata APIs, the agent identifies how many consumers use the failing proxy, which teams are affected, and which business flows may be impacted.

The agent automatically surfaces the kind of insights that used to take multiple people 15–20 minutes to establish, including the number of consumers affected, the downstream business impact, whether authentication or security issues are involved, and whether the root cause lies in the service mesh rather than the gateway.
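At its core, a blast-radius calculation is a lookup from a failing proxy to the consumers, teams, and business flows that depend on it. The sketch below shows that lookup over in-memory records. The `Consumer` fields and sample data are hypothetical stand-ins for whatever Maersk's metadata APIs return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consumer:
    name: str
    team: str
    business_flow: str
    proxy_ids: frozenset[str]  # proxies this consumer calls


def blast_radius(failing_proxy_id: str, consumers: list[Consumer]) -> dict:
    """Summarize who depends on the failing proxy."""
    affected = [c for c in consumers if failing_proxy_id in c.proxy_ids]
    return {
        "consumer_count": len(affected),
        "teams": sorted({c.team for c in affected}),
        "business_flows": sorted({c.business_flow for c in affected}),
    }


# Hand-written sample records for illustration only.
consumers = [
    Consumer("portal", "team-web", "booking", frozenset({"proxy-123"})),
    Consumer("partner-edi", "team-b2b", "customs", frozenset({"proxy-123", "proxy-456"})),
    Consumer("tracker", "team-web", "tracking", frozenset({"proxy-456"})),
]
print(blast_radius("proxy-123", consumers))
```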

5. Instant notifications with actionable context

Stargate pushes insights to Teams channels (with Slack planned for the future). A typical output includes:

  • Summary of what failed
  • Proxy ID
  • Gateway link
  • Access logs
  • Application health
  • Mesh health
  • Blast-radius details
  • Hypothesis of the root cause
  • Links to start debugging immediately
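A notification like this can be produced by a plain formatting function over the triage findings. The sketch below builds a text message only and does not call any chat platform. The field names, the `format_summary` function, and the placeholder URL are assumptions for illustration.

```python
def format_summary(finding: dict) -> str:
    """Render triage findings as a short, scannable text message."""
    lines = [
        f"Incident summary: {finding['summary']}",
        f"Proxy ID: {finding['proxy_id']}",
        f"Application health: {finding['application_health']}",
        f"Mesh health: {finding['mesh_health']}",
        f"Blast radius: {finding['consumer_count']} consumers affected",
        f"Root-cause hypothesis: {finding['hypothesis']}",
    ]
    # Optional debugging links, one per line.
    for label, url in finding.get("links", {}).items():
        lines.append(f"{label}: {url}")
    return "\n".join(lines)


# Hand-written sample finding for illustration only.
message = format_summary({
    "summary": "Spike in 503 responses on /bookings",
    "proxy_id": "proxy-123",
    "application_health": "healthy",
    "mesh_health": "degraded",
    "consumer_count": 2,
    "hypothesis": "Service discovery failure in the mesh",
    "links": {"Access logs": "https://logs.example.invalid/proxy-123"},
})
print(message)
```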

The agent’s summary arrives only a few seconds after the alert fires — before most engineers even join the bridge.

“With this information in hand, I can join the bridge confidently,” Kumar said. “In our demo scenario, gateway wasn't the problem at all, and we knew that instantly.”

6. Built for future autonomy

Stargate was designed as an operational automation framework, not just a tool. It's intentionally modular, and future phases may include:

  • Automatically removing unhealthy data planes from the load balancer
  • Rerouting traffic to healthy regions
  • Triggering owner notifications via Hedwig without human involvement
  • Applying automated mitigations based on policy
Results

Seconds to clarity, not minutes

Maersk’s deployment of Stargate transformed incident management. 

First, it cut triage time from 20 minutes to seconds.

Before Kong, incident response relied on lengthy bridge calls involving multiple teams. Engineers spent significant time manually digging through logs, and triage was often based on educated guesses about the failure's origin. With Kong, however, logs are analyzed instantly, and hypotheses are generated automatically. The blast-radius of the incident is calculated in seconds, allowing engineers to join the bridge with immediate, complete situational awareness.


“Previously it took 20 minutes to understand what failed. Now, it takes seconds.”

Manish S V Kumar
Senior Engineering Manager, A.P. Moller - Maersk

Second, Stargate has taken over the draining, repetitive tasks that typically fatigue on-call personnel, including fetching logs and identifying proxy IDs, checking the health of the mesh and backend systems, evaluating DNS and service discovery mechanisms, mapping system owners, and formulating initial hypotheses for an issue. As a result, platform teams can now focus their valuable time on resolving issues rather than merely discovering them.

It has also improved the accuracy and consistency of triage. Because attributes like total_target_time are structured and explained to it, the agent interprets logs the same way senior engineers do, but faster. This means more accurate differentiation between gateway, mesh, and backend failures, less finger-pointing during incidents, and faster routing to the correct teams.

Stargate has also given the business clearer visibility through blast-radius detection. Stakeholders can immediately see which entities are impacted, how many consumers are affected, which SLAs may be compromised, and what revenue may be at risk, which enables faster customer communication and proactive risk mitigation.

Finally, Stargate has led to a cultural shift toward shared ownership. Embedding incident awareness directly into automation led to several positive shifts, including empowered developers who feel more capable and in control, reduced dependency on specific individual experts, enhanced visibility for teams into ongoing issues, and a more strategic focus for platform teams who can dedicate their efforts to higher-level improvements. Ultimately, Stargate did more than just streamline incident management; it fundamentally transformed how teams work together when facing high-pressure situations.

Maersk’s next decade of operational excellence

Stargate is already expanding into:

  • Automatically modifying load-balancer pools
  • Proactive data plane health management
  • Intelligent alert threshold tuning
  • Automated ownership routing via Hedwig
  • Real-time anomaly detection powered by traffic patterns

Every improvement is evaluated through a consistent lens: does it empower engineers without compromising safety, governance, or operational resilience?

By combining Kong’s gateway, QMA mesh, their internal developer platform, and AI-driven automation, Maersk has built a modern operational fabric that scales not just its technology, but its people.
