Maersk Boosted Global API Reliability with AI and Kong

Maersk Deploys AI First Responder to Boost Global API Reliability With Kong

A global logistics leader operating across 130+ countries built Stargate, an AI-driven incident triage engine that cuts time-to-detection and first-response from 20 minutes to seconds

3,500+ developers supported
1,500+ APIs managed
4,000+ downstream consumers reliant on 24/7 uptime

A.P. Moller - Maersk is one of the world’s largest shipping and logistics companies, serving more than 100,000 customers and digitally connecting global trade across 130+ countries.

Background

When most people imagine Maersk, they see ships, containers, and ports. But behind the physical movement of goods sits a deeply complex digital logistics fabric. Every booking, customs clearance, yard movement, inventory update, and customer status check happens through APIs.

“We are a technology-driven logistics company. Global trade today is connected through data platforms, telemetry, and APIs and those APIs need to be fast, available, and extremely reliable.”

Manish S V Kumar

Senior Engineering Manager, A.P. Moller - Maersk

To power this digital backbone, Maersk has built a distributed infrastructure that handles traffic from thousands of customers across the world. Each incoming request travels a sophisticated multi-layer path:

Edge DNS (Akamai) for bot protection, WAF, and routing
Kong data planes deployed across regions
QMA service mesh sidecars inside gateways for service discovery
Backend applications that process business logic — from booking to customs to container tracking

This mesh-enabled, multi-region setup gives Maersk the scale and redundancy required for global operations. But it also magnifies the operational challenge: dozens of systems must work in perfect synchronization for a single API call to succeed.

To manage this complexity, Maersk built a developer platform that abstracts away operational burden. Developers can onboard APIs, configure Akamai, provision gateway workspaces, register services into the mesh, and attach monitoring — all automatically.

The platform accelerated software delivery dramatically. But running a system at Maersk’s scale introduced new operational risks. And those risks started showing up during incidents.

Challenge

Maersk’s developer platform supports:

3,500+ developers
1,500+ APIs
4,000+ downstream consumers running critical supply chain flows 24/7

At this scale, failures aren’t hypothetical; they’re inevitable.

“The question is not if something will fail. What happens when something fails? How fast can you detect, triage, and route it?”

Manish S V Kumar

Senior Engineering Manager, A.P. Moller - Maersk

Like many enterprises with large distributed systems, Maersk saw three interconnected operational pain points.

1. Slow MTTD due to noisy, fragmented signals

When an incident occurred — maybe a sudden spike in 503s — the first 15–20 minutes were typically spent just understanding what happened.

Command centers pulled in teams from all layers
Gateway engineers joined bridges even when the gateway wasn’t at fault
Teams scrambled to extract logs
No one knew whether failure was in the gateway, service mesh, DNS, or downstream service

Logs contained 80+ attributes, some custom, some standard. They weren’t immediately interpretable by engineers outside the core platform team. Extracting relevant information during an incident was slow manual work.

“Teams often didn’t know whether the problem was in the gateway or somewhere downstream. Just identifying the proxy ID took time,” Kumar said.

As a result, MTTD ballooned, especially during high-pressure scenarios.

2. High cognitive load on SREs and platform engineers

Senior engineers found themselves pulled into nearly every incident call.

“We get pulled into every bridge no matter where the fault is,” said Abhishek Das, Senior Software Engineer at A.P. Moller - Maersk. “We realized many incidents had nothing to do with the gateway. But by the time you figure that out, 20 minutes are gone.”

This constant involvement created:

Developer fatigue
Burnout from late-night on-call
Lost engineering productivity
A “hero culture” dependence on experts who knew how to interpret the logs

The organization needed a way to automate early triage, reduce noise, and let humans focus on higher-value remediation.

3. Lack of automated blast-radius and ownership analytics

During an outage, business teams care about questions like:

How many customers are impacted?
What revenue or SLAs are at risk?
Who owns the failing service?

But Maersk’s system didn't automatically reveal:

Which consumers were affected
Which API was behind an endpoint path
Who the responsible backend owners were

“We needed to automatically identify the blast radius, which consumers or flows were impacted, because that determines the business response,” Das said.

The friction was felt everywhere

This friction wasn’t just technical but cultural:

Engineers lacked visibility and felt “blind” during incidents
On-call rotations became draining
Stakeholders lacked real-time insight into business impact
Platform teams were overwhelmed by repetitive triage tasks

The team asked themselves if they could build an agent that becomes a first responder, something that does the first-level triage automatically. That question led to Stargate.

Solution

Maersk created Stargate, an operational AI agent that automates the full first-level triage process: detection, RCA, blast-radius analysis, and routing. It is not a chatbot. It's an orchestrated, event-driven, multi-agent system integrated deeply into Maersk’s developer platform and observability stack. And it runs in seconds.

1. Autonomous detection powered by observability

Stargate begins with event-driven detection. Whenever an API is onboarded, Maersk’s platform automatically:

Configures gateway routes
Registers the service in the mesh
Sets up Akamai configs
Creates alerting rules in Grafana and Prometheus

So when an API misbehaves, Grafana fires an alert to a Webhook that the agent listens to. This Webhook trigger initiates the entire automated investigation.
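As a rough illustration of the trigger step, the sketch below turns an incoming alert payload into investigation events. It assumes the Alertmanager-style JSON body that Grafana webhook notifications generally carry (an `alerts` list whose items have `status`, `labels`, and `startsAt`). The `proxy_id` label, the `AlertEvent` type, and the sample values are hypothetical and do not come from Maersk's setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    alert_name: str
    proxy_id: str      # hypothetical label identifying the gateway proxy
    started_at: str


def extract_firing_alerts(payload: dict) -> list[AlertEvent]:
    """Return one AlertEvent per firing alert in a webhook payload."""
    events = []
    for alert in payload.get("alerts", []):
        # Resolved notifications should not start a new investigation.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        events.append(
            AlertEvent(
                alert_name=labels.get("alertname", "unknown"),
                proxy_id=labels.get("proxy_id", "unknown"),
                started_at=alert.get("startsAt", ""),
            )
        )
    return events


# Illustrative payload; all values are made up.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "High5xxRate", "proxy_id": "proxy-123"},
            "startsAt": "2024-01-01T00:00:00Z",
        },
        {
            "status": "resolved",
            "labels": {"alertname": "HighLatency", "proxy_id": "proxy-456"},
            "startsAt": "2024-01-01T00:00:00Z",
        },
    ],
}

events = extract_firing_alerts(sample_payload)
```

Only the firing alert produces an event here; the resolved one is skipped, so a recovery notification would not start a second investigation.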

2. Orchestration engine that gathers the ground truth

Once triggered, the Stargate orchestrator:

Fetches proxy metadata (proxy ID, API name, mapping)
Queries service mesh health
Retrieves backend application status
Looks up ownership and on-call teams
Identifies consumers of the failing API

This data comes from Maersk’s internal metadata APIs, which themselves talk to:

Kong Admin APIs
QMA service mesh
Maersk’s developer platform
Hedwig (on-call management platform)

“An agent is only as good as its integrations,” Das said. This integration layer is what gives the agent enterprise-grade context.
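The gathering step described above could be sketched as a fan-out over several lookups that run in parallel, with failures recorded rather than aborting the investigation. This is a minimal sketch under assumptions: every `fetch_*` function is a hypothetical stand-in for a call to an internal metadata API, and the returned values are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


# Hypothetical stand-ins for calls to internal metadata APIs.
def fetch_proxy_metadata(proxy_id: str) -> dict:
    return {"proxy_id": proxy_id, "api_name": "booking-api"}


def fetch_mesh_health(proxy_id: str) -> dict:
    return {"healthy": False}


def fetch_backend_status(proxy_id: str) -> dict:
    return {"status": "up"}


def fetch_ownership(proxy_id: str) -> dict:
    # Simulate one source failing during an incident.
    raise TimeoutError("on-call lookup timed out")


def fetch_consumers(proxy_id: str) -> dict:
    return {"consumers": ["consumer-a", "consumer-b"]}


SOURCES: dict[str, Callable[[str], dict]] = {
    "proxy": fetch_proxy_metadata,
    "mesh": fetch_mesh_health,
    "backend": fetch_backend_status,
    "ownership": fetch_ownership,
    "consumers": fetch_consumers,
}


def gather_context(proxy_id: str, sources=SOURCES) -> dict:
    """Query every source concurrently and collect results by name."""
    context = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {
            name: pool.submit(fetch, proxy_id) for name, fetch in sources.items()
        }
        for name, future in futures.items():
            try:
                context[name] = future.result()
            except Exception as exc:
                # A failed lookup is recorded, not fatal: partial context
                # is still useful for first-level triage.
                context[name] = {"error": str(exc)}
    return context


context = gather_context("proxy-123")
```

Keeping each source behind a plain function keeps the orchestrator independent of any one backing system, which fits the point that the agent's value depends on its integrations.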

3. Two-agent RCA model: Log Analyzer + RCA Synthesizer

Stargate uses a two-agent architecture.

Agent 1: Log Analyzer

Logs contain more than 80 attributes — too many to send directly to the LLM. So before analysis:

Sensitive data is stripped out (keys, IPs, etc.)
Only relevant attributes are extracted, such as timestamp, trace IDs, status codes, API path, total request time, total proxy time, total target time

These custom metrics are crucial:

total_target_time = 0 means the request never reached the backend
proxy time vs. target time reveals where latency is introduced
error messages indicate DNS or mesh failures

The team explicitly tells the model what each attribute means. Providing these definitions together with examples, a technique known as few-shot prompting, improved agent accuracy from ~50% to ~80%.
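The pre-processing and timing interpretation described here might look roughly like the following. It is a sketch under assumptions, not Stargate's implementation: the snake_case field names, the allowlist approach, the IPv4 masking, and the "longer side wins" latency heuristic are all illustrative choices. Only the meaning of a zero total_target_time comes from the text above.

```python
import re

# Allowlist of the attributes named in the text; the key spellings are assumed.
RELEVANT_FIELDS = (
    "timestamp", "trace_id", "status_code", "api_path",
    "total_request_time", "total_proxy_time", "total_target_time",
    "error_message",
)

IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def reduce_log(entry: dict) -> dict:
    """Keep only allowlisted attributes and mask IPs left in free text."""
    reduced = {key: entry[key] for key in RELEVANT_FIELDS if key in entry}
    message = reduced.get("error_message")
    if isinstance(message, str):
        reduced["error_message"] = IPV4_PATTERN.sub("[ip]", message)
    return reduced


def interpret_timing(entry: dict) -> str:
    """Translate the custom timing metrics into a plain-language hint."""
    if "total_target_time" not in entry:
        return "timing data missing"
    if entry["total_target_time"] == 0:
        return "request never reached the backend"
    # Assumed heuristic: the side that took longer is where latency arises.
    if entry.get("total_proxy_time", 0) > entry["total_target_time"]:
        return "latency introduced mainly at the gateway"
    return "latency introduced mainly at the backend"


# Illustrative log entry; all values are made up.
raw_entry = {
    "timestamp": "2024-01-01T00:00:00Z",
    "trace_id": "abc123",
    "status_code": 503,
    "api_path": "/bookings",
    "total_request_time": 12,
    "total_proxy_time": 12,
    "total_target_time": 0,
    "error_message": "DNS resolution failed for 10.0.0.12",
    "client_ip": "203.0.113.7",
    "api_key": "secret",
}

reduced = reduce_log(raw_entry)
hint = interpret_timing(reduced)
```

An allowlist drops the sensitive fields by default, so a new attribute added to the logs is not sent to the model unless someone deliberately includes it.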

Agent 2: RCA Agent

With the structured log analysis and metadata, the RCA agent performs:

Pattern recognition across errors
Correlation with mesh health
Hypothesis generation (e.g., DNS failure, backend down, policy failure)

Its goal isn’t to provide a final fix; it’s to point engineers in the correct direction within seconds.

4. Automated blast-radius analysis

One of Stargate’s most business-critical capabilities is blast-radius detection. Using metadata APIs, the agent identifies how many consumers use the failing proxy, which teams are affected, and which business flows may be impacted.

The agent automatically surfaces the kind of insights that used to take multiple people 15–20 minutes to establish, including the number of consumers affected, the downstream business impact, whether authentication or security issues can be ruled out, and whether the root cause lies in the service mesh rather than the gateway.
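To make the blast-radius idea concrete, here is a minimal sketch that summarizes who depends on a failing proxy. The `Subscription` record and its fields are hypothetical. In practice this mapping would come from the metadata APIs, whose shape is not described in this story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    # Hypothetical record linking a consumer to a proxy it calls.
    consumer: str
    team: str
    business_flow: str
    proxy_id: str


def blast_radius(proxy_id: str, subscriptions: list[Subscription]) -> dict:
    """Summarize consumers, teams, and flows that depend on a proxy."""
    affected = [s for s in subscriptions if s.proxy_id == proxy_id]
    return {
        "consumer_count": len({s.consumer for s in affected}),
        "teams": sorted({s.team for s in affected}),
        "business_flows": sorted({s.business_flow for s in affected}),
    }


# Illustrative data; all names are made up.
subscriptions = [
    Subscription("consumer-a", "team-booking", "booking", "proxy-123"),
    Subscription("consumer-b", "team-customs", "customs", "proxy-123"),
    Subscription("consumer-a", "team-booking", "tracking", "proxy-123"),
    Subscription("consumer-c", "team-yard", "yard", "proxy-456"),
]

radius = blast_radius("proxy-123", subscriptions)
```

Using sets avoids double-counting a consumer that reaches the same proxy through more than one business flow.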

5. Instant notifications with actionable context

Stargate pushes insights to Teams channels (Slack support is planned). A typical output includes:

Summary of what failed
Proxy ID
Gateway link
Access logs
Application health
Mesh health
Blast-radius details
Hypothesis of the root cause
Links to start debugging immediately

The agent’s summary arrives only a few seconds after the alert fires — before most engineers even join the bridge.

“With this information in hand, I can join the bridge confidently,” Kumar said. “In our demo scenario, gateway wasn't the problem at all, and we knew that instantly.”

6. Built for future autonomy

Stargate was designed as an operational automation framework, not just a tool. It's intentionally modular, and future phases may include:

Automatically removing unhealthy data planes from the load balancer
Rerouting traffic to healthy regions
Triggering owner notifications via Hedwig without human involvement
Applying automated mitigations based on policy

Results

Maersk’s deployment of Stargate transformed incident management.

First, it cut triage time from 20 minutes to seconds.

Before Stargate, incident response relied on lengthy bridge calls involving multiple teams. Engineers spent significant time manually digging through logs, and triage was often based on educated guesses about the failure's origin. With Stargate, logs are analyzed instantly and hypotheses are generated automatically. The blast radius of the incident is calculated in seconds, allowing engineers to join the bridge with immediate, complete situational awareness.

“Previously it took 20 minutes to understand what failed. Now, it takes seconds.”

Manish S V Kumar

Senior Engineering Manager, A.P. Moller - Maersk

Second, Stargate has taken over the draining, repetitive tasks that typically fatigue on-call personnel: fetching logs and identifying proxy IDs, checking the health of the mesh and backend systems, evaluating DNS and service discovery mechanisms, mapping system owners, and formulating initial hypotheses. As a result, platform teams can now spend their time resolving issues rather than merely discovering them.

Third, Stargate has improved the accuracy and consistency of triage. Because attributes like total_target_time are structured and explained to it, the agent interprets logs the same way senior engineers do, only faster. This means more accurate differentiation between gateway, mesh, and backend failures, less finger-pointing during incidents, and faster routing to the correct teams.

Stargate has also resulted in clearer business visibility via blast-radius detection. Business stakeholders gain immediate visibility into the entities impacted, the volume of affected consumers, the potential for SLAs to be compromised, and the potential for revenue jeopardy, which facilitates more prompt customer communication and proactive risk mitigation.

Finally, Stargate has led to a cultural shift toward shared ownership. Embedding incident awareness directly into automation led to several positive shifts, including empowered developers who feel more capable and in control, reduced dependency on specific individual experts, enhanced visibility for teams into ongoing issues, and a more strategic focus for platform teams who can dedicate their efforts to higher-level improvements. Ultimately, Stargate did more than just streamline incident management; it fundamentally transformed how teams work together when facing high-pressure situations.

Maersk’s next decade of operational excellence

Stargate is already expanding into:

Automatically modifying load-balancer pools
Proactive data plane health management
Intelligent alert threshold tuning
Automated ownership routing via Hedwig
Real-time anomaly detection powered by traffic patterns

Every improvement is evaluated through a consistent lens: does it empower engineers without compromising safety, governance, or operational resilience?

By combining Kong’s gateway, QMA mesh, their internal developer platform, and AI-driven automation, Maersk has built a modern operational fabric that scales not just its technology, but its people.
