Engineering
August 10, 2022
7 min read

Optimize Your API Gateway with Chaos Engineering

Andrew Kew

As engineers and architects, we build as much resilience into our platforms as we can. But what about the unknown failures? What about the unknown behavior of your platform? The philosopher Socrates once said, "You don’t know what you don’t know." What if I could tell you there is a way to turn these unknowns into knowns – a way to understand how your platform will behave in response to specific failure events?

Gone are the days of your manager or CTO walking into the server room and pulling the power plug on one of the servers to simulate "chaos" (true story – I worked with a colleague who did this to his IT team). No support engineer wants to be woken up at 2 a.m. by unexpected behavior in the platform they support, only to discover it was never designed to withstand that kind of event.

In this two-part blog series, we will first look at what Chaos Engineering is and why it is useful for testing the resilience of distributed systems. We will then walk through some scenarios that should be tested with Kong Gateway to ensure your platform is correctly configured to deal with failures. The follow-up post will build on this with a tutorial that implements these scenarios as experiments, getting hands-on with a Chaos Engineering tool to put your Kong Gateway deployment to the test.

What is Chaos Engineering?

With the move to distributed systems, confidence in high availability and robustness becomes a complicated beast to tame. The shift from bare-metal infrastructure to cloud services adds another layer of complexity. Add microservices on top of that, and you end up with a massively complex ecosystem whose potential failure points multiply rapidly with the number of services in your system.

Enter Chaos Engineering: the practice of injecting controlled failures into your system to improve its overall response. Observing how the system behaves in response to these failures gives us the opportunity to monitor and improve it.

Benefits of Chaos Engineering

Like most things in life, practicing for events that are out of our control is a great way to build muscle memory for when they actually happen. A good example is a fire drill. If you have ever worked in an office building, you will have experienced a fire drill where the entire building vacates onto the street via the emergency exit stairwell. By practicing this time and time again, everyone should know what to do in the unlikely event of a real fire: where to meet and how to make sure everyone gets out of the building safely. Any failures in the process should already have been identified in the weekly drills.

Similarly, injecting failures into your platform in a controlled manner makes total sense: it lets you uncover and resolve unknown failure behavior before it matters. It also lets your teams verify that their documentation, system access, playbooks and scripts are correct, and it builds a reflexive response for an actual critical (Priority 1) event.

According to the State of Chaos Engineering report [1], teams that consistently run Chaos Engineering experiments have higher levels of availability than those that have never performed an experiment or only run ad hoc experiments. While this doesn't guarantee high availability, it is clearly a best practice.

Key findings from the report [1] include:

  • "Increased availability and decreased Mean Time To Resolution (MTTR) are the two most common benefits of Chaos Engineering"
  • Network attacks, the most commonly reported failure type, are also the most commonly run experiments.

A brief history of Chaos Engineering

2008

Netflix suffered a massive outage that impacted its DVD-shipping business at the time. It took three days to resolve the problem, which, after extensive forensic analysis, was traced back to a hardware failure. This pushed Netflix to rethink its architecture and migrate to a distributed cloud architecture.

2010

Moving to the cloud prompted Netflix to design for failure, since hosts can be terminated or fail at any time. With this in mind, Netflix needed a tool to test these failures in a controlled way and identify weaknesses in its architecture, so it built Chaos Monkey, a tool that intentionally pulls down instances and services. It gave Netflix the opportunity to see how the system would respond to such failures and to build levels of automation that would resolve them.

Due to its success, Netflix open-sourced Chaos Monkey and created a new job role dedicated specifically to Chaos Engineering.

2020

Chaos Engineering went mainstream, making headlines on Bloomberg. AWS also released its own Chaos Engineering tool, AWS Fault Injection Simulator. [2]

2022

In the last five years, Google searches for Chaos Engineering have exploded, increasing 24-fold. [1]

What are the Principles of Chaos Engineering?

As mentioned before, Chaos Engineering is the practice of injecting failures into a system and observing how the system responds.

The principles of this practice are simple [2] (a minimal code sketch follows the list):

  1. Hypothesis: Decide on what kind of issue/error you are going to inject into your platform and what behavior you expect to observe.
  2. Experiment: Design the smallest possible experiment that you can use to test out this hypothesis.
  3. Measure & Improve: Monitor your platform’s response to the experiment by identifying the successes and failures at every step. Then improve the parts which are failing so that they scale or self-heal, which will improve the overall response of the system.

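To make these three steps concrete, here's a minimal, tool-agnostic sketch in Python of what an experiment loop can look like. The inject and verify callables are placeholders for your own failure injection and health checks; nothing here is tied to any specific Chaos Engineering tool.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    """One chaos experiment: a hypothesis, a failure to inject, and a check."""
    hypothesis: str
    inject: Callable[[], None]   # Experiment: the smallest possible failure injection
    verify: Callable[[], bool]   # Measure: True if the observed behavior matches the hypothesis

def run(experiment: Experiment) -> None:
    print(f"Hypothesis: {experiment.hypothesis}")
    experiment.inject()        # inject the controlled failure
    ok = experiment.verify()   # observe how the system responds
    print("PASS" if ok else "FAIL - improve the failing component and re-run")

# Trivial placeholder usage; swap in real injection and verification functions.
if __name__ == "__main__":
    run(Experiment(
        hypothesis="The proxy keeps serving traffic while the control plane is down",
        inject=lambda: None,
        verify=lambda: True,
    ))
```

The Kong-specific scenarios below follow this same loop: each one states a hypothesis, injects a single failure, and checks a small set of observable conditions.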

Chaos Engineering with Kong: Example scenarios

How does this apply to Kong, I hear you say? There are a number of scenarios that can be tested against your Kong Gateway deployment. Here are two examples:

Scenario 1 – Kong Hybrid Behavior

Hypothesis

In a hybrid deployment, the control plane and data planes are separated, and configuration is transferred from the control plane to the data planes over a WebSocket connection. This separation improves the resilience of the platform: the data plane is isolated from the control plane and can keep proxying traffic even when the control plane fails. The connection between the control plane and data plane can go down for a number of reasons:

  1. The control plane has gone down
  2. The database or connection to the database has gone down
  3. The WebSocket between the control plane and data plane has gone down

This experiment will test the failure of the database connected to the control plane.

Experiment

Bring down the control plane's database, which will cause the control plane to fail. This in turn breaks the connection between the data plane and the control plane.
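As a rough illustration, assuming the control plane's PostgreSQL database runs as a Kubernetes Deployment named postgres in a kong namespace (both names are assumptions for this sketch, not a prescribed setup), the failure can be injected by scaling the database to zero replicas:

```python
import subprocess

# Assumed names for illustration only - adjust to your own namespace and deployment.
NAMESPACE = "kong"
DB_DEPLOYMENT = "postgres"

def bring_down_database() -> None:
    """Scale the control plane's database to zero replicas to simulate a database outage."""
    subprocess.run(
        ["kubectl", "scale", f"deployment/{DB_DEPLOYMENT}", "--replicas=0", "-n", NAMESPACE],
        check=True,
    )

def restore_database() -> None:
    """Bring the database back once the experiment is finished."""
    subprocess.run(
        ["kubectl", "scale", f"deployment/{DB_DEPLOYMENT}", "--replicas=1", "-n", NAMESPACE],
        check=True,
    )

if __name__ == "__main__":
    bring_down_database()
```

In part two, we'll let a dedicated Chaos Engineering tool inject this kind of failure for us instead of scripting it by hand.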

Measure & Improve

With the database down, the control plane will fail, as it needs a database connection to function, but the data plane will be unaffected. To verify this, the following must be true (a sketch of these checks follows the list):

  1. The admin API is not accessible (returns a Failed to connect: Connection refused response)
  2. The data plane (proxy) is accessible (returns 200 HTTP responses)
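A simple way to check both conditions is a small script like the sketch below. It assumes the admin API is reachable on Kong's default port 8001 and the proxy on port 8000, with a hypothetical test route configured on the proxy; adjust the URLs and expectations to match your deployment.

```python
import requests

# Assumed endpoints for illustration - Kong's default ports; adjust to your deployment.
ADMIN_URL = "http://localhost:8001"
PROXY_URL = "http://localhost:8000/my-test-route"  # hypothetical route that should return 200

def admin_is_down() -> bool:
    """The admin API should refuse connections while the control plane is down."""
    try:
        requests.get(ADMIN_URL, timeout=3)
        return False
    except requests.exceptions.ConnectionError:
        return True

def proxy_is_up() -> bool:
    """The data plane should keep serving traffic from its cached configuration."""
    try:
        return requests.get(PROXY_URL, timeout=3).status_code == 200
    except requests.exceptions.RequestException:
        return False

if __name__ == "__main__":
    print("Admin API refusing connections:", admin_is_down())
    print("Proxy still returning 200s:", proxy_is_up())
```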

Scenario 2 – Availability Zone Outage

Hypothesis

When designing your cloud architecture, it's important to design for failure: not only failure of your application or platform, but also outages of the data center your platform runs in. Cloud providers help you handle this by offering multiple availability zones (AZs) within each region, where each AZ is a separate physical data center. This means you can design your system to survive the unlikely event of an entire AZ going down.

The following experiment will test for exactly that failure, which has happened in the past on several occasions:

  1. 24th August 2019
  2. 25th November 2020
  3. 15th December 2021

Experiment

To test this hypothesis, a worker node in a random AZ will be pulled down, simulating the data center losing power or the physical server suffering a hardware failure.

Cluster autoscaling will be disabled to ensure that a replacement worker is not created in that AZ, thus simulating a full AZ outage.
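One way to approximate this by hand is sketched below. It assumes a kubectl context pointed at the test cluster and nodes carrying the standard topology.kubernetes.io/zone label; it picks a random zone, then cordons and drains one of that zone's nodes. Disabling the cluster autoscaler is provider-specific, so it is left as a manual prerequisite here.

```python
import json
import random
import subprocess

ZONE_LABEL = "topology.kubernetes.io/zone"  # standard Kubernetes zone label

def pick_node_in_random_zone() -> str:
    """Return the name of one node in a randomly chosen availability zone."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    by_zone: dict[str, list[str]] = {}
    for node in json.loads(out)["items"]:
        zone = node["metadata"]["labels"].get(ZONE_LABEL, "unknown")
        by_zone.setdefault(zone, []).append(node["metadata"]["name"])
    zone = random.choice(list(by_zone))
    return random.choice(by_zone[zone])

def drain(node: str) -> None:
    """Cordon and drain the node to simulate it disappearing along with its AZ."""
    subprocess.run(["kubectl", "cordon", node], check=True)
    subprocess.run(
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        check=True,
    )

if __name__ == "__main__":
    # Prerequisite: disable cluster autoscaling so no replacement node appears in that AZ.
    node = pick_node_in_random_zone()
    print(f"Draining {node}")
    drain(node)
```

Draining evicts the workloads rather than powering off the host, but with one worker node per AZ the effect on the Kong deployment is similar: everything running in that zone goes away.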

Measure & Improve

This test will verify that the following is correctly configured:

  1. There are at least two data planes and two control planes, each running on a different worker node (and therefore in a different AZ)
  2. The cluster has more than one worker node, each in a different AZ

As long as Kong Gateway and your cluster have been configured correctly, this outage should have no effect on your Kong deployment: a data plane and a control plane in another AZ will be unaffected by the failure. To verify this, the following must be true (the checks from Scenario 1 can be reused, as sketched after the list):

  1. The admin API is accessible (returns 200 HTTP responses)
  2. The data plane (proxy) is accessible (returns 200 HTTP responses)
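The same style of check script from Scenario 1 can be reused here, with the expectation for the admin API flipped to success; for example (again assuming the default ports and a hypothetical test route):

```python
import requests

# Assumed endpoints for illustration - adjust to your deployment.
ADMIN_URL = "http://localhost:8001/status"          # Kong admin API status endpoint
PROXY_URL = "http://localhost:8000/my-test-route"   # hypothetical test route

def returns_200(url: str) -> bool:
    try:
        return requests.get(url, timeout=3).status_code == 200
    except requests.exceptions.RequestException:
        return False

if __name__ == "__main__":
    # Both planes should remain available after losing a single AZ.
    print("Admin API up:", returns_200(ADMIN_URL))
    print("Proxy up:", returns_200(PROXY_URL))
```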

Summary

Hopefully, this paints a clear picture of the importance of testing your distributed ecosystem in a controlled manner. It lets you understand how your platform responds to failures that you control, rather than having to dissect an issue after an outage has already cost your organization revenue.

You've now seen two test scenarios for a Kong Gateway hybrid deployment. However, this approach can and should be applied to any platform.

In part two of this series, we’ll go one step further to:

  • Introduce Chaos Mesh
  • Walk you through a tutorial to set up these experiments
  • Analyze the response of Kong Gateway to these injected failures
  • Recommend improvements for any failures

Planning for unknown failures, and understanding how your platform behaves in these situations, is the best way to improve the resilience of your API platform. It will reduce the number of critical events and increase confidence in the platform.

References

[1] https://www.gremlin.com/state-of-chaos-engineering/2021/

[2] https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/

For more content from this author, check out this interview on Kongcast.

Topics: API Gateway | API Development | API Management