Optimize Your API Gateway with Chaos Engineering
As engineers and architects, we build as much resilience into our platforms as we can. But what about the unknown failures? What about the unknown behavior of your platform? As the saying often attributed to the philosopher Socrates goes, "You don't know what you don't know." What if there were a way to turn these unknowns into knowns, a way to understand how your platform will respond to specific failure events?
Gone are the days of having your manager or CTO walk into the server room and pull the power plug on one of the servers to simulate "chaos" (true story: I worked with a colleague who did this to his IT team). No support engineer wants to be woken up at 2 a.m. by some unknown behavior of the platform they support, only to discover that it was never designed to withstand that kind of event.
In this two-part blog series, we will first look at what Chaos Engineering is and how it is useful for testing the resilience of distributed systems. We will then walk through some scenarios that should be tested with Kong Gateway to ensure that your platform is correctly configured to deal with failures. The follow-up post builds on this with a tutorial that implements these scenarios as experiments: we will get hands-on with a Chaos Engineering tool and put your Kong Gateway deployment to the test.
What is Chaos Engineering?
With the move to distributed systems, confidence in high availability and robustness becomes a complicated beast to tame. Moving from bare metal infrastructure to cloud services adds yet another layer of complexity. Add microservices on top and, depending on the number of services in your system, you end up with a massively complex ecosystem in which the number of potential failure points grows with every service you add.
Enter Chaos Engineering: the practice of injecting controlled failures into your system to improve its overall response. Observing how the system behaves in response to these failures gives us the opportunity to monitor and improve it.
Benefits of Chaos Engineering
Like most things in life, practicing for events that are out of our control is a great way to build muscle memory for when they actually happen. A fire drill is a great example. If you have ever worked in an office building, you will have experienced a fire drill in which the entire building evacuates onto the street via the emergency exit stairwell. By practicing this time and time again, in the unlikely event of a real fire everyone should know what to do, where to meet and how to make sure everyone gets out of the building safely. Any failures in the process should already have been identified in the regular drills.
Similarly, injecting failures into your platform in a controlled manner makes total sense: it lets you uncover and resolve any unknown failure behavior before a real incident does it for you. It also lets your teams verify that their documentation, system access, playbooks and scripts are correct, and builds a reflexive response for when an actual critical (Priority 1) event occurs.
According to the State of Chaos Engineering report, teams that consistently run Chaos Engineering experiments have higher levels of availability than those that have never performed an experiment, or only run them ad hoc. [1] While this doesn't guarantee high availability on its own, it is clearly a best practice.
Key findings from the report include:
- "Increased availability and decreased Mean Time To Resolution (MTTR) are the two most common benefits of Chaos Engineering"
- Network attacks, the most frequently reported failure type, are also the most commonly run experiments. [1]
A brief history of Chaos Engineering
2008
Netflix suffered a massive three-day outage that impacted its DVD shipping business at the time. After a lot of forensic analysis, the root cause was found to be a hardware failure. The incident pushed Netflix to rethink its architecture and migrate to a distributed cloud architecture.
2010
When Netflix moved to the cloud, it prompted them to design for failure, as hosts can be terminated or fail at any time. With this in mind, Netflix needed a tool to test these failures in a controlled way and identify weaknesses in their architecture. So they built a tool called Chaos Monkey, which intentionally pulled down instances and services. It gave Netflix the opportunity to see how their system would respond to these failures and enabled them to design levels of automation that would resolve them.
Due to the success of this, Netflix open-sourced Chaos Monkey and created a new job role specifically for Chaos Engineering.
2020
Chaos Engineering went mainstream, making headlines on Bloomberg. AWS also released its own Chaos Engineering tool, AWS Fault Injection Simulator. [2]
2022
Over the last five years, Google searches for Chaos Engineering have exploded, increasing 24-fold. [1]
What are the Principles of Chaos Engineering?
As mentioned before, Chaos Engineering is the practice of injecting failures into a system and observing how the system responds.
The principles of this practice are simple:
- Hypothesis: Decide on what kind of issue/error you are going to inject into your platform and what behavior you expect to observe.
- Experiment: Design the smallest possible experiment that you can use to test out this hypothesis.
- Measure & Improve: Monitor your platform's response to the experiment by identifying the successes and failures at every step. Then improve the parts that are failing so that they scale or self-heal, which will improve the overall response of the system. [2]
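To make these principles a little more concrete, here is a minimal sketch in Python of what an experiment harness could look like. The proxy URL and the inject_failure()/restore() helpers are assumptions used purely for illustration; in part two, those steps are handled by a real Chaos Engineering tool.

```python
"""Minimal sketch of the Hypothesis -> Experiment -> Measure loop.

The proxy URL and the inject_failure()/restore() helpers are hypothetical
placeholders; a real experiment would delegate those steps to a chaos tool
such as the one used in part two of this series.
"""
import requests

PROXY_URL = "http://kong-proxy.example.internal/my-service"  # assumed URL


def probe(url: str) -> int | None:
    """Return the HTTP status code, or None if the connection fails."""
    try:
        return requests.get(url, timeout=3).status_code
    except requests.RequestException:
        return None


def inject_failure() -> None:
    """Placeholder for the failure-injection step (a no-op in this sketch)."""
    print("injecting failure (placeholder)")


def restore() -> None:
    """Placeholder for restoring the steady state (a no-op in this sketch)."""
    print("restoring steady state (placeholder)")


def run_experiment() -> None:
    # Hypothesis: the proxy keeps returning 200 while the failure is active.
    assert probe(PROXY_URL) == 200, "steady state unhealthy; abort experiment"
    inject_failure()
    try:
        observed = probe(PROXY_URL)
        print("hypothesis held" if observed == 200
              else f"hypothesis failed: observed {observed}")
    finally:
        restore()


if __name__ == "__main__":
    run_experiment()
```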
Chaos Engineering with Kong: Example scenarios
How does this apply to Kong, I hear you say? There are a number of scenarios you can test with your Kong Gateway deployment. Here are two examples:
Scenario 1 – Kong Hybrid Behavior
Hypothesis
In a hybrid deployment, the control plane and data planes are separated, and configuration is transferred over a WebSocket connection. Separating these components improves the resilience of the platform: this deployment mode is designed to isolate the data plane from the control plane and make it more resilient to failure. The connection between the control plane and data plane can go down for a number of reasons:
- The control plane has gone down
- The database or connection to the database has gone down
- The WebSocket between the control plane and data plane has gone down
This experiment will test the failure of the database connected to the control plane.
Experiment
Bring down the control plane's database, which will cause the control plane to fail. This breaks the connection between the data plane and the control plane.
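As a rough sketch of how this could be automated on Kubernetes, the snippet below scales the control plane's PostgreSQL StatefulSet down to zero replicas using the official Kubernetes Python client. The kong namespace and kong-postgresql StatefulSet name are assumptions; adjust them to match your deployment, and scale back up to end the experiment.

```python
# Sketch: simulate a database failure by scaling the control plane's
# PostgreSQL StatefulSet down to zero replicas.
# The namespace and StatefulSet name below are assumptions.
from kubernetes import client, config

NAMESPACE = "kong"               # assumed namespace
STATEFULSET = "kong-postgresql"  # assumed StatefulSet name


def set_db_replicas(replicas: int) -> None:
    """Scale the database StatefulSet to the requested replica count."""
    config.load_kube_config()    # use load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_stateful_set_scale(
        name=STATEFULSET,
        namespace=NAMESPACE,
        body={"spec": {"replicas": replicas}},
    )


# Start the experiment: take the database down...
set_db_replicas(0)
# ...and once the measurements are done, restore the steady state:
# set_db_replicas(1)
```

Scaling to zero (rather than deleting the StatefulSet) keeps the experiment small and reversible, making it easy to restore the steady state once the measurements below have been taken.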
Measure & Improve
With the database down, the control plane will fail, as it needs a database connection to function, but the data plane will be unaffected. To verify this, the following must be true (see the probe sketch after this list):
- The admin API is not accessible (returns a Failed to connect: Connection refused response)
- The data plane (proxy) is accessible (returns 200 HTTP responses)
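These two checks can be automated with a small probe such as the one below. The hostnames and ports are assumptions for a typical hybrid install; what matters is the expected outcome: the admin API refuses connections while the proxy keeps returning 200.

```python
# Sketch: verify the Scenario 1 expectations (hostnames/ports are assumptions).
import requests

ADMIN_URL = "http://kong-control-plane.example.internal:8001/status"
PROXY_URL = "http://kong-data-plane.example.internal:8000/my-service"


def status(url: str) -> int | None:
    """Return the HTTP status code, or None when the connection is refused."""
    try:
        return requests.get(url, timeout=3).status_code
    except requests.RequestException:
        return None


admin, proxy = status(ADMIN_URL), status(PROXY_URL)
assert admin is None, f"expected the admin API to be unreachable, got {admin}"
assert proxy == 200, f"expected the proxy to keep serving 200, got {proxy}"
print("Scenario 1 hypothesis held: the data plane is unaffected by the database failure")
```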
Scenario 2 – Availability Zone Outage
Hypothesis
When designing your cloud architecture, it's always important to design for failure: not only failure of your application or platform, but also an outage of the data center your platform is running in. Cloud providers give you the ability to handle this by offering multiple availability zones (AZs) per region. Each AZ is a separate physical data center, which means you can design your system for the unlikely event of an entire AZ going down.
The following experiment tests for exactly that kind of failure, which has happened on several occasions in the past.
Experiment
To test this hypothesis, a worker node in a random AZ will be pulled down, simulating the data center losing power or the physical server suffering some kind of hardware failure.
Cluster autoscaling will be disabled to ensure that a replacement worker node is not created in the AZ, thus simulating a sustained AZ outage. A sketch of how this could be automated follows.
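One possible way to automate this on Kubernetes, sketched below with the official Python client, is to group worker nodes by the standard topology zone label, pick a random zone, then cordon a node in that zone and evict its pods. Treat this as an outline that assumes autoscaling is already off, not a production-ready script.

```python
# Sketch: simulate an AZ outage by cordoning a random worker node and
# evicting its pods (assumes cluster autoscaling is already disabled).
import random

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Group nodes by availability zone using the standard topology label.
ZONE_LABEL = "topology.kubernetes.io/zone"
zones: dict[str, list[str]] = {}
for node in core.list_node().items:
    zone = node.metadata.labels.get(ZONE_LABEL)
    if zone:
        zones.setdefault(zone, []).append(node.metadata.name)

victim_zone = random.choice(list(zones))
victim = random.choice(zones[victim_zone])
print(f"Simulating an outage of AZ {victim_zone} via node {victim}")

# Cordon the node so nothing new is scheduled onto it.
core.patch_node(victim, {"spec": {"unschedulable": True}})

# Evict every pod on the node so workloads reschedule into other AZs.
pods = core.list_pod_for_all_namespaces(
    field_selector=f"spec.nodeName={victim}"
).items
for pod in pods:
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(
            name=pod.metadata.name, namespace=pod.metadata.namespace
        )
    )
    core.create_namespaced_pod_eviction(
        name=pod.metadata.name, namespace=pod.metadata.namespace, body=eviction
    )
```

Evicting pods rather than deleting the node keeps the experiment reversible: uncordoning the node restores the steady state.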
Measure & Improve
This test will ensure the following is correctly configured:
- There are at least two data planes and two control planes, each on different worker nodes (i.e. in different AZs)
- The cluster has more than one worker node, each in a different AZ
As long as Kong Gateway and your cluster have been configured correctly, this outage should have no effect on your Kong deployment: there will be a data plane and a control plane in another AZ that are unaffected by the failure. To verify this, the following must be true (see the probe sketch after this list):
- The admin API is accessible (returns 200 HTTP responses)
- The data plane (proxy) is accessible (returns 200 HTTP responses)
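The same style of probe as in Scenario 1 applies here, with the admin API expectation flipped: both planes should keep answering from the surviving AZ. The URLs are again assumptions.

```python
# Sketch: verify the Scenario 2 expectations (URLs are assumptions).
import requests

CHECKS = {
    "admin API": "http://kong-control-plane.example.internal:8001/status",
    "proxy": "http://kong-data-plane.example.internal:8000/my-service",
}

for name, url in CHECKS.items():
    try:
        code = requests.get(url, timeout=3).status_code
    except requests.RequestException as exc:
        raise AssertionError(f"{name} unreachable during the AZ outage: {exc}") from exc
    assert code == 200, f"expected the {name} to return 200, got {code}"

print("Scenario 2 hypothesis held: Kong is unaffected by the AZ outage")
```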
Summary
Hopefully, this paints a clear picture of why it's important to test your distributed ecosystem in a controlled manner. It gives you the ability to understand how your platform responds to failures that you control, rather than having to dissect an issue after an outage has already cost your organization revenue.
You saw two test scenarios in a Kong Gateway hybrid deployment. However, this approach can and should be applied to any platform.
In part two of this series, we’ll go one step further to:
- Introduce Chaos Mesh
- Walk you through a tutorial to set up these experiments
- Analyze the response of Kong Gateway to these injected failures
- Recommend improvements for any failures
Planning for unknown failures and understanding how your platform behaves in these situations is the best way to improve the resilience of your API platform. This will reduce the number of critical events and increase confidence in the platform.
References
[1] https://www.gremlin.com/state-of-chaos-engineering/2021/
[2] https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
For more content from this author, check out this interview on Kongcast.