In this episode of Kongcast, I spoke with Jason Yee, director of advocacy at Gremlin, about the concept of chaos engineering, why even the best engineers can't control everything, and tools and tactics to help build app resiliency.
Check out the transcript and video from our conversation below, and be sure to subscribe to get email alerts for the latest new episodes.
Viktor: Jason, can you tell us a little bit about yourself? And how did you end up in this field of chaos engineering?
Jason: I work at Gremlin as the director of advocacy, which means I lead our developer relations team, including events. And as with any small startup, it’s all about wearing multiple hats. So, I do a bit of customer success to help out our customers and manage a few accounts and lend a hand wherever I can.
What's Chaos Engineering?
Chaos engineering is why I joined Gremlin, and it’s something that I’m super excited and passionate about. My experience with chaos engineering started a long time ago. I think most people who have heard of chaos engineering are familiar with things like Netflix’s Chaos Monkey, which is coming up on, I think, over 10 years old now. It’s pretty old.
Viktor: Time flies.
Jason: Yeah! Time flies, and at the same time, it feels like it’s still a new thing.
So we have this weird, maybe temporal vortex within tech where old things seem new despite being old, or maybe it’s just that they get reinvented. And I think that’s true for chaos engineering—it’s been reinvented.
I heard about it back when I worked at O’Reilly Media, where I helped run the Velocity conference, and after that I went to Datadog. At Datadog, we did chaos engineering, so I could get my feet wet - or get my hands dirty, if you will - and help with that practice. And that’s how I met the folks at Gremlin, and so I joined the team.
As we talk about reinventing, some of those early folks from Netflix and Amazon who pioneered chaos engineering realized that not everybody is Netflix or Amazon. And so, they decided to make a platform that would make it easy for everybody else to do chaos engineering. And so that’s basically what Gremlin is.
Viktor: That’s exciting. I came across this topic by reading the Netflix engineering blog during the time of the Great Migration to the cloud that Netflix is famous for. We learned about a lot of the cool tools they built, starting with the tools for their build process that ended up being open sourced as Nebula, a set of plugins for Gradle, and some of the cloud management tools that became a kind of foundation for the Spring Cloud world, and of course Chaos Monkey.
While I was preparing for this episode, I found a very interesting YouTube channel whose host bears some resemblance to you. It’s called "Reliability Engineering," and the host talked about the Chaos Monkey. So what is the Chaos Monkey all about?
In my day, we didn’t need a monkey. We might have had a night janitor who randomly disconnected the servers when he came into the office and started mopping the floors. That’s how we did chaos engineering back then. What changed?
Jason: I completely resonate with that. Pulling cables was one of my first experiences out of college. Several decades ago, I was testing whether our UPS system would work. We had a bunch of on-prem servers because the cloud didn’t exist, and we had this great battery backup system. We wanted to ensure that our servers would stay up by pulling the cable out of the wall. Except I pulled the wrong cable - I pulled the cable from the server to the UPS. So obviously, that didn’t work.
Viktor: Another example is when you accidentally knock over one of your server boxes, and some of your servers die.
Or you start randomly pulling disks out of the disk array, and then you see how your data gets corrupted. But we did this unintentionally - sometimes it was just stupidity, sometimes inexperience. My understanding of this approach - and I’m stealing a quote from one of the episodes of your YouTube show - is that we need to embrace this chaos instead of fighting it and trying to build a 100% reliable system. We need to embrace these failures and understand that these things will happen. Rather than building 100% reliable systems, we need to build systems that are resilient to failures. You need to be prepared for how the system will work in a broken state.
Jason: Yeah, absolutely. If we think back to that original Chaos Monkey, it was: how do we embrace failure? How do we decide that failure is normal? That original Netflix team’s idea was, let’s just introduce failure so it’s always there, and then our teams will work around it. They’ll build systems that respond gracefully to servers going down and just restart.
Cloud native infrastructures are so complex. You have microservices, and you have all these dependencies. And there’s an understanding that when failure happens, it sometimes or many times is not what you would expect.
One of the ways to make your systems better is not just embracing failure. It’s also understanding that you don’t know everything about how your systems will behave.
Invest time in understanding how they work in the real world, rather than how they look on an architecture diagram, by introducing that failure and seeing what happens in the real world.
Viktor: And you learn which components the system depends upon and which components need something additional - maybe, you know, a backup or an additional set of replicas, things like that. And apparently, this topic is still very relevant in the cloud world. Would you agree with this?
Jason: Oh yeah. Absolutely.
Viktor: The vendors give us some assurance that they will run your infrastructure and that they have multiple zones and regions that provide some reliability. But things happen. Five years ago, a tractor hit an Amazon data center - or not even the data center itself, but an electrical station near the data center. That’s an example of the unpredictable things and human factors that come into play. You can try to build data centers that are 100% protected, but sometimes accidents will happen. Be prepared for this type of scenario. That’s the essence of chaos engineering, right?
Jason: Yeah, the essence is that you can’t control everything, even if you’re the world’s best engineer. You can’t account for everything because you're constantly interacting with other services in our distributed cloud world, especially outside your domain. You use services like Kong or Gremlin. And these are outside of your control. You have to have a better understanding of when things fail, whether that’s yours or these other systems, what does that look like? How does that affect you?
We’ve gotten to a world with everything so connected that your customers don’t care. If Amazon goes down, and it’s not Amazon - it happens to be some major internet provider that runs the backbone between, say, the U.S. and Europe - nobody cares. Everybody just thinks, "Hey, Amazon’s down," and you take the blame for that.
And so, with your own systems, your reliability domain extends beyond just the systems you run. You have to start accounting for all of those uncontrollable factors. And again, a big part of that is understanding how your systems work.
I don’t think anybody would expect your systems to operate if the entire internet went down, but at least you know if one of your dependencies or your services goes down, you should have an understanding of what will happen. What does that look like so that you can respond appropriately?
Chaos Engineering Patterns
Viktor: Netflix was also known for advocating certain patterns for building resilient applications, such as circuit breakers. We’re talking about the modern microservices world rather than one monolithic application where, if one thing goes down, everything goes down. In the microservices world, some components will continue to run in the case of a partial failure of other components.
If you have an ordering system that needs information about addresses, and the addresses are provided by another API, how do you keep running if that service goes down? You need some kind of local cache or state that you can use as a recovery procedure. That’s something I also learned from the Netflix experience. Can you give similar examples of patterns that we’re starting to see more and more in application architectures and design?
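The local-cache fallback Viktor describes can be sketched in a few lines. This is a hedged illustration only - the names here (`fetch_address`, `AddressServiceError`, `flaky_lookup`) are hypothetical, not an API from the episode:

```python
# Sketch of a local-cache fallback for a dependency that may go down.
# All names are hypothetical illustrations, not a real API.

_address_cache = {}  # last known good responses, keyed by customer ID

class AddressServiceError(Exception):
    """Raised when the (hypothetical) remote address API is unreachable."""

def fetch_address(customer_id, remote_lookup):
    """Try the remote address API; fall back to the local cache on failure."""
    try:
        address = remote_lookup(customer_id)   # call the external API
        _address_cache[customer_id] = address  # refresh the cache on success
        return address
    except AddressServiceError:
        if customer_id in _address_cache:
            return _address_cache[customer_id] # serve stale-but-usable data
        raise                                  # no fallback available

# Usage: a lookup that works once, then starts failing.
calls = {"n": 0}
def flaky_lookup(cid):
    calls["n"] += 1
    if calls["n"] > 1:
        raise AddressServiceError("address API is down")
    return {"street": "123 Main St"}

first = fetch_address("c42", flaky_lookup)   # remote succeeds, cache fills
second = fetch_address("c42", flaky_lookup)  # remote fails, cache serves
```

The ordering service degrades to possibly stale address data instead of failing the whole order, which is exactly the partial-failure behavior Viktor is after.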
Jason: Yeah, absolutely. You mentioned circuit breakers, and those usually go along with retries. Often, we have circuit breakers not because of external calls, but because of some sort of cascading failure of retries. And how do you tune those? What does your backoff look like when something fails? Is your system just going to keep retrying, queuing up more and more requests? Soon you’ll have that thundering herd problem.
Things like that are commonly tested with chaos engineering. Retries are easy to test by simply introducing some latency into your system, dialing that latency up, and seeing at what point you start to hit your retries and what that does. Does it build up into a queue, or does the system handle it gracefully and realize that maybe that service, although it’s only latent and responding slowly, should actually be considered broken and down? Then we kill it, start up a new instance and hope the new one doesn’t have the same latency problem.
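The retry-plus-breaker interaction Jason describes can be sketched as follows. This is a minimal, hypothetical illustration (not Gremlin or Netflix code); the `sleep` parameter is injectable so the backoff can be exercised in tests without real delays:

```python
import random
import time

class CircuitOpen(Exception):
    """Raised when the breaker refuses to attempt the call at all."""

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, op, max_retries=5, base_delay=0.05, sleep=None):
        sleep = sleep or time.sleep  # injectable for testing
        if self.failures >= self.threshold:
            raise CircuitOpen("breaker is open; not attempting the call")
        for attempt in range(max_retries):
            try:
                result = op()
                self.failures = 0  # a success closes the breaker again
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    raise CircuitOpen("breaker opened after repeated failures")
                # Exponential backoff with jitter to avoid a thundering herd
                # of synchronized retries piling onto a struggling service.
                sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        raise RuntimeError("retries exhausted")
```

A latency experiment would then tell you where your real timeouts sit relative to this backoff curve, and whether the breaker actually opens before the retry queue builds up.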
When to Use Chaos Engineering
Viktor: In your opinion, when people start looking into chaos engineering, should they see it as something that runs side by side with their production environments, or do they need to start gradually and introduce it in a stage like integration testing, for example? Is chaos engineering part of the production phase or the pre-production integration phase?
Jason: I think it’s both. Often, people are very afraid to go straight into production. And I think again, that’s a sort of misconception, or maybe an evolution from those original Netflix Chaos Monkey practices. They just went straight into production and started killing things. That’s scary. If you’re thinking of that and you’re not scared by it, maybe something’s wrong.
I highly recommend that people start in staging environments, where they work just like they would with code. You build things in development; you push them to staging and QA before rolling them into production.
And I take the same approach with chaos engineering that by the time you’re pushing into production, it should just be like any of the other code you’re working on. You should have full confidence that it will do what you think it’s going to do.
That said, when it comes to automation, we often find that people start manually: they’ll run manual chaos experiments in those pre-prod environments and then manually run them in prod. From there, they’ll usually roll that into their automation - their CI/CD pipeline - for that testing.
Generally, it’s a manual process into production and then automation to prevent regression. That’s typically how I’ve seen it, although many companies work differently. I have worked with a few customers that go straight into CI/CD. They know they want this, they know exactly what they’re going to do, and everything has to be automated.
It's Not Just About Breaking Things
Viktor: And this is very important to point out. Chaos engineering, or failure injection, does not necessarily mean killing things or breaking services. Jason mentioned introducing latency and seeing how your application behaves. It’s a normal situation for a database to become slow, either because of network saturation when someone else suddenly starts using the same channel to your database, or because the indexes in that database are not optimized, and things like that.
When we develop a certain feature, we often work under the assumption that everything will be, you know, nice and green, sunny with unicorns and things like that. However, simply watching how your system behaves as you increase the latency between components is already a big eye-opener. This is especially true for how certain things are configured in your system, or for requests you think need to be synchronous: when you test those in a unit or integration test, the response times look very quick.
Jason: Breaking things is definitely fun - well, at least when it’s not in production. When you’re in that exploratory phase, it is fun. I like to caution people that that’s not what chaos engineering is about, but at the same time, if what you’re trying to do is understand your systems, it is fun.
You should take time to break your systems to learn how they break and what that looks like, particularly for folks that are maybe not on the dev side, but more on the operations side where you’re on call and things will break, and you’ll get a PagerDuty alert and need to wake up in the middle of the night and fix it.
Having a better understanding of how things break and what that looks like makes you a better engineer. So as much as you shouldn’t just go breaking things, you should break things at least occasionally just to get more experience and better understanding.
Viktor: There’s a gentleman, Kyle Kingsbury, known as the guy who breaks databases for fun and profit. Essentially, his idea is to validate the consistency claims of data systems - databases and things like that. His blog, Jepsen, is very popular, and so are his tools. It’s a very interesting read, with very interesting outcomes compared to what some software vendors promise you. You can learn more about a system that way, and he also gives the authors of those systems suggestions for improvement. His goal is not to break the system per se; his goal is to confirm or debunk claims about consistency, which is a huge part of the distributed systems world, especially the consistency of data across applications. And I think that understanding, like Jason mentioned, of how the system works is important.
How Does Gremlin Chaos Engineering Work?
Viktor: So let’s talk about Gremlin. My understanding is that the name comes from Joe Dante’s famous 1984 movie. Chris Columbus wrote the screenplay, inspired by tales from his grandparents, who served in the US Air Force: sometimes when the planes were breaking down, people said the gremlins were doing it - you know, the small creatures that get in there and cause some mayhem. So what does Gremlin do exactly?
Jason: The lore actually traces back to Roald Dahl, the famous author, who wrote the first book about gremlins - we have a copy of that classic book in the Gremlin office in San Jose. Gremlins were essentially these mythical creatures that would both help build technology and destroy it. Usually, the destruction came when they were upset with humans. And so that’s where the name came from.
But the tool itself is kind of similar. We go in there and help you inject failure in a safe, controllable way. Generally, you would install the Gremlin agent on your servers. We support VMs, so you’d install our agent as a binary on your machines; Linux and Windows are supported. Or you can do it on Kubernetes: for those who know Kubernetes, we have a DaemonSet that installs the agent, so you get one Gremlin agent per node.
From there, you’d use our control plane. You can use the web UI, which many folks do, but if you’re starting to do that automation - that CI/CD - we have an API. You communicate with our control plane, which securely sends messages out to all of the agents to figure out what targets are available; then you choose which targets you want to attack and which attacks you want to run. From there, it communicates out, and those agents run the attacks.
We have 11 types of attacks, including resource attacks, such as consuming CPU, disk space, disk I/O or memory. We have state attacks, which include shutting down servers or containers, killing processes or changing the time on a machine.
Changing the time is one of my favorites. It’s often overlooked but useful for testing things like SSL certificate expiration or, as Viktor mentioned, time-critical data streams: what happens when you get data too far in the past, or errant data that looks like it’s coming from the future? And then our third type of attack is network attacks - introducing latency, killing the network, corrupting or dropping packets and blocking DNS.
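The clock-change case Jason mentions can be made concrete with a small sketch. This is not Gremlin's implementation - a real time attack shifts the machine clock itself - but it shows the question such an experiment asks about a certificate's validity window (the dates and `cert_valid_at` helper are hypothetical):

```python
from datetime import datetime, timedelta

def cert_valid_at(not_before, not_after, now):
    """Is a certificate with this validity window valid at time `now`?"""
    return not_before <= now <= not_after

# A hypothetical 90-day certificate, like those issued by many modern CAs.
not_before = datetime(2024, 1, 1)
not_after = not_before + timedelta(days=90)

today = datetime(2024, 2, 1)
shifted = today + timedelta(days=120)  # a time attack jumps the clock forward

ok_now = cert_valid_at(not_before, not_after, today)        # still valid
ok_shifted = cert_valid_at(not_before, not_after, shifted)  # expired
```

Running the real attack answers the follow-on question: when the cert does expire, does your system alert and fail gracefully, or silently drop connections?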
Viktor: As a user, what will you see, and how would you visualize it? What’s the outcome after we break the thing? What would we learn from it? How does Gremlin help developers build more reliable systems?
Jason: Yeah, we often recommend that people follow the scientific process. Think of chaos experiments like science experiments. You start by observing your system and asking yourself a question, and often that question is: does this work the way I think it does? Then you form a hypothesis - what do I think will happen? From there, inject the failure and analyze the results.
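The observe-hypothesize-inject-analyze loop Jason outlines can be sketched as a skeleton. This is a hedged, in-process illustration with hypothetical function names; a real run would use Gremlin's agents rather than toy fault injection:

```python
# Skeleton of a chaos experiment following the scientific process:
# observe steady state, inject a fault, analyze, and always clean up.

def run_experiment(steady_state_check, inject_fault, remove_fault):
    """Return a verdict string; abort if the system was unhealthy to begin with."""
    if not steady_state_check():
        return "aborted: system unhealthy before the experiment"
    inject_fault()
    try:
        healthy_during_fault = steady_state_check()
    finally:
        remove_fault()  # always halt the attack, even if the check blows up
    if healthy_during_fault:
        return "hypothesis held: steady state survived the fault"
    return "hypothesis failed: investigate and fix before rerunning"

# Usage: a toy service whose health flips while the fault is active.
state = {"faulty": False}
result = run_experiment(
    steady_state_check=lambda: not state["faulty"],
    inject_fault=lambda: state.update(faulty=True),
    remove_fault=lambda: state.update(faulty=False),
)
```

The `finally` block mirrors the "safe, controllable" point from earlier in the conversation: the injected failure is always halted, whatever the experiment finds.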