Recently, I was fortunate to have an insightful conversation with Matt Klein, Lyft software engineer and creator of Envoy, the popular open-source edge and service proxy for cloud-native applications. Envoy was the third project to graduate from the Cloud Native Computing Foundation (CNCF), following Kubernetes and Prometheus. Before Lyft, Matt held positions at Microsoft, Amazon and Twitter, and served on the oversight committee and board of the CNCF. He’s been working on operating systems, virtualization, distributed systems, networking, and making systems easy to operate for over 20 years.
The Q&A that follows includes excerpts from our talk, edited for grammatical clarity and length. A big thank you to Matt for sharing his thoughts.
What are three of the greatest pieces of software you’ve worked on or witnessed?
I think a lot of people will actually find this answer surprising. I started my career at Microsoft, and like any new hire, I worked on some random code bases, Windows CE and things that were very old, like pre-iPhone old. Then, eventually, I switched to working on Windows NT. It was so eye-opening to me as a new engineer to see the differences in code quality and process, even within Microsoft, going from the code bases that I had been working on to the Windows NT kernel. It was just a vastly different experience, and working on that code base definitely shaped a lot of my future engineering.
I would say the Windows NT kernel is definitely one of the best pieces of software that I’ve had the opportunity to work on. Windows NT is the kernel that’s in Windows 7, 10 and 11. Beyond that, I take a more pragmatic and practical view of things, which is that the business of software is often dirty. We’re often just trying to make the best of a set of bad decisions. A lot of things I’ve seen frankly have been more focused on business outcomes and less on software quality.
What are key things to pay attention to when it comes to building open source communities around the project?
Starting and running an open source community is no different than starting and running a company. You have all of the same concerns in terms of marketing and PR and trying to hire people—meaning maintainers and contributors. What I’ve found in open source and in leadership in general is that the tone, the way people communicate, and the general ethos of that community are set very early in the process. Humans tend to follow. If there are norms that are set early on, those norms tend to be followed. And so from my perspective, when building a project or building a community, how people communicate and how they treat each other has incredibly profound effects. With Envoy, it’s been an amazing journey and it’s become a very successful piece of software. But the thing that I’m actually most proud of is this community that we’ve built around a piece of software that’s in a highly technical area. It sounds ridiculous and trite, but being nice to people, being welcoming, and making it a place that people want to come and help, whether that’s with the code or documentation, that’s the most important thing.
What you’ll also find in open source and in most of these projects is that it’s often not difficult to find people who want to work on the core code, because it’s highly technical and interesting. But to grow a large and successful project and community, there are other things that are actually even more important, like working on CI and tooling or documentation or build system management. To find people that will help keep the machinery of the project running [is key]. It’s really important to make sure that you welcome all people and find a place for different people to actually join and contribute at the level at which they’re comfortable. Having a dialogue and being honest about how much time it takes to actually do community work, particularly early on in a project lifecycle, is super important.
If you wouldn’t choose C++ today, what language would you choose?
Almost definitely Rust at this point. I tend to be on the later side of the adoption curve, but my impression at this point is that the Rust ecosystem is sufficiently robust. I suspect that what you’re going to see more of in the industry over time, and are already seeing in the Linux kernel, is Rust being allowed side by side with C/C++. There’s a lot of ongoing research regarding how to add Rust to something like Chromium, which is a giant C++ code base. The main blocker is that Rust and C++ interoperability is a lot more complicated than Rust and C interoperability.
What’s next for xDS? We’re on version three now. (Note: Envoy’s discovery services and their corresponding APIs are referred to as xDS.)
That’s it. There will be no version beyond V3 in my lifetime or at least in my involvement in the project. Envoy’s API is going to be with us possibly much longer than the Envoy code itself. One day someone will rewrite Envoy in Rust or some other language. But I think that the API is going to be living with us for a very long time. Now we’re supporting gRPC and there are people who are using it internally with other things. When we started to plan the V2 to V3 migration, the project wasn’t as widely used as it is now. So, [the transition] inflicted pain on the industry. V4 is a theoretical thing in the future. It’s probably never going to happen. And the Envoy API, at this point, is used so pervasively that we’re in a position where the APIs are effectively forever.
What’s the status of WebAssembly?
My take on the WebAssembly support, and particularly WebAssembly and Envoy, is that it’s still very early. If you look at the work that Google is doing and you watch the check-ins, they’re still working with multiple runtimes. There are performance problems. So, I think the promise of WebAssembly—the ability to write code in any language, the ability to dynamically update that code in Envoy—is really awesome. We are going to get there, but it’s relatively early days and I think we certainly are going to see a lot more investment in that over the next couple of years.
What are your thoughts around UDPA (i.e., the Universal Data Plane API)? What’s the future look like?
We’ve had some tension within the xDS ecosystem for a long time. Envoy originated and popularized the APIs. But now there is increasing interest in the proxyless service mesh space. So obviously having gRPC support the xDS APIs [is important]. We know of private and internal use cases of other people supporting the APIs within their internal system so they can all interoperate with a central management server. We’ve even heard of plans of other cloud providers potentially supporting some of these APIs within their cloud load balancing products. So for a long time, we have viewed the future of the API ecosystem as beyond Envoy. Obviously, from a practical perspective, this is difficult. Envoy’s the one seeing the most development. Envoy is driving a lot of the code. We’re trying to figure out how we make these things general and how we support documentation across multiple clients. It gets complicated. So several years ago, we had this idea that we should really name these [general use case] APIs something different. That’s where we came up with the Universal Data Plane API. To avoid the confusion that arose [between xDS and UDPA naming], the concept of UDPA lives on, but the API will forever be called xDS. We’re committed to doing what it takes to onboard different clients, different proxies, and different systems.
What do you think of a service mesh interface?
The service mesh space is kind of like WebAssembly—it’s very early days. There are a lot of people trying to scramble to get market traction. There’s a focus on business outcomes right now and not necessarily technical outcomes. There’s a tension between simple configuration and complex configuration. What I typically recommend for people who are building products in this space is to figure out what the set of “easy-mode” functionality is that people want and draw a hard line in the sand: this is the easy-mode and this is how you’re going to onboard people into this lower-number-of-features configuration. You don’t want to end up doubling the API surface and reimplementing every feature. It’s fine to offer a simple mode, but at a certain point, the simple mode has to really translate into the underlying API, or you end up confusing your users more than you’re helping them.
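(Note: The idea of an easy mode that translates into the underlying API can be illustrated with a minimal sketch. Everything here is hypothetical: `EasyRoute`, `to_full_config`, and the field names are invented for illustration and are not part of Envoy, xDS, or any service mesh product.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EasyRoute:
    """Hypothetical easy-mode config: only the few knobs most users need."""
    service: str
    port: int
    timeout_seconds: float = 5.0
    retries: int = 2


def to_full_config(easy: EasyRoute) -> dict:
    """Translate easy mode into a (hypothetical) full configuration.

    Easy mode adds no behavior of its own: every field maps onto a field
    the full configuration already has, so there is a single feature set.
    """
    return {
        "cluster": {
            "name": easy.service,
            "endpoints": [{"address": easy.service, "port": easy.port}],
        },
        "route": {
            "match": {"prefix": "/"},
            "cluster": easy.service,
            "timeout_seconds": easy.timeout_seconds,
            "retry_policy": {"num_retries": easy.retries, "retry_on": "5xx"},
        },
    }


# Users who outgrow easy mode can inspect and edit the generated full config.
full = to_full_config(EasyRoute(service="checkout", port=8080))
```

(In this sketch the easy mode is only a front end. A user who needs more can take the generated structure and edit it directly, without a second, parallel API to learn.)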
This is a provocative question: What will a potential Envoy killer do better than Envoy today?
I think the project handles security really well, so I’m not really worried about that aspect of it.
The Envoy killer, in my view, is the cloud-native space over the next 10 to 20 years. I’m a big believer in the concepts around platform-as-a-service and functions-as-a-service. As an industry, we don’t want people to care about Kubernetes. We don’t want people to care about Envoy. It’s just too low-level. At the end of the day, think about the actual business logic developers do on a day-to-day basis. They want to write an API. They want to read and write to Pub/Sub, databases, caches, and call their services. If we can give them a containerized system that will offer that functionality, [that’s great]. Technologies like Dapr are actually really interesting, giving a general API for people to write services. My point is that if you look at where things are and where they’re going around services like (AWS) Fargate and a bunch of other platforms or functions-as-a-service over time, [business-oriented developers] don’t care about Envoy and Kubernetes. They just want to run code and do a small set of operations and just have it work. My general feeling is that the killer for some of these technologies is that they just disappear from view and they can be replaced with anything. If you’re running on a general lambda-based system, who cares what the ingress is like, whether it’s Envoy or something else? Who cares if you’re using Kubernetes? With an open community like Envoy and the functionality and the feature set that it has, I’m skeptical that something is going to come along and kill it in a similar way. I don’t think there’s going to be another Envoy. Instead, I think you’re going to see platforms that are going to kill it. Those platforms might use Envoy, but people won’t know that they’re using it. It’s just out of sight, out of mind. It can be replaced with any other technology, with a very limited feature set for what that platform needs. So that’s what I would say would kill it. Over time, people won’t run this type of software directly anymore.
And then, at that point, you can pretty much replace it with whatever.
Yes, it becomes invisible.
Exactly. For a lot of the infrastructure companies, frankly, if they’re not thinking about the platform future, [that’s a problem]. The world is moving to platforms. So, in the next 10 or 15 years, I think that’s just the way things are going to go.
What do you see as being the main use case for Envoy Mobile and what do you think the opportunity is in this space?
With Envoy Mobile, we recognized that the client is really the most important part of a distributed system as it’s the primary user interaction point. Being able to have a consistent networking stack and offer value on top of that directly at the client layer around observability, security, routing and other things is just as important, if not more important, than [work] on the server side.
Mobile networking is very complicated. Where we are from a project perspective is that Google is now working heavily on Envoy Mobile, which is a huge achievement. Envoy [could] be in the core networking stack of all of Google’s apps. So I’m still bullish on what we can do with Envoy Mobile.
Isovalent, the company behind Cilium, recently published an article stating that the future of service mesh is a proxy per node, not a proxy per replica (essentially, per pod). What do you think about this approach: a proxy per node, or a sidecar deployed next to each service?
Larger entities with a larger infrastructure footprint might be more sensitive to overall resource requirements. In terms of per pod or per node, there are pros and cons. Per pod is simpler to reason about from a security perspective. From a configuration perspective, per node is much more complicated, but obviously it can have benefits in terms of reduced resource usage. I know that people are now thinking about even hybrid approaches, for example, using a sidecar micro Envoy, which does almost nothing other than setting some headers and handling some of the security and configuration encapsulation, but then basically forwards traffic to either a DaemonSet or an off-node proxy that would do more of the heavy lifting. We’re at the early stages of things with service mesh, and I encourage people to go after the problems people are trying to solve. I encourage people to focus on actual technical needs and not the theoretical stuff that people tend to talk about.
Is there any open source project, other than Envoy, that you are excited about that the open source community has built?
Dapr is actually very, very interesting. I like the idea behind it—basically a set of cloud-native APIs that have abstractions that could be implemented. Think about a future world [where] you couple something like WebAssembly and Dapr and some other pieces. Envoy can actually be an application runtime. It offers a tremendous amount of functionality. The cool thing around something like Dapr is [the hope it offers of getting] to a world in which people can write to some set of APIs for caching and database or whatever, then we can run those either on-node or off-node and have different implementations of those APIs. That’s pretty interesting. And I think that gets us closer to this platform world and closer to the world in which a lot of these technologies, like Istio and Envoy and Kubernetes and whatever else, just don’t matter. They’re just an implementation detail, and that’s what I think most application developers want.
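(Note: The pattern described here, application code written against a small API with interchangeable implementations behind it, can be sketched as follows. This is a minimal illustration, not Dapr’s actual API: `KeyValueStore`, `InMemoryStore`, `NamespacedStore`, and `record_visit` are hypothetical names.)

```python
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical state API that the application codes against."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """One implementation: state kept in the local process."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class NamespacedStore:
    """Another implementation: wraps any store and prefixes every key.

    Stands in for a differently deployed backend (for example, a shared
    store where each application gets its own namespace).
    """

    def __init__(self, inner: KeyValueStore, namespace: str) -> None:
        self._inner = inner
        self._prefix = namespace + ":"

    def get(self, key: str) -> Optional[str]:
        return self._inner.get(self._prefix + key)

    def set(self, key: str, value: str) -> None:
        self._inner.set(self._prefix + key, value)


def record_visit(store: KeyValueStore, user: str) -> int:
    """Business logic: depends only on the API, not on the backend."""
    count = int(store.get(user) or "0") + 1
    store.set(user, str(count))
    return count
```

(Here `record_visit` runs unchanged against either store, which is the sense in which the backend becomes an implementation detail.)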
Thank you, Matt!