October 4, 2021

Services Don’t Have to Be Eight-9s Reliable

Viktor Gamov

In our first episode of Kongcast, I had the pleasure of speaking with Liz Fong-Jones, principal developer advocate at Honeycomb, about the concept of error budgets for service level objectives (SLOs) and how to accelerate software delivery with observability.

Check out the transcript and video from our conversation below, and be sure to subscribe to get email alerts for new episodes.

Viktor: You are known in the observability community and SRE community very well. I've followed your work for a while during my time at Confluent, so I’m super excited to have you on the show. Can you please tell us a little bit about yourself? Like what do you do? And what are you up to these days?

Liz: Sure. So I’ve worked as a site reliability engineer for roughly 15 years, and I took this interesting pivot about five years ago. I switched from being a site reliability engineer on individual teams like Google Flights or Google Cloud Load Balancer to advocating for the wider SRE community.

It turns out that there are more people outside of Google practicing SRE than there are inside of Google practicing SRE.

I wanted to help everyone in the community share best practices and have more manageable systems. So that journey two-and-a-half years ago led me to work at Honeycomb, where I help people think about making it easier to debug their systems.

Viktor: That’s great. And before we go into this aspect, I want to explore a little bit about your past. Is it fair to say that you coined the statement "class SRE implements DevOps"?

Liz: Yeah, that was definitely something that Seth Vargo and I came up with. And then that became a chapter in The Site Reliability Workbook.

Viktor: So let’s talk a little bit about this. Tell us your definition of DevOps and how it’s different from SRE.

Liz: Yeah, I think there’s this interesting pattern that happened in the early 2000s. Different organizational movements sprang up to solve how we wrangle needing to ship software faster and more reliably.

What happens if the software that we write is critical?

And one of those movements was DevOps, which was a cultural movement that really aimed to make it possible for software engineers to run their own software to kind of practice continuous delivery, kind of practice agile operations.

And the other phenomenon that we saw was SRE, which happened at Google, where Ben Treynor Sloss, the founder of SRE, said we need to take a different approach to run Google’s systems. It was a similar set of problems and almost a similar solution to those problems. To start, the solutions were encoded in Google’s particular organizational structures.

And over time, what happened was these two communities were initially a little bit at each other’s throats over who came first and whose approach is the right approach. But as one of the people from Google who had a lot of interactions with the outside community, I looked at this, and I was like, "Wait a second. DevOps principles - every single one of them - are implemented in site reliability engineering. And we might call it something different, but it’s like a concrete, opinionated implementation of the DevOps principles."

DevOps gives you a lot of freedom in how you do things.

There are some best practices from the community. But I think what I see happening more and more is that both the DevOps and site reliability engineering communities are actually turning into one community that is starting to become known in the field as platform engineering.

Instead of having specialists in operations, you are building infrastructure for product teams at your company, figuring out what’s the right platform, how to contribute to it, how to make developers productive. So it kind of encompasses more than just production operations. It encompasses things like continuous delivery, it encompasses things like build infrastructure and it encompasses things like reliability measurements - kind of this broader field that I think has a little bit more leverage.

Viktor: So this is a very good point. And usually, when you talk about this reliability, for many people who are not doing this as their day-to-day thing, it might sound like something intangible or very difficult to measure. What exactly is this, and how would you measure reliability? Because the people want to have their systems reliable and run 24/7. But if it runs, it runs. If it doesn’t, it doesn’t. What is in between, and how do you measure this like a fuzzy thing in between? What would be the right metric to look at?

Liz: Yeah. When I introduce people to the idea of a service level objective and an error budget, it kind of blows their minds a little bit. But the way that I concretely anchor it is by pointing out that, beyond a certain point, it doesn’t matter how reliable you make your service. It doesn’t make sense to spend billions of dollars launching satellites into space to make your product more reliable or get that additional nanosecond or femtosecond of uptime if, ultimately, at the end of the day, people access it from their phones, and phones run out of battery and phones run out of cell signal.

Your cell phone, at best, is going to be about 99.95% reliable. So why would you make your service six-9s, seven-9s, eight-9s reliable?

It just doesn’t make sense. It’s a poor investment of money, and it's a poor investment of your energy too. All the energy you’re over-investing in reliability could have been spent on innovation.

So that’s kind of on you. Figure out the customer's minimum requirements for reliability. Are they OK with hitting refresh now and then if it doesn’t work right? And after that, you can invest everything else in features. So it’s kind of helping quantify that trade-off.

Viktor: Is it related to the service level agreement with our customers stating that this is the level of service we provide and why you’re paying us money? We're mostly talking about the SaaS teams, but it’s also applicable to teams who run some internal software. Maybe it’s not available outside the Internet, but many companies run internal platforms. The system just needs to be reliable.

Liz: Yeah, that’s a really great question.

Service level objectives do not necessarily require you to have a service level agreement.

And often, your SLO and SLA will be different even if there is an external customer with an SLA.

You have to think about the SLA. That’s a sales contract, and that's a legal contract. What’s the minimum amount of reliability below which we have to pay you money or give you a refund? But when we think about customer happiness, we should aim to make customers happy and delighted, not just, you know, not suing us.

So I think that that concept of thinking about what we are defining as a good user experience translates regardless of whether you’re serving internal or external customers. Figure out what your customers need.

A product manager or a user experience researcher can be very helpful when you are trying to understand your users' needs, whether internal or external.

Viktor: Yeah, and the important thing for me, as you mentioned, is that we want our users to be delighted, and we don’t want our users to sue us. So that’s a very important thing for how the engineers, as opposed to, for example, the sales organization, look at things. The engineers want to build systems that users will enjoy using.

Liz: You want your engineers to know if something is in danger of violating the SLO long before you get to the point of SLAs being involved. That’s kind of why you need to be a little bit more aggressive with your SLOs.

Viktor: Yeah. Can you give us some good examples of structuring service level objectives for the teams? Maybe some of the pointers to interesting reads where they can learn a little bit more about this?

Liz: So I highly recommend things like the Google Site Reliability Engineering book. And there actually is a recent publication by my friend Alex Hidalgo called Implementing Service Level Objectives, and both of those are published by O’Reilly.

But I think when it comes down to concretely understanding it, let’s talk through an actual example. So at Honeycomb, we are a company that helps you understand your production systems. We have to ingest telemetry from your production systems to encapsulate transactions flowing through your microservices and application.

So one of our service level objectives is that if you send us an event, 99.99% of the time, we’re going to ingest it successfully and make it available for query. That means that less than one in 10,000 events can be dropped. That does not mean that we’re aiming for perfection. But it also means that it is a stricter threshold than one in 100 or one in 1,000 events that can be dropped.

It’s striking a balance that enables us to innovate and make production changes, while at the same time preserving your trust in the fidelity of the data that we’re giving you.

So that’s kind of the example of an SLO, which says, you know, this is our target - 99.99% - and how we measure it. And there is the period that we measure it over. For instance, we say we’re aiming for four nines over a 30-day window.

And roughly, if we had a full outage, we could only do that for 4.3 minutes before we blow our SLO and say, "OK, we need to reset. We need to revisit reliability."
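The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration, not Honeycomb's implementation: the function names `error_budget` and `allowed_bad_events` are hypothetical, and the only inputs taken from the conversation are the 99.99% target and the 30-day window.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Time the service could be fully down in the window and still meet the SLO."""
    return window * (1.0 - slo_target)


def allowed_bad_events(total_events: int, slo_target: float) -> int:
    """Event-based view of the same budget: how many events may fail."""
    return round(total_events * (1.0 - slo_target))


# 99.99% over a 30-day window, as in the example above.
budget = error_budget(0.9999, timedelta(days=30))
budget_minutes = budget.total_seconds() / 60
print(f"Full-outage budget: {budget_minutes:.2f} minutes")  # about 4.32 minutes

# One in 10,000 events: 100 failures allowed per million events.
print(allowed_bad_events(1_000_000, 0.9999))
```

The time-based and event-based views are two ways of spending the same budget; which one to track depends on whether the SLO is stated in terms of uptime or of successful events.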

To get back to your point about being fully up or fully down: it's not up or down. The answer is sometimes it’s kind of up and down.

Basically, at a steady state, we might expect maybe one out of every 10 to the fifth, one out of every 100,000, requests to just time out or get dropped, and that's just natural noise in the system. Or we might accidentally serve like one percent errors for a little while. And if we’re serving one percent errors, we can tolerate that for a lot longer than 4.3 minutes. We can tolerate that for, I think, 400 minutes. That gives us enough time to respond and correct the anomaly.
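To make the partial-outage reasoning concrete, here is a small sketch of how long a given error rate can be sustained before a 99.99% / 30-day budget is spent. The function name is hypothetical, and the calculation assumes traffic is spread evenly across the window, which real traffic rarely is.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, in minutes


def minutes_until_budget_exhausted(slo_target: float, error_rate: float) -> float:
    """Minutes a constant error rate can persist before the whole budget is used.

    Assumes uniform traffic, so the fraction of failed requests and the
    fraction of elapsed time can be traded off directly.
    """
    if error_rate <= 0:
        return float("inf")
    budget_fraction = 1.0 - slo_target
    return WINDOW_MINUTES * budget_fraction / error_rate


full_outage = minutes_until_budget_exhausted(0.9999, 1.0)     # every request fails
one_percent = minutes_until_budget_exhausted(0.9999, 0.01)    # 1% of requests fail
background = minutes_until_budget_exhausted(0.9999, 0.00001)  # 1 in 100,000 fails

print(f"{full_outage:.1f} min, {one_percent:.0f} min, {background:.0f} min")
```

Under these assumptions, a one percent error rate lasts about 432 minutes, in line with the rough figure of 400 quoted above, while the one-in-100,000 background noise would take far longer than the 30-day window to exhaust the budget.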

We stop thinking about things in black and white, up or down. We think about things in terms of degree, like, how bad is it really? Well, it depends on how far down we are.

Viktor: Yeah, and I think you mentioned a very good point about these numbers. These measurements - the thresholds allow us to have the internal ability to make the changes we want. So it’s not like, "We were down already this month for two minutes. Like, can we afford to be down another minute to do a quick deployment of a very important feature?" That’s something that you were talking about.

Correct me if I’m wrong, when we’re talking about the error budget. You can afford to fail, and it should be kind of like in the mindset that it’s OK to fail because computers are notoriously unreliable - people even more notoriously unreliable. So mistakes will happen. So we can work around this, but there should be some threshold.

And to the point of this threshold and this innovation, what’s your take on "move fast, break things" - that mindset that was emphasized at Facebook back in the day? There's probably a middle ground somewhere, right? You cannot move slowly because, at today's pace of innovation, you will die as a business.

Liz: Yeah, that’s where I think we were reformulating Honeycomb’s values recently.

One of the things that stuck around is this idea that fast and close to right is better than perfect, and you still have to be close to right.

You don’t necessarily have to put the perfect polish on everything because you do have to innovate. So I think what encapsulates a lot of our thinking around that is that there are some things that you absolutely have to get right.

For instance, privacy is a thing where you permanently lose people's trust if you get it wrong once. I think that when it comes to broad reliability and less critical things, yes, you can afford to break them as long as you don’t do it too often and you are not breaking things all the time. Figure out what level you need to preserve your users' trust, and then aim for that level.

Viktor: Yeah. And I think it’s a good mindset to have as a part of a company's culture. No one will fire you if you make a mistake; however, you should try not to do so very often…

Liz: Don’t be reckless. But if you do make a mistake, we have a blameless culture. We are interested in understanding how it broke and how we can fortify the system so it doesn’t happen again.

Viktor: Exactly. And why this happened, and how will we not make this happen in the future? How can we make the system strong enough?

What’s your take on continuous delivery? Like how often do you deploy to production at Honeycomb? I remember there were some magical numbers from the bigger companies, like Netflix and Amazon, that deploy thousands of times per day in production. What’s your deployment cadence at Honeycomb?

Liz: We're deploying to production about 12 to 14 times per day, and we would like to make that faster.

To us, it’s not necessarily the number of times that we deploy per day that really matters. What matters is the latency.

How long does it take between a developer writing a change and it landing in production? And for us, it is less than two hours - typically less than an hour even. And we think that that’s really the sweet spot. You know, ideally, we’d like it to be closer to 15 minutes. Right? But 15 minutes to one hour is the sweet spot for kind of the time between writing a piece of code, getting code review and sending it to production.

Because if you have that short feedback loop, that enables you to watch it as it goes into production instead of walking away to get a cup of coffee or coming back tomorrow. And that gives you a degree of confidence in your code doing exactly what you meant it to. So I think that that’s kind of the critical number.

Obviously, the Netflixes and Facebooks of the world have many more engineers than we do. So, of course, they’re going to need to deploy to production many more times per day than we do, and I don’t think that’s a meaningful metric.

I look to the DORA metrics from the Accelerate State of DevOps report. I think that those metrics are a more precise encapsulation than how many times you deploy.

Viktor: It’s very good to point out that it’s not about how often but what the quality of this release would be…

Liz: Exactly. Like what percentage of your releases are rolled back? Or what percentage of them fail catastrophically? Which ones can you break? And a failure is OK as long as you have some kind of rollback mechanism. Do you have the ability to route traffic? Do you have the ability to quickly turn off a feature flag? I think that those things can really mitigate the allowed amount of failure in your releases.
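Two of the numbers discussed here, the time from writing a change to running it in production and the share of releases that get rolled back, are simple to compute once deploys are recorded. The sketch below is illustrative only: the `Deploy` record, its field names and the sample data are all hypothetical, not taken from Honeycomb or from the DORA reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deploy:
    committed_at: datetime  # when the change was written/committed
    deployed_at: datetime   # when it reached production
    rolled_back: bool = False


def median_lead_time_minutes(deploys: list[Deploy]) -> float:
    """Median minutes between committing a change and deploying it."""
    return median(
        (d.deployed_at - d.committed_at).total_seconds() / 60 for d in deploys
    )


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Fraction of deploys that had to be rolled back."""
    return sum(d.rolled_back for d in deploys) / len(deploys)


# Made-up sample data: four deploys on one day, one of them rolled back.
start = datetime(2021, 10, 4, 9, 0)
deploys = [
    Deploy(start, start + timedelta(minutes=30)),
    Deploy(start, start + timedelta(minutes=45)),
    Deploy(start, start + timedelta(minutes=50), rolled_back=True),
    Deploy(start, start + timedelta(minutes=90)),
]

lead_time = median_lead_time_minutes(deploys)  # 47.5 minutes
failure_rate = change_failure_rate(deploys)    # 0.25
print(lead_time, failure_rate)
```

A median is used rather than a mean so that one unusually slow change does not dominate the lead-time figure.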

Viktor: When I was in consultancy and worked as a solutions architect in my previous jobs, we talked about regression and ensuring that we were not making our system worse.

I usually tell people that before you start doing this, establish a baseline. This baseline would depend on your external SLA, internal SLOs and things like that. However, you cannot just say we want to be two times faster because you don’t know two times from what point. And usually, people talk about how we can increase throughput or reduce latency. But from what point?

And we are talking about observing the system as soon as possible and putting these measurements in place as soon as possible. I’m not talking about performance measurement, but at least some numbers you can understand about how the system performs.

Observability is one of the pillars and one of the…how does Seth put this? One of the pillars of DevOps and methods SREs practice is to provide a measure for "everything." But everything - not necessarily everything - but everything that matters.

Liz: Yes. I love that formulation, everything that matters.

Viktor: I would like to hear your philosophy around this since this is your bread and butter and a little bit of honey on the top…pun intended. What kind of tools, what kind of practices do people need, in your opinion? What do you say when you start consulting with the customer and explaining why this is important?

Liz: Yes, I often point to time to recovery. SLOs cover time to meaningful detection: how long it took until you detected something impacting customers.

It’s equally important to pay attention to making sure the signals and telemetry coming out of your system will enable you to understand why things are broken, how things are broken and for whom things are broken.

Often, problems relate these days to complex interactions of specific services at specific versions for specific customers. And you’re never going to turn that up if you kind of have bulk aggregate statistics.

So when we talk about measuring the right things, I think the primary source of data that I encourage people to have is distributed tracing data. If you have a service mesh or if you can instrument your code, that ability to trace the execution of a request flowing through your system really helps you understand where the latency is coming from.

If I’m in danger of blowing my service level objective, if I’m supposed to be achieving 200-millisecond page loads, and instead I’m starting to serve a lot of 500-millisecond page loads, being able to track down where that extra 300 milliseconds is coming from is super powerful. And it gets even more powerful when it’s not just anonymous traces in the system. You can figure out which microservice, maybe, but you can also have things like, you know, region and language. You have all of these meaningful fields attached to your traces so that you can then kind of slice and dice and iteratively explore and understand the commonalities.
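As a rough illustration of that slicing, the sketch below groups trace spans by one attribute at a time and reports what share of each group is slower than the 200-millisecond target. The span records, field names and numbers are invented for the example; a real tracing backend would run this kind of query over far more data.

```python
from collections import defaultdict

# Hypothetical spans, each carrying attributes to slice on.
spans = [
    {"region": "us-east", "language": "en", "duration_ms": 180},
    {"region": "us-east", "language": "de", "duration_ms": 195},
    {"region": "eu-west", "language": "en", "duration_ms": 510},
    {"region": "eu-west", "language": "de", "duration_ms": 495},
    {"region": "us-east", "language": "en", "duration_ms": 190},
    {"region": "eu-west", "language": "en", "duration_ms": 520},
]


def slow_share_by(spans: list[dict], field: str, threshold_ms: float = 200) -> dict:
    """Fraction of spans slower than threshold_ms, grouped by one attribute."""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for span in spans:
        key = span[field]
        totals[key] += 1
        if span["duration_ms"] > threshold_ms:
            slow[key] += 1
    return {key: slow[key] / totals[key] for key in totals}


by_region = slow_share_by(spans, "region")      # {'us-east': 0.0, 'eu-west': 1.0}
by_language = slow_share_by(spans, "language")  # {'en': 0.5, 'de': 0.5}
print(by_region, by_language)
```

In this made-up data, language tells you nothing (half of each group is slow), while region separates the slow requests cleanly. That contrast is the kind of commonality the slice-and-dice exploration is meant to surface.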

I think that that really helps you debug things faster because the old approach of, you know, "measure everything" was "Let’s monitor everything, let’s set alerts on everything. CPU over 90 percent, alert on it. Disk over 90 percent, alert on it. More than 500 requests per second to the service, alert on it." You get lost in all the noise. Right? And that’s why I love this thing that you said about measuring what matters. Because measuring everything turns out to result in you drowning in noise.

Viktor: Exactly. It’s like the boy who was screaming about the wolves. And all of a sudden, you don’t care about the wolves anymore because it happened too often, and you just kind of get used to it. One of the important things for people to understand is that they need to establish the metrics that are actually actionable or like some threshold. Maybe it’s just like a CPU spike for a couple of seconds. You know, things happened. As I said, computers are unreliable. Software is also unreliable…

Liz: It doesn’t matter unless it has an impact on users. And that impact on users, we’ve already defined that in terms of our service level objectives.

Viktor: Yeah, yeah, I agree. So distributed tracing is the thing to implement into the system. There are plenty of different products - some open source, some from SaaS companies - that implement it. I know that you are participating in some of the OpenTelemetry initiatives, where it exists too. First of all, it’s actually the merger of two projects, as far as I understand.

Liz: Yes. So in the past, there were kind of three different approaches you could take when you wanted to implement distributed tracing. One of them was to go with your vendor’s proprietary solution, and then the other two were OpenTracing and OpenCensus.

One of them was founded by the Jaeger and Zipkin communities, and Google founded the other one. And kind of basically all these folks came together and said, you know what, it doesn’t make sense to have this plethora of different vendors and OpenCensus and OpenTracing. Let’s just make it into one project. Let’s make it so that you can instrument your code once and have all these common integrations and no longer need to worry about changing vendors. If I want to change from an open source solution to a vendor solution or vice versa, do I need to change everything? Can we just make it simple and easy for everyone?

So, yeah, I think that that is an approach that has now paid off. We have been officially announced as a CNCF incubated project, and we're the second most active project in the CNCF after Kubernetes. Of course, Kubernetes is kind of a lot bigger than us.

Viktor: And it’s hard to ever live up to the numbers of Kubernetes, yeah.

Liz: But we have hundreds of companies participating. We have hundreds of thousands of engineers who have contributed at one point in the past. It is a sign of project health and interest that so many people have said that instrumentation is table stakes. Our users are complaining about how much effort it takes. Let’s just make it simple for everyone to instead focus on innovating on data analysis, which is the more interesting bit.

Viktor: That’s very interesting. So you mentioned hundreds of vendors and probably thousands of users, too. And I was always wondering what’s a good sign or how to steer this community to keep it healthy. Some vendors might have some great ideas or some great implementation, and they want to do this, and some other vendors will have a pushback. Do you have experience, too? How can different vendors and users collaborate in a healthy way?

Liz: Yeah, I think that we admire the Apache Foundation projects where everyone who is working on the project may be employed by a vendor, employed by someone else, or may be a volunteer.

But ultimately, we anchor things in the right thing to do for our customer base and what best suits our mission. All of us are trying to represent those perspectives to serve our customers better.

So, here’s a concrete example I can give. There was a lot of pushback initially against implementing a kind of sampling protocol for OpenTelemetry. There was a lack of clarity about providing like a number for, you know, for sampling rate. What does it mean? Can we make it less ambiguous? And some folks were like, yes, we need sampling in OpenTelemetry before reaching the backend. And other folks were like, our backend can handle it. We’ll handle all the sampling on our end, like, you know, please don’t add sampling rate. So we had a constructive discussion about it. And we’re now kind of finalizing a lot of these recommendations in the sampling working group.
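To show why the meaning of a sampling-rate number matters, here is a sketch under one assumed convention: a kept event with sample rate N stands in for N original events. This is an assumption made for illustration, not the wording of the OpenTelemetry specification or the working group's recommendation, and the field names are hypothetical.

```python
def estimated_total(events: list[dict]) -> int:
    """Estimate the original event count by weighting each kept event."""
    return sum(e["sample_rate"] for e in events)


def estimated_error_rate(events: list[dict]) -> float:
    """Estimate the original error rate from differently sampled events."""
    errors = sum(e["sample_rate"] for e in events if e["error"])
    return errors / estimated_total(events)


# Assumed policy for the example: keep 1 in 100 successes, keep every error.
kept = (
    [{"error": False, "sample_rate": 100} for _ in range(10)]
    + [{"error": True, "sample_rate": 1} for _ in range(5)]
)

total = estimated_total(kept)           # 10 * 100 + 5 * 1 = 1005
error_rate = estimated_error_rate(kept)  # 5 / 1005
naive_rate = 5 / len(kept)               # 5 / 15, ignoring sample rates
print(total, round(error_rate, 4), round(naive_rate, 4))
```

If the backend did not know the rate each event was sampled at, it would see 5 errors in 15 events and report an error rate of about 33% instead of about 0.5%, which is why an unambiguous definition carried with the data is worth agreeing on.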

So I think that’s kind of an example where people from different perspectives and backgrounds came together to figure out what would best serve our customer needs.

Most crucially, we didn’t necessarily put it directly into the 1.0 release of the OpenTelemetry specification. We agreed we were going to kind of back-burner it to get the most critical stuff out so that people could immediately benefit from OpenTelemetry before we started layering complexity on it. So those are some of the technical tradeoffs that we make every day.

Viktor: Yeah, that’s a good point. At the end of the day, it’s not about tech, it’s about how we can communicate our ideas in a civilized way without involving egos or selfish interests and, it will sound cheesy, make the world a better place by having a civil conversation about certain things.

Liz: And I do genuinely believe in that - making the world a better place thing. I spent so many years of my career watching developers and systems and engineers struggling with bad systems or the wrong tools.

Those of us who work in the developer tools space recognize that there are so many developer lives that we can make better if we help them do things a better way, if we help them have less downtime, if we help them have less frustration.

Viktor: Yeah, exactly. And I would drink to that because I also feel the responsibility to help show the right ways and the right tools so people can build better software.

Developers are always serving business. In 2021, we cannot run the business successfully without developers because everything is digital right now, and everything is software. Software is everywhere. And it’s impossible to do anything without it.

You can sell lemonade, and probably you don’t need to do any software, but still, you need to file the taxes from your income, and this is where the software will help you…

Liz: Or other things like, you know, during the pandemic, Honeycomb actually signed one of our customers, which is H-E-B, a grocery chain in Texas. And part of the reason why they realized that software was so critical was in that moment when they shifted towards digital purchasing to have people check out online and have groceries delivered into the trunks of their cars. I think that’s a moment that highlighted that digital services are fundamental to our economy and people being able to eat.

Demo: Honeycomb Service-Level Objectives

In every Kongcast episode, we ask our guests to show a cool tech project they've worked on lately. Check it out in the video below:

Thanks for Joining Us!

I hope you'll join us again on October 18 for our next Kongcast episode with Viktor Farcic from Upbound called "Why Developers Should Manage EVERYTHING."

Until then, be sure to subscribe to Kongcast to get episodes sent to your inbox (and a chance to win cool SWAG)!