Work Smarter, Not Harder: Using ML to Observe Your Kuma API Metrics
Ricardo Ferreira, Elastic & Viktor Gamov, Kong
Observability is catching on these days as the de facto way to provide visibility into essential aspects of systems. It would be unwise not to leverage it with the Kuma service mesh, the place that allows your services to communicate with the rest of the world. However, many observability solutions restrict themselves to the basics: simple metric collection that feeds dashboards. Expecting users to simply sit in their chairs and look at those metrics all day long is an invitation to failure, as one can only do so much before getting tired and bored.
This talk will change the status quo and show how you can work smart by combining the flexibility of Kuma with the power of the Elastic Stack to ingest, store and analyze massive amounts of data. Join to learn how to collect metrics from Kuma via Prometheus, bring these metrics into Elasticsearch using Metricbeat and create machine learning jobs to look for anomalies that can alert you when something interesting happens.
Transcription

Speaker1: [00:00:04] OK. Welcome, welcome to Kong Summit 2021. We are super excited to talk with you today about some interesting things. My name is Viktor Gamov, developer advocate here at Kong, and today I have with me my friend Ricardo Ferreira as my co-presenter, and we do the virtual fist bump. All right. So it's a good time to check if you're in the right place, because right now we're going to be talking about service mesh, about observability, and about extending this observability through ML. The session is called Work Smarter, Not Harder. Before we jump right into it, let me draw the picture, let me do some exposition of what we're trying to do here. In the world we've lived in for the past couple of years, give or take, right, Ricardo? I think we've been talking about microservices a lot. We've been talking about microservices development, about developing with different frameworks, how we can deploy them, how they can communicate. But essentially we always forget about the things surrounding the microservices. Once you're done with your implementation, there are a lot of things you need to be doing on day two; those are what we call the day-two responsibilities. Everything starts easy: when you have one service, everything is nice and easy, you know what framework to choose, you're doing great, you put a lot of stuff there, you deploy it, you're happy. So the next thing is that you start adding more features.
Speaker1: [00:01:42] We communicate with those microservices through different APIs; it can be HTTP, it can be gRPC, it can be Kafka, what have you. Very soon you're very excited, and your microservices are becoming really big, really powerful, doing a lot of things. But all of a sudden, or rather not all of a sudden but over time, your microservices deployment is becoming a very dark place. So dark that, if you didn't put the proper tools in place for observing your services and looking inside at what's going on, you will very soon find yourself with a bunch of dead code in places, or dead services, or broken services that don't work the way you intended. And like the character played by Charlie Day in It's Always Sunny in Philadelphia, you're trying to find who's the culprit, trying to find the root cause of certain problems and things like that. With this session we'd like to share some of the ideas and some of the tools you can apply, so you will not be like Charlie with your microservices deployment. My name is Viktor, again; I'm a developer advocate here. You can find all the information about where I am and how to contact me. And today I'm sharing the stage with my friend, my friend Ricardo. Take it away; it's now on your side of things.
Speaker2: [00:03:17] Sure, thanks, Viktor. And yeah, welcome everybody to the session. I hope at this point you know that you're in the right place. For those of you who don't know me, my name is Ricardo Ferreira, and I'm also a developer advocate, but for this company called Elastic, and Viktor is a former colleague and an old friend. So it's a pleasure to be on the stage with him. So without further ado, let's get started.
Speaker1: [00:03:42] Now, go ahead. Yeah, usually people ask: what new things will you talk to us about, what new things will you explain, right?
Speaker2: [00:03:50] And that's, I think... wait.
Speaker1: [00:03:53] Right, exactly. I think this is a good place where you can talk a little bit about how we ended up here. So I drew the picture of how we started this microservices journey, or how many people started this microservices journey. And of course, I drew it like a real dark forest, which is not what many people will have, because people know that monitoring the system is important. That's why many people put many tools in place to do this monitoring.
Speaker2: [00:04:25] Exactly, yeah, exactly. And I think it's important to also highlight the fact that if you go back 20 or 30 years, we used to have a very singular approach to how to do monitoring. All the applications were basically a cluster of Apache HTTP servers, where we could easily access each instance and then tail the logs to see what's going on. But now we're living in an era where everything is highly distributed. The way we used to do monitoring used to be simple, right? We used to have just very simple metrics, like: let's just monitor the CPU utilization of the web servers or of the database, and everything's good. That's what was considered...
Speaker1: [00:05:08] Deploying Zabbix and a bunch of agents, and you have a dashboard where you see CPU and memory utilization.
Speaker2: [00:05:15] Exactly, yeah. But if you look closely, we are dealing with, just like I highlighted in the beginning, distributed architectures, where simply measuring the CPU utilization or memory or network or whatever hardware resources you want to measure is no longer enough, right? And the reason is that the architectures are now distributed. So, for example, in the SRE, Site Reliability Engineering, world, we have what we call the golden signals. There are four of them: saturation, traffic, error rate and latency. For example, how do you measure your saturation? It cannot be as simple as the CPU utilization of your web server, because the web server is now only part of your architecture. What about the Kuma service mesh? What about the actual microservices? What about the data source? You have to look at the whole thing now, and the challenge is that you have to change the way you're doing things. And that's the million-dollar question: what changed, right?
Speaker1: [00:06:15] Let me stop you here. When you were talking about the past and the way we did this in the past, that's why I brought up some of the technology we used to use for monitoring. It was much easier to operate those fleets because, first of all, you could count the number of servers; you always knew how many servers you had. I used to have names for my servers that came from the names of the planets in the Solar System. Not so many planets, maybe sometimes a few satellites. So in this case I knew exactly what was happening, because one server was for the database, another server was for the web server, and it was much easier to reason about things like, as you said, saturation. If I knew that the overall resource utilization on this particular machine was really high, it meant some of the processes were hogging those resources.
Speaker2: [00:07:11] There's a lot of inference there, right? We could infer that if the database had high CPU utilization, it was because the whole application was overloaded. But now, what even is the concept of "the whole application" if you have a distributed architecture? And Viktor, you took the words out of my mouth, because the way you described using the planets as the names of your servers is the phenomenon of what we call pets versus cattle. Pets versus cattle: back then we used to give names to servers, and we knew off the top of our heads each server in each cluster that we had in our environments. But now we are dealing with cattle; there's a bunch of containers and server instances popping up every second, and you don't know where they came from. Well, technically you know where they came from, but you don't know...
Speaker1: [00:08:02] They came from your cloud provider, because you swiped the card and now you have 10 more nodes, and you don't really care how they were provisioned; you just need them, right? Because they get these weird endings, some names like, I don't know, first Jupiter, then Jupiter-a, b, c, d, e, f, g, h, something like that.
Speaker2: [00:08:25] And there's a hash as a suffix.
Speaker1: [00:08:27] Right? Exactly, yes.
Speaker2: [00:08:30] Yeah. But one important thing to note is that there's one thing you should be worried about, which is billing, right? Just like Viktor mentioned, they're all cloud resources running in a cloud provider. And if you no longer have control over how many instances you have, guess what: at the end of the month you will find out how much you're actually going to spend on this. So control is a must in this new world. That's the first logical thing. I would also highlight the fact that the world is more connected than it used to be. I remember when I was starting my career in software engineering, I used to develop these websites, and, I've always been kind of a crappy software engineer, so my websites were never too highly available, and some of the time...
Speaker1: [00:09:17] I don't know if things have changed from the quality perspective; maybe what you meant is scrappy, as in you collected some money. You've probably improved since then, from crappy to excellent software development. But still, maybe scrappy would be the word, when you try to calculate how much money you want to spend here.
Speaker2: [00:09:38] That's true, that's true. So when I used to build those websites, and when they were down, my users wouldn't be extremely unhappy about it; they would tolerate it. But think about it: we're living in a time where users no longer tolerate downtime, and they no longer tolerate high latencies. So we have to think differently in terms of architecture. Developers are building microservices, the architectures are highly distributed, and now we have a problem at hand, which is the problem of properly monitoring and observing this whole thing, right? What do you think, Viktor?
Speaker1: [00:10:16] Yeah, so the problem with all these things: yes, you can enable all possible probes, you can deploy all possible agents. However, there's one thing I learned during my time in the consultancy business: if you're not measuring, you're not controlling. But if you're measuring past a certain threshold and you don't know what to do with the measurements, it's useless. You can put all sorts of dashboards around your office, you can show nice graphs, you can report all these metrics back to whomever. People will be happy to look at them, but you don't have control, because there are no actions coming out of it. Those metrics, those observability pillars, need to feed back into the system. It's not just about showing metrics; with proper observability we can take actions to improve things or do something differently with our system. So I think we need to approach this slightly differently, from the perspective of: stop just collecting data. It's important to understand it, sure; however, let's do something useful with this data.
Speaker2: [00:11:24] Yeah, I would say stop just collecting data, right? Because people used to say that observability is all about the three pillars: traces, metrics and logs. Each of them has value, of course: metrics are good for giving you insight into what's going on, traces can point out where the problem is, and logs actually explain what the problem is. But at the end of the day we need a consolidated view of everything. So no longer think about pillars, individual and siloed pillars, but about pipes, where all of them work as streams that get stored in a single location, where we can do much more analysis and get more insight. And because all of it is correlated, you can have a much more precise perception of what's going on.
Speaker1: [00:12:13] So exactly. We're going to be talking about specific tools today, about specific strategies, a stack of technologies, right? Obviously, since we have Ricardo here, we're going to be talking a bit about Elastic. And obviously I'm going to talk about Kuma and what kind of tools are available there, so you as a developer can make not a wild guess but an educated guess. You will be aware of all the tools and what those tools can do for you.
Speaker2: [00:12:43] Yep, exactly. So let's start with this: why don't you show everybody how you can actually enable metric collection from the Kuma service mesh? Let's start from there. Yeah.
Speaker1: [00:12:55] So we're going to start with, well, "metrics" is maybe not the 100 percent correct word here, because we're going to show how to enable all the possible observability pillars. We're going to start with logs, then go to metrics, and after that we're also going to talk about traces. For this, let me quickly switch to my screen and show you this quick manifest. When I enable this service mesh in my cluster, I can do multiple things when I define a spec for my service mesh. I can enable this logging section, and within the logging section I can pick a few different solutions. In this particular case we're going to be using Logstash, and Ricardo will explain why later. For the logs we can specify a format for the output. So there are different approaches to, what do we call it, preparing, or massaging, the data before it's consumed by the target system. We want to prepare it in a format that will be much easier to digest and won't require any further work. Once I apply this spec, logs will automatically flow from all your microservices, from all the services you have inside this mesh, which is very neat and very nice to have. And once we do this, we move into Elastic, and this is where the step we did before, formatting the data the right way, will be taken into account.
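For reference, here's a minimal sketch of what such a mesh spec with a TCP logging backend can look like. The backend name, the Logstash address, and the format string are illustrative placeholders, so check the Kuma documentation for the exact schema:

```yaml
apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  logging:
    # Backend used when a logging policy doesn't name one explicitly.
    defaultBackend: logstash
    backends:
      - name: logstash
        type: tcp
        conf:
          # Hypothetical Logstash endpoint reachable from the data planes.
          address: logstash.observability.svc.cluster.local:5000
        # Emit JSON so downstream systems can ingest it without extra parsing.
        format: '{"source":"%KUMA_SOURCE_SERVICE%","destination":"%KUMA_DESTINATION_SERVICE%","status":"%RESPONSE_CODE%"}'
```

On Kubernetes, a TrafficLog policy selecting the services whose traffic should be logged is typically what actually turns the logging on.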
Speaker2: [00:14:49] Let's discuss that, right? Viktor just showed how you can enable this telemetry data, which is metrics in this case, and expose it from your Kuma service mesh. Now, how can we bring this data into Elastic? But the first question you should ask yourself is: why Elastic, right? Elastic has these machine learning capabilities built in that you can leverage to do the analysis for you, and this is exactly what we're trying to accomplish here. So think about the objective of this, let's call it, ingestion layer: to ultimately get the data into Elastic. That's your objective. How can you do this? Well, luckily we have options, and I'd like to bring up this slide that shows the In-N-Out menu, because Viktor and I are very much in love with In-N-Out.
Speaker1: [00:15:37] Yes, since it's Kong Summit, even though it's virtual, we're trying to bring some of the West Coast vibe to you, and I hope you enjoy it. In-N-Out is our favorite place to go, and the In-N-Out menu is rather simple. However, it also has a kind of secret, but not very secret, menu for more advanced users. So that's why we want to present our own menu of how you can do things in the observability world. That's the analogy here; plus, it's always good to get a few Flying Dutchmans.
Speaker2: [00:16:17] Exactly. So it's always good to have options. The first option I would like to discuss is Metricbeat. Metricbeat is a component of the Elastic Stack, and basically what Metricbeat does is periodically scrape the Prometheus endpoint that has, at this point, been enabled by the Kuma service mesh; then Metricbeat can read the data and store it into Elastic. But the most important part is that it doesn't just store the data, it also formats it into a shape that enables data analysis. It's not just a huge string that gets stored; there will be fields, columns, data types, everything we need to actually start working with the data. Metricbeat is interesting because you can deploy it on bare metal, Kubernetes or Docker, and you can quickly spin up one instance. You should definitely explore Metricbeat as a sidecar for your microservices. And the most important thing Metricbeat also does is provide options for you to handle the load.
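To make this concrete, a minimal Metricbeat configuration for scraping a Prometheus endpoint could look roughly like the following. The hostnames, port, and credentials are placeholders (Kuma data planes expose Prometheus metrics on port 5670 by default, but verify against your mesh configuration):

```yaml
# metricbeat.yml -- a sketch, not a production config
metricbeat.modules:
  - module: prometheus
    metricsets: ["collector"]
    period: 10s
    # Placeholder address of a Kuma data plane's Prometheus endpoint.
    hosts: ["kuma-dataplane.example.internal:5670"]
    metrics_path: /metrics

output.elasticsearch:
  # Placeholder Elasticsearch endpoint and credentials.
  hosts: ["https://elasticsearch.example.internal:9200"]
  username: "elastic"
  password: "${ES_PASSWORD}"
```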
Speaker1: [00:17:24] OK, we've got Metricbeat. Are there other options available for our listeners, watchers and users?
Speaker2: [00:17:32] Yes, there is another one, which is the Collector from the OpenTelemetry project. What the Collector does is pretty much the same thing Metricbeat does, which is scraping the Prometheus endpoint exposed by the Kuma service mesh. The only difference is that the Collector will send this data to Elastic in a format called OTLP, which is the native protocol that OpenTelemetry supports, and Elastic is compatible with this protocol. Just like Metricbeat, it can be deployed as a sidecar, whether on Kubernetes, Docker or bare metal, and there's a bunch of options you can tune to keep up with the load, so it's highly configurable. It's actually a very convenient option if you want to adopt an open standard, because at the end of the day the Kuma service mesh is itself built on open standards, based on the Envoy proxy. So the combination of the Collector, OpenTelemetry and Kuma is, I think, a good deal.
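As a rough sketch, an OpenTelemetry Collector pipeline for this setup might be configured like this. The scrape target and the OTLP endpoint are assumptions; here the metrics are exported over OTLP to an Elastic APM intake, which is one way Elastic accepts the protocol:

```yaml
# otel-collector-config.yaml -- illustrative endpoints only
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "kuma-dataplanes"
          scrape_interval: 15s
          static_configs:
            # Placeholder Kuma data plane metrics endpoint.
            - targets: ["kuma-dataplane.example.internal:5670"]

exporters:
  otlp:
    # Placeholder Elastic intake that speaks OTLP (e.g. APM Server).
    endpoint: "apm-server.example.internal:8200"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```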
Speaker1: [00:18:34] So I do have a question here. I've been in this world of distributed systems, and in general of systems that communicate with each other, for quite some time, and there are usually two patterns for how you can get something from another system: either the system pushes the data to you, or you pull the information. As far as I understand from what you just explained, there's a bit of a difference in how the OpenTelemetry Collector and Metricbeat operate with the Prometheus endpoint.
Speaker2: [00:19:10] Correct, yes. Both of them can do the pulling part just like you described, but only Metricbeat is actually capable of doing the pushing. What I mean by pushing has to do with a feature called remote write, where Metricbeat can expose an endpoint, and then you can configure your deployment on the Kuma service mesh to point to that endpoint. The mesh will then push all those metrics down to that endpoint. For those of you who know the difference between pulling and pushing, you know that pushing is a little more efficient in terms of data transmission and doesn't consume a lot of hardware resources on your machines, so it's preferable. That's a significant difference between the Collector and Metricbeat.
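A sketch of the push model: Metricbeat's prometheus module has a remote_write metricset that listens for Prometheus remote-write traffic, and the sending side is then pointed at that listener. Hostnames and the port below are placeholders; the /write path is the module's documented default, but confirm against the Metricbeat docs:

```yaml
# metricbeat.yml -- receiving side (sketch)
metricbeat.modules:
  - module: prometheus
    metricsets: ["remote_write"]
    host: "0.0.0.0"
    port: "9201"
```

```yaml
# prometheus.yml -- sending side (sketch)
remote_write:
  - url: "http://metricbeat.example.internal:9201/write"
```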
Speaker1: [00:19:56] So, OK, we've now enabled this on Kuma, and we push this data through these adapters, either Metricbeat or the OpenTelemetry Collector. Is it time to talk about the ML?
Speaker2: [00:20:09] I think that's the perfect time, Viktor, thank you for bringing it in. So let's talk very quickly about how you can actually leverage these machine learning features from Elastic in this world of metrics and the Kuma service mesh. The first step has to be enabling machine learning in your Elastic cluster. Elastic Observability sits on top of Elasticsearch, and as you may know, Elasticsearch supports machine learning built in. So what you have to do is just enable it, and try to beef up the machine learning nodes a little bit for compute and memory workloads, because at the end of the day that's what machine learning is. Imagine you're going to operate on a large data set spanning three or six months; the nodes are going to spread all of that across a distributed system, and if you don't have very powerful machines, you're probably going to run into out-of-memory errors and things like that. So you do want to beef up those machines. Secondly, you might need to do some data normalization, because here's the thing: the data coming from the Kuma service mesh in the Prometheus format might not be enough for your machine learning analysis.
Speaker1: [00:21:18] Or some system administrators won't care enough about using the exact format, and different systems report different formats. This is what we data engineers would call massaging the data, right? We need to prepare this data to be consumed in a proper format. Let us know if you have an interesting war story about receiving some raw data and how many hoops you had to jump through to prepare it for ingestion into a machine learning model.
Speaker2: [00:21:55] Exactly, yeah. And if you're watching us, put in the chat some war stories you might have had with this. So in the Elastic world, what you can do to do this massaging, as Viktor calls it, is use something called a transform. A transform operates pretty much like a stream processing layer: once enabled, it receives the data streams coming from the service mesh, processes them, and materializes an entity that can be used for machine learning purposes. So transforms are the way to go for doing this data normalization. Then, before actually enabling the machine learning algorithms, you'll want to play with the data: use the Data Visualizer tool to read the indices and visualize, as the name implies, all your data sets, and make sure you're pursuing the type of analysis you want to achieve. Ultimately, what you want to do is enable the actual algorithms. The good thing is there's a bunch of algorithms you can pick from. First of all, and probably the most famous one, is outlier detection. You remember those days when you kept staring at monitors and dashboards the whole day looking for some...
Speaker1: [00:23:08] Spikes, you know, jumps in the memory and things like that.
Speaker2: [00:23:12] So this is now going to be done by the machine learning jobs; they will do this job for you. They can also classify and run regression on those anomalies and put them into boxes, so you can classify your anomalies, because they might be different. And here's the interesting part: now you can use machine learning features to actually observe the observer. So instead of sitting in a chair staring at dashboards the whole day, you would actually look at the results of the analysis the machines did for you. And by "machines", I know I'm sounding like the Terminator from Skynet, and that might sound a little scary, but this feature is not scary at all. So with that, I'm going to ask Viktor: is metrics the only thing we can enable in the Kuma service mesh, or is there something else we could leverage in this world of machines analyzing data?
Speaker1: [00:24:13] As I mentioned a few minutes ago, we do have logging here. However, we can also enable some other things, and one of those things is tracing. In order to analyze interactions between the services and interactions between different systems, we can enable tracing to capture those. So if one service is calling another service, you want to know which one, and we can enable this fairly quickly by enabling tracing inside the spec for the mesh. Next, we can also enable all the metrics that come from Envoy into our mesh, so we can capture those as well and save them in Prometheus. So now we have metrics, we have logs, and we have traces. Will these help us do something nice with our data? How will this help in the long run?
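For reference, enabling metrics and tracing in the mesh spec looks roughly like the sketch below. Backend names and the collector URL are placeholders; Kuma emits traces over the Zipkin protocol, which Jaeger can also ingest:

```yaml
apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  metrics:
    # Expose Envoy/data plane metrics in Prometheus format.
    enabledBackend: prometheus-1
    backends:
      - name: prometheus-1
        type: prometheus
  tracing:
    defaultBackend: jaeger-1
    backends:
      - name: jaeger-1
        # Kuma's tracing backend speaks the Zipkin protocol.
        type: zipkin
        conf:
          # Placeholder Zipkin-compatible collector endpoint.
          url: http://jaeger-collector.observability:9411/api/v2/spans
```

As with logging, a TrafficTrace policy selecting the services to trace is typically what actually turns tracing on.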
Speaker2: [00:25:23] Exactly, it will definitely help, because remember what we discussed in the beginning: forget about the pillars of observability. You have to treat your logs, traces and metrics as just types of streams that come in through different pipes and get consolidated in a central location, which is going to be Elastic Observability. And because they are now integrated and correlated, and they can talk to each other, the analysis the algorithms can do is going to be more precise. For example, if you have a trace carrying information about which transactions are actually being impacted by some downtime or something like that, now you can easily pinpoint the root cause: where the problem is happening and what the problem is, right?
Speaker1: [00:26:07] Yeah. Some of the data is already augmented with trace IDs. We can extract those from logs, or we can jump right into a specific log statement. So that's kind of neat.
Speaker2: [00:26:19] Yeah. So all you have to do, just like we did when Viktor enabled metrics in the service mesh and we brought those metrics into Elastic Observability, now we have to do the same for traces and logs. So let's talk about traces. What can you do for traces? You can still use the OpenTelemetry Collector, which supports what we call receivers, which are endpoints exposed by the Collector. Remember, Viktor showed you that for sending the traces you have to send them to a place that supports the Jaeger protocol; so enable that protocol on the Collector, and it will receive those traces and store them in Elastic Observability. Job done, very painless. And as I mentioned before, the OpenTelemetry Collector is highly configurable and can be deployed as a sidecar.
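Staying with the same Collector, a traces pipeline with a Jaeger receiver might be sketched like this. The ports follow common Jaeger defaults, and the OTLP endpoint is again a placeholder:

```yaml
# otel-collector-config.yaml -- traces pipeline sketch
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: "0.0.0.0:14250"
      thrift_http:
        endpoint: "0.0.0.0:14268"

exporters:
  otlp:
    # Placeholder Elastic OTLP-compatible intake.
    endpoint: "apm-server.example.internal:8200"

service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [otlp]
```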
For logs, Viktor explained that in the Kuma service mesh you can apply a policy, and once enabled, that policy will redirect the log traffic to an endpoint exposed on a TCP port published by Logstash. So what you need to do is spin up one or multiple instances of Logstash listening over TCP, associated with the policy that Viktor enabled in the Kuma service mesh. Then what Logstash is going to do is receive those logs, store them in Elastic Observability, and correlate them with all the other telemetry data, namely traces and metrics. So that's how you're going to leverage all of this. And I know that everything we've discussed so far is exciting, but you might be wondering: wow, I would like to see this. So I'm going to ask my friend Viktor: what can we do about this, Viktor?
Speaker1: [00:28:01] Since at this conference we have a limited format for delivering sessions, we thought this would be a good opportunity to start the conversation and learn a little bit about the use cases you have. The exciting thing is that we have some other mediums where we can connect with our community, and one of the mediums I'm actively working on here at Kong is the live stream series called Kong Builders. We put it on hold for a while while we were working on Kong Summit, but we will resume it the week after Kong Summit, and in one of the episodes we're working on, Ricardo and I want to run a more relaxed session where we can hack around. But we need your input; we need to learn what you want to see, because this content is created by builders, for builders. And that's where we're going to show you some of the cool bits we just described. If you have any other questions or ideas, don't hesitate to reach out to us. You can find our contact details in the slides of this session or on the Kong Summit page. I hope you have a very productive Kong Summit. I really want to say thank you to Ricardo for joining me for this session.
Speaker2: [00:29:26] The pleasure's all mine, Viktor; it was a pleasure to share the stage with you. If you have any questions, we will be backstage answering questions online. If not, just enjoy the rest of the conference. And Viktor, as always...
Speaker1: [00:29:42] Have a nice day.