# Building Smart O11y for Kuma With Elastic Observability
Viktor Gamov
*This blog was co-created by Ricardo Ferreira (Elastic) and Viktor Gamov (Kong).*
We love our microservices, but without a proper observability (O11y) strategy, they can quickly become cold, dark places cluttered with broken or unknown features. O11y is one of those technologies born out of necessity: it exists only because other technologies pushed for it. There would be no need for O11y if, for example, our technologies hadn't grown so complex over the years.
It's difficult to pinpoint the root cause of an issue when you have to chase it across different machines, VMs and containers and look through code written in different programming languages.
## **Monitoring Systems: Past and Present**
Twenty years ago, monitoring was a much simpler affair. Web applications consisted of a cluster of HTTP servers, and we could easily SSH into them and look at the logs to search for problems. Now, we live in an era of highly distributed systems, where we can't even say off the top of our heads how many servers make up our clusters.
This is like the pet-versus-cattle analogy. It's one thing to deal with many pets that you know by name and whose behavior you understand. You know what to expect from them. But coping with cattle is different. You don't know their names because new ones might pop up every second, and you don't know their behavior. With each of them, it is like traveling in unknown territory.
## **What Do We Need for Modern O11y?**
First of all, stop *collecting data* and start *building insight* for your business. When it comes to data, O11y breaks it down into the following known types:
- **Metrics**: Give us insight into what’s going on
- **Tracing**: Points out where problems are
- **Logs**: Explain what the issues are
The most important aspect of a modern O11y strategy is having all this data, known as telemetry data, stored and consolidated in one single platform capable of harnessing its power by applying correlation and causation. Instead of treating each type of telemetry data as a pillar, as most people do, you have to treat them as pipes that ingest data into a place that will make sense of it. [Elastic Observability](https://www.elastic.co/observability) is one of the platforms capable of this.
## **Elastic Observability and Kuma Service Mesh**
This blog post dives into the specific tasks you have to implement to send data from Kuma to Elastic Observability so that its machine learning features can analyze it. This will allow you to accomplish two different objectives.
First, it allows you to eliminate the chaos created by other monitoring systems. By default, Kuma sends metrics to Prometheus/Grafana, traces to Jaeger and logs to Logstash. You can replace all this with Elastic Observability. This simplifies the architecture, reduces the amount of plumbing and reduces the operational footprint required to maintain Kuma.
Second, once all this data is sitting in Elastic Observability, users can use its built-in dashboards and applications to analyze data any time they want. But they can also leverage the platform's support for machine learning to deploy jobs that continuously crunch the numbers for them, leaving them free to focus on the connectivity aspects of Kuma.

    apiVersion: kuma.io/v1alpha1
    kind: Mesh
    metadata:
      name: default
    spec:
      logging:
        # TrafficLog policies may leave the `backend` field undefined.
        # In that case the logs will be forwarded into the `defaultBackend` of that Mesh.
        defaultBackend: file
        # List of logging backends that can be referred to by name
        # from TrafficLog policies of that Mesh.
        backends:
          - name: logstash
            # Use `format` field to adjust the access log format to your use case.
            format: '{"start_time":"%START_TIME%","source":"%KUMA_SOURCE_SERVICE%","destination":"%KUMA_DESTINATION_SERVICE%","source_address":"%KUMA_SOURCE_ADDRESS_WITHOUT_PORT%","destination_address":"%UPSTREAM_HOST%","duration_millis":"%DURATION%","bytes_received":"%BYTES_RECEIVED%","bytes_sent":"%BYTES_SENT%"}'
            type: tcp
            # Use `conf` field to configure a TCP logging backend.
            conf:
              # Address of a log collector.
              address: 127.0.0.1:5000
          - name: file
            type: file
            # Use `conf` field to configure a file-based logging backend.
            conf:
              path: /tmp/access.log
            # When `format` field is omitted, the default access log format will be used.

This configuration prepares the logs to be sent to a TCP endpoint running on localhost over port 5000. So we can use [Elastic Stack](https://www.elastic.co/elastic-stack) to spin up one or more Logstash instances that expose the same endpoint. Logstash, in turn, would be responsible for ingesting those logs into Elastic Observability.
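A logging backend only receives traffic once a `TrafficLog` policy refers to it. Here is a minimal sketch, assuming Kubernetes mode and the `logstash` backend defined above. The policy name `all-traffic` is made up for illustration, and the exact selector tag may differ between Kuma versions:

```yaml
apiVersion: kuma.io/v1alpha1
kind: TrafficLog
mesh: default
metadata:
  name: all-traffic   # hypothetical policy name
spec:
  # Match traffic from every service to every service in the mesh.
  sources:
    - match:
        kuma.io/service: '*'
  destinations:
    - match:
        kuma.io/service: '*'
  conf:
    # Must match a backend name declared in the Mesh resource.
    backend: logstash
```

Leaving out `conf.backend` would send the logs to the Mesh's `defaultBackend` instead, which is `file` in the configuration above.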
Metricbeat also provides options for handling the load and avoiding scraping completely. Instead, we can have Kuma send the metrics to an endpoint exposed by Metricbeat, which leverages a feature from Prometheus called [remote_write](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write). This option is an attractive way to scale because it is easier and faster to scale up the metric collection layer than Kuma.
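The push-based setup could look roughly like the sketch below. It assumes a Prometheus instance that already collects Kuma's metrics and forwards them with `remote_write`, plus a Metricbeat instance on the same host. The port `9201` and the Elasticsearch address are placeholders, and the two parts belong in separate files:

```yaml
# prometheus.yml: forward every collected sample to Metricbeat.
remote_write:
  - url: "http://localhost:9201/write"
---
# metricbeat.yml: listen for remote_write requests and ship them to Elasticsearch.
metricbeat.modules:
  - module: prometheus
    metricsets: ["remote_write"]
    host: "localhost"
    port: "9201"

output.elasticsearch:
  hosts: ["http://localhost:9200"]   # placeholder address
```

With this layout, handling more load means adding Metricbeat listeners rather than touching the mesh itself.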
Like Metricbeat, OpenTelemetry can be deployed as a sidecar, whether on Kubernetes, Docker or bare metal. It is highly configurable, so we can tune many options to keep up with the load. It's a great option for those who want to stick with an open standard.
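As a rough sketch of the sidecar approach, an OpenTelemetry collector could scrape a data plane's Prometheus endpoint and forward the result over OTLP. The job name, scrape target, Elastic endpoint and token below are all assumptions for illustration and would need to match the actual deployment:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kuma-dataplane            # hypothetical job name
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:5670"]     # assumed data plane metrics port

processors:
  # Group data points into batches before export.
  batch:

exporters:
  otlp:
    endpoint: "apm-server.example.com:8200"   # placeholder Elastic endpoint
    headers:
      Authorization: "Bearer YOUR_SECRET_TOKEN"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
```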
It is essential to know that OpenTelemetry does not yet support the remote_write feature, so if we're looking for a solution that allows us to push metrics to Elastic Observability and handles the load as well, Metricbeat is a far better option. This might change in the future since the OpenTelemetry project is rapidly evolving and catching up with the observability space.
## **Enabling Machine Learning in Elastic Observability**
How can we leverage machine learning features in Elastic for Kuma service mesh?
One of the cool things about the ML support in Elastic Observability is its built-in machine learning algorithms. There are many of them, ranging from simple classifications to complex regressions, so it is easy to get confused about which one to pick. We can use the [data visualizer](https://www.elastic.co/guide/en/machine-learning/current/ml-gs-visualizer.html) tool to load a dataset sample and see how each algorithm transforms and analyzes our data. This is handy because it can happen before we deploy the ML jobs.
### ***Step 4: What Type of Job Do We Want?***
Ultimately, we want to enable the actual algorithms, like [outlier detection](https://www.elastic.co/guide/en/machine-learning/current/dfa-outlier-detection.html). The ML jobs do this work for us: they can classify the anomalies and run regressions on them, sorting them into buckets, since not all anomalies are alike. Our analysis becomes much easier if we enable algorithms that sort their results into categories that are easy to look over.
### ***Step 5: Observing the Metrics Observer***
Finally, we can use the observer to observe our results. This means that we can configure Elastic Observability to look for specific patterns in our dataset, such as when the time spent in a request becomes higher than usual for the last hour. Elastic Observability can watch for this automatically. Instead of watching it all day, we can automatically trigger an email, an alert, a call on PagerDuty, or even a message on a Slack channel with the on-call team. We call this [alerting](https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html).
## **Distributed Traces From Kuma to Elastic Observability**
We spoke before about how to enable logs and metrics on Kuma. But there is a third type of signal we can enable on Kuma that helps with our O11y strategy: tracing. Traces help us analyze the interactions between the services and different systems that communicate through Kuma. We can quickly enable tracing in the Mesh spec.
In this case, we are enabling the trace collection via the Jaeger protocol and sending those traces to an HTTP endpoint. We also enabled the sampling strategy and configured it to collect 100% of the traces. That means that for each interaction within Kuma, a trace will be collected and emitted. Eventually, this might get a bit chatty, depending on the number of data planes deployed.
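A Mesh spec with tracing enabled might look roughly like the sketch below. It rests on a few assumptions: the backend name and URL are placeholders, and, as far as we know, Kuma declares HTTP trace backends with `type: zipkin`, a span format that Jaeger collectors also accept. A `TrafficTrace` policy is typically also needed to select which data planes emit traces.

```yaml
apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  tracing:
    defaultBackend: jaeger-collector
    backends:
      - name: jaeger-collector   # hypothetical backend name
        # Assumption: HTTP trace backends use the `zipkin` type.
        type: zipkin
        # Percentage of interactions to trace; 100.0 collects everything.
        sampling: 100.0
        conf:
          # Placeholder URL of the collector's HTTP endpoint.
          url: http://jaeger-collector.kuma-tracing:9411/api/v2/spans
```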
This is usually not a problem at the Kuma layer, but it may become one while transmitting this data to Elastic, as both networking and storage may become bottlenecks. Fortunately, we can solve this at the collector level, as we'll dive into next.
To send the traces from Kuma to Elastic Observability, we can use the OpenTelemetry collector. There, we can configure a Jaeger receiver and expose it on the exact host and port that Kuma is configured to send the data to. We will also configure an exporter to send the traces to Elastic Observability. Optionally, we could configure a processor that buffers or throttles the sending so that Elastic Observability receives the data at a pace it can handle. This is important if, for example, we set the sampling strategy in Kuma to 100% but don't want to send all the data to the backend.
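The collector pipeline described above can be sketched as follows. The listening port, sampling percentage, Elastic endpoint and token are illustrative assumptions. The `jaeger` receiver and `probabilistic_sampler` processor come from the collector's contrib distribution, so their availability depends on the collector version in use.

```yaml
receivers:
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268   # must match the host and port Kuma sends to

processors:
  # Keep roughly a quarter of the traces (illustrative value).
  probabilistic_sampler:
    sampling_percentage: 25
  # Group spans into batches before export to smooth out bursts.
  batch:

exporters:
  otlp:
    endpoint: "apm-server.example.com:8200"   # placeholder Elastic endpoint
    headers:
      Authorization: "Bearer YOUR_SECRET_TOKEN"

service:
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```

The receiver has to match the wire format the mesh actually emits. If that is Zipkin-format spans, the collector's `zipkin` receiver would be the counterpart.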
Once we integrate our traces, metrics and logs with Elastic, they can be correlated and talk to each other. The algorithms will be more precise. For example, if we know that a trace carries the information about impacted transactions, we can quickly pinpoint the root cause or where the problem occurs.
## **Conclusion**
Now we're ready to change the status quo. We can work smarter by combining the flexibility of Kuma with the power of Elastic Observability to ingest, store and analyze massive amounts of O11y data. We learned how to collect metrics from Kuma via Prometheus, bring these metrics into Elasticsearch using Metricbeat and create machine learning jobs that look for anomalies and alert us when something interesting happens.
Look out for a future [Kong Builders](https://konghq.com/kong-builders) episode on this topic, where we'll dive deeper into real use cases on how Kuma and Elastic can be used together to work smarter, not harder. Let us know what you'd like to see on Twitter @riferrei or @gAmUssA.