Monitoring and distributed tracing enable observability of your production system and contribute to the feedback loop for a more efficient and effective development process.
Monitoring the health of your production system involves keeping track of various data points in real time in order to derive insights from them. Day to day, monitoring can provide early indications of problems, giving the team time to investigate and fix them before the system fails completely. If you’re running tests in production, such as canary releases or blue-green deployments, a monitoring tool is essential for measuring the impact of the release so you can decide whether to roll back or roll out.
What Should You Monitor?
- If your services are hosted by a third-party cloud provider, monitoring system metrics such as disk space, I/O, CPU and memory usage may seem unnecessary. However, these data points are a good indicator of the health of your services and observing them just after changes have been released can tell you a lot about how your code performs in the wild.
- Measuring how long it takes for a request from a client to be processed, including the time taken by individual services involved in the request, gives you a performance baseline. An increase in latency not only degrades the experience for users but can also indicate an underlying issue.
- Monitoring the types of error that are being generated by the system can indicate problems in the behavior of specific services or in the way clients are interacting with the system. While distributed tracing can be used to track down the former, the latter may indicate an issue with the public API or its documentation.
- It’s good practice to include business KPIs in your observed metrics. While these may seem too high-level to be of any use to developers, changes from normal patterns can signal an underlying failure. A decrease in the number of purchases, signups or data views may be due to a problem with functionality or performance that is preventing or deterring users from using your system.
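To make the latency and error measurements above concrete, here is a minimal sketch of an in-memory recorder that keeps per-service latencies and status codes and derives a percentile and an error rate from them. The `MetricsRecorder` class, its method names and the "orders" service are hypothetical names chosen for illustration; a real deployment would export these data points to a monitoring backend rather than hold them in memory.

```python
import math
from collections import defaultdict


class MetricsRecorder:
    """Hypothetical in-memory recorder, for illustration only."""

    def __init__(self):
        # service name -> list of observed latencies in milliseconds
        self.latencies_ms = defaultdict(list)
        # service name -> HTTP status code -> number of responses
        self.status_counts = defaultdict(lambda: defaultdict(int))

    def record(self, service, latency_ms, status):
        """Store one observed request for a service."""
        self.latencies_ms[service].append(latency_ms)
        self.status_counts[service][status] += 1

    def percentile(self, service, pct):
        """Nearest-rank percentile of the recorded latencies, or None if empty."""
        values = sorted(self.latencies_ms[service])
        if not values:
            return None
        rank = max(1, math.ceil(pct / 100 * len(values)))
        return values[rank - 1]

    def error_rate(self, service, family):
        """Fraction of responses in a status family (4 for 4xx, 5 for 5xx)."""
        counts = self.status_counts[service]
        total = sum(counts.values())
        if total == 0:
            return 0.0
        errors = sum(n for status, n in counts.items() if status // 100 == family)
        return errors / total


recorder = MetricsRecorder()
for latency, status in [(120, 200), (95, 200), (310, 500), (101, 404)]:
    recorder.record("orders", latency, status)

print(recorder.percentile("orders", 95))  # slowest tail of the sample
print(recorder.error_rate("orders", 5))   # share of server-side errors
print(recorder.error_rate("orders", 4))   # share of client-side errors
```

Separating 4xx from 5xx rates mirrors the distinction drawn above: server-side errors point at a service, while a rise in client-side errors may point at the public API or its documentation.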
Monitoring Microservices from the Gateway
Most monitoring solutions provide a dashboard so you can see the state of your system and configurable alerts to notify you when values pass a specified threshold. When setting alerts, keep in mind the signal-to-noise ratio; if a threshold is set too low, alerts fire so often that teams soon learn to ignore them and miss the important ones. Over time, you may get to know the signs that mean something is likely to fail, and you can adjust the thresholds accordingly to give you time to take pre-emptive action.
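One way to keep the signal-to-noise ratio manageable is to notify only after a threshold has been breached several times in a row, so that a single spike does not page anyone. The sketch below shows that idea; it is one possible approach rather than how any particular monitoring tool works, and the `ThresholdAlert` class, the 500 ms threshold and the sample values are all assumptions made for illustration.

```python
class ThresholdAlert:
    """Hypothetical alert that fires after N consecutive breaches."""

    def __init__(self, threshold, required_breaches=3):
        self.threshold = threshold
        self.required_breaches = required_breaches
        self._streak = 0  # number of consecutive samples above the threshold

    def observe(self, value):
        """Record one sample; return True if a notification should be sent."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        # Notify once, at the moment the streak reaches the required length.
        return self._streak == self.required_breaches


alert = ThresholdAlert(threshold=500, required_breaches=3)
samples_ms = [620, 300, 510, 530, 700, 720]
fired = [alert.observe(sample) for sample in samples_ms]
print(fired)  # only the third consecutive breach triggers a notification
```

Both the threshold and the required number of breaches are values to tune as you learn which patterns precede a real failure.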
In a microservices system, traffic is typically funneled through an API gateway which provides access to the upstream services. This makes the gateway a good vantage point for integrating a monitoring solution. Kong’s API gateway supports a number of plugins for monitoring the health of your system with options to store the data yourself or send it to a hosted service. Kong Enterprise includes Kong Vitals to monitor metrics for upstream microservices, such as request counts and status codes.
Distributed Microservices Tracing
While monitoring collects metrics and aggregates data over time to show trends, distributed tracing focuses on a single operation. Although originally developed as a way of measuring and improving performance, distributed tracing is also valuable when debugging in a microservice-based system.
In a monolithic application, developers can use a stack trace to understand the context of an error. With microservices, a single request can spawn multiple requests to individual services hosted on different machines. If a failure occurs at some point in the system, every interaction between that service and other parts of the system is a possible cause, and looking at the individual log files won’t give you the full context. Likewise, an increase in latency could be attributable to any one component or to several. In order to debug the issue, you need to be able to piece together the chain of requests that led to it. Distributed tracing makes this possible.
With distributed tracing, an identifier is assigned to every request coming into the system and propagated to each request sent to the individual microservices as well as any child requests that they generate. To provide a complete trace, the identifier should also be included in the response sent to the client. This means that if a user reports an error, the identifier in the response they received can be used to reconstruct the entire transaction, locate the relevant logs and identify the root cause.
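The propagation described above can be sketched with plain dictionaries standing in for HTTP headers. The header name `X-Trace-Id` and the three helper functions are hypothetical; real tracing systems define their own header formats, such as Zipkin's B3 headers or the W3C `traceparent` header, and usually handle propagation through a library or the gateway.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, for illustration only


def ensure_trace_id(headers):
    """Return a copy of the headers with a trace ID, generating one if absent."""
    out = dict(headers)
    if not out.get(TRACE_HEADER):
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out


def child_request_headers(incoming_headers, extra=None):
    """Build headers for a downstream call that carry the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = incoming_headers[TRACE_HEADER]
    return headers


def response_headers(incoming_headers):
    """Echo the trace ID back to the client so it can be quoted in a report."""
    return {TRACE_HEADER: incoming_headers[TRACE_HEADER]}


# At the edge: assign an ID to a request that arrived without one.
incoming = ensure_trace_id({"Accept": "application/json"})
# Inside a service: reuse that ID on a call to another service.
downstream = child_request_headers(incoming, {"Content-Type": "application/json"})
# On the way out: include the ID in the response.
response = response_headers(incoming)

print(incoming[TRACE_HEADER] == downstream[TRACE_HEADER] == response[TRACE_HEADER])
```

Because every hop reuses the ID instead of generating a new one, log lines and timing data from different services can later be joined on that single value.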
It’s best to implement distributed tracing early on, as it requires instrumenting every service in your system. Missing a span (i.e., any one of the individual segments in a transaction) will create a blind spot in your trace, which can make it harder to get to the bottom of a problem. With Kong’s API gateway, you can enable distributed tracing with the Zipkin plugin, which adds identifiers to requests and reports data to a Zipkin server.
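To illustrate what a blind spot looks like in the collected data, the sketch below models spans as simple records that point at their parent, then looks for spans whose parent was never reported. The `Span` fields, the `find_orphans` helper and the service names are assumptions made for this example and do not reflect Zipkin's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    service: str
    duration_ms: float


def find_orphans(spans):
    """Return spans whose parent span is missing from the collected data."""
    known_ids = {span.span_id for span in spans}
    return [
        span for span in spans
        if span.parent_id is not None and span.parent_id not in known_ids
    ]


spans = [
    Span("t1", "a", None, "gateway", 180.0),
    Span("t1", "b", "a", "orders", 150.0),
    # Span "c" (an uninstrumented "payments" service) was never reported,
    # so the "ledger" span below points at a parent nobody recorded.
    Span("t1", "d", "c", "ledger", 40.0),
]

orphans = find_orphans(spans)
print([span.service for span in orphans])
```

An orphaned span shows that work happened somewhere between two instrumented services, but not where the time went or what failed in between.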
Populating the Pyramid
One of the advantages of microservices over a monolithic architecture is the ability to deploy changes quickly. By identifying issues in production proactively and being able to debug the cause of the problem quickly, you can react and roll out fixes (or roll back changes) promptly and minimize the financial or reputational damage from a failure. Having identified the cause of the problem and deployed a fix, the next step is to add an automated test to prevent a similar issue occurring in the future. Creating a system of continuous feedback and improvement helps to build a more robust product and increases confidence in your system, without slowing down the development and deployment process. That in turn means that teams can continue to innovate and respond to evolving user needs, ensuring your product remains valuable to your users.