Tracing, Logging, Metrics: Unifying Observability with OpenTelemetry
Software development has always evolved with new paradigms to meet the growing demands of modern systems. One of the most significant shifts has been the adoption of microservices. Emerging in the early 2010s, this architectural pattern moved away from monolithic applications in favor of smaller, independent services that interact with each other over a network. With this shift came a need for new tools and frameworks at every stage of the development lifecycle — Docker for packaging services into containers, Kubernetes for orchestrating them, API gateways to manage ingress and egress traffic for cloud applications, and many more that have become industry norms.
One such framework that has quickly gained traction in the cloud-native world is OpenTelemetry. An open-source observability standard, OpenTelemetry is revolutionizing how developers approach monitoring, tracing, and logging across microservices architectures.
In this post, we'll dive into the purpose and components of OpenTelemetry, explore how it simplifies instrumentation, and highlight why it's becoming the de facto standard for observability in modern cloud-native applications.
What is distributed tracing?
Distributed tracing is a method used to track the flow of a request as it travels through multiple services in a distributed system. It's a crucial tool for understanding the performance and behavior of modern applications, especially those built using microservices architecture, where requests may pass through dozens or even hundreds of different services before returning a response.
History of distributed tracing
The concept of tracing across distributed systems was popularized by Dapper, a large-scale distributed tracing system that Google described in a 2010 paper. Following this, Twitter open-sourced Zipkin in 2012, which became one of the first widely adopted tracing solutions.
Early tracing systems, much like performance tracing in standalone applications, focused primarily on diagnosing latency and performance issues in distributed systems. Over time, tracing evolved beyond performance monitoring, expanding to areas such as distributed logging, context propagation, and aggregate analysis, which we’ll explore further in this article.
In the years that followed, two competing standards emerged. OpenTracing, launched in 2016, set out to define a standardized API for trace instrumentation compatible with a variety of distributed tracing tools. Google then open-sourced OpenCensus in 2018, a set of libraries for collecting distributed metrics and traces. In 2019, OpenCensus and OpenTracing merged to create OpenTelemetry, a unified framework that combines metrics, traces, and logs into a single, powerful observability solution.
Distributed tracing 101
When an API request enters the cloud — typically through a gateway or load balancer — it begins its journey across multiple services before eventually returning a response to the client, which could be software running on a device or a mobile phone.
To visualize this process, imagine a call pattern that looks something like this:
[Figure: example call pattern of a request flowing across multiple services]
In complex business applications, this represents the potential routes a request might take. However, it's crucial to establish the exact call path of a specific request. For example, if a user service exposes a /get-user API, it shouldn't call the /get-membership-details API unless the user is a member.
Here’s a representation of all possible call paths for such a request:
[Figure: all possible call paths for such a request]
This visualization shows three potential call graphs of a single request:
[Figure: three potential call graphs of a single request]
To accurately reconstruct the outcome of any given request, we need a distributed tracing system. Each service (A, B, C, etc.) needs to provide context to the tracing system about the work it performed. The distributed tracing system then compiles all this information to recreate the full request path, allowing us to visualize the precise flow of any individual request through the system.
Key concepts of distributed tracing
- Trace (traceID): A trace represents the complete journey of a single request across all the services involved in processing it. It provides a high-level view of the path the request takes through your system. Each trace has a unique traceID associated with it.
- Span (spanID): A trace is composed of multiple spans. A span represents a single unit of work within a service, such as processing an API request, querying a database, or calling an external API. Each span has a unique spanID, belongs to a larger trace, and can attach any number of span attributes for additional context (see the sketch after this list).
- Context propagation: As a request moves from one service to another, context (like a traceID and spanID) is passed along, allowing each service to associate its part of the process with the overall trace. This enables you to track the request's entire journey.
- Latency and performance insights: By visualizing the spans within a trace, you can identify bottlenecks, performance issues, or failures in specific services. For example, if a request is slow, tracing allows you to pinpoint which service is causing the delay.
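To make these ideas concrete, here's a minimal sketch using the OpenTelemetry Python API and SDK. The service, span, and attribute names are illustrative, and the console exporter simply prints each finished span, including its traceID, spanID, and parent spanID, to stdout:

```python
# A minimal sketch of traces, spans, and span attributes with OpenTelemetry.
# Requires the opentelemetry-api and opentelemetry-sdk packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("user-service")  # illustrative service name

# The outer span starts a new trace; the nested span shares the same traceID
# and records the outer span as its parent, which is how call paths are rebuilt.
with tracer.start_as_current_span("GET /get-user") as parent:
    parent.set_attribute("user.id", "42")  # a span attribute
    with tracer.start_as_current_span("db.query") as child:
        child.set_attribute("db.system", "postgresql")
```

Both spans print with the same trace_id but different span_ids, which is exactly the structure a tracing backend uses to reconstruct the request path.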
API gateways and novel use of distributed tracing
API gateways play a crucial role in distributed tracing, especially in large-scale cloud-native applications. As the entry point for most client requests, the API gateway is often the first system to receive and process an incoming request. In many cases, the full context of the request is only captured at this stage.
For effective observability, it's essential to start a trace and assign a traceID right at the API gateway. Once the traceID is assigned, along with an API identifier, all downstream activities and interactions of that request can be tracked and linked back to the initial point of entry into the system. This ensures that the entire journey of the request, from start to finish, is captured in a single, cohesive trace.
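As a rough sketch of what this looks like in code, assuming the OpenTelemetry Python API with its default W3C Trace Context propagator (the handler, header dict, and route name below are all hypothetical), an edge service can extract any incoming trace context and open the root span for the request:

```python
# A sketch of starting (or continuing) a trace at the edge of the system.
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("api-gateway")  # illustrative name

def handle_request(headers: dict, route: str):
    # Pull W3C trace context (traceparent/tracestate) from incoming headers;
    # if none is present, the span below starts a brand-new trace.
    ctx = extract(headers)
    with tracer.start_as_current_span(route, context=ctx) as span:
        span.set_attribute("http.route", route)  # the API identifier
        # ... forward the request to downstream services here
```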
In large companies with hundreds of microservices, distributed tracing is a foundational part of the infrastructure for tracking performance and optimizing system reliability. But beyond tracing, the concept of context propagation has emerged as a powerful tool that extends its benefits to many other areas of application development.
Context propagation refers to the ability to inject metadata into a request, which is then carried across various systems (e.g., message queues, stream processing, API calls) and protocols (such as gRPC or HTTP). As the request moves through the system, each service or component can add its own context to the metadata, allowing the entire journey of the request to be understood in a more nuanced way.
This capability has led to a variety of new use cases, including:
- End-to-end (E2E) testing: By tagging incoming requests with an "E2E test" marker in the context propagation baggage, different systems can handle these requests in special ways (see the sketch after this list). For example, some services may respond with mock data, prevent unintended side effects, or apply special processing logic. This makes it possible to run comprehensive E2E tests across your entire stack, simulating real-world conditions without affecting production systems.
- Routing delegates: If your microservices architecture needs to route traffic differently based on specific attributes of the customer — such as geographic location, customer type, or other demographic factors — you can inject this information into the telemetry baggage. This allows the system to dynamically route requests to different environments or clusters based on the context, providing more tailored processing paths for different segments of your user base.
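Here's a hedged sketch of the tagging idea using OpenTelemetry baggage in Python. The e2e-test key is an illustrative convention, not an OpenTelemetry standard; by default the SDK propagates both trace context and baggage as W3C headers:

```python
# A sketch of tagging a request via baggage and reading the tag downstream.
from opentelemetry import baggage
from opentelemetry.propagate import inject, extract

# Upstream: mark the request as an E2E test, then inject the resulting
# context (traceparent + baggage headers) into the outgoing request headers.
ctx = baggage.set_baggage("e2e-test", "true")
headers = {}
inject(headers, context=ctx)

# Downstream: restore the context from the incoming headers and branch.
incoming = extract(headers)
if baggage.get_baggage("e2e-test", incoming) == "true":
    pass  # e.g., return mock data or suppress side effects
```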
The ability to propagate context throughout your entire system opens up a wide range of possibilities for more intelligent and efficient system behavior. As the technology continues to evolve, we’re likely to see even more innovative applications of context propagation across various domains.
A powerful evolution to include logs and metrics
When distributed tracing first emerged, the primary focus was on tracking request flow and measuring latency through the software stack. Over time, however, the industry recognized that combining tracing with logging and metrics significantly enhances observability. When a traced request is identified as slow or underperforming, the next logical step is to dive deeper into debugging the root cause.
To achieve this, you need to correlate the trace with the relevant logs and metrics generated during the lifecycle of that API request. While OpenTelemetry itself doesn't store your logs and metrics, it provides the APIs and exporters needed to integrate important metadata — such as traceID, spanID, and additional span attributes — into other observability dimensions like logs and metrics. This integration allows for more comprehensive insights and faster troubleshooting.
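One common pattern, sketched below in Python, is to stamp each log record with the active traceID and spanID (formatted as the 32- and 16-character hex strings used by W3C trace context) so that logs can be joined against traces in the backend. The logger setup and field names here are illustrative:

```python
# A minimal sketch of trace/log correlation: attach the current traceID
# and spanID to every log record emitted inside a span.
import logging

from opentelemetry import trace

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s trace_id=%(otel_trace_id)s span_id=%(otel_span_id)s %(message)s",
)
log = logging.getLogger("user-service")

def log_in_trace(msg: str) -> None:
    sc = trace.get_current_span().get_span_context()
    log.info(msg, extra={
        "otel_trace_id": format(sc.trace_id, "032x"),  # W3C-style hex
        "otel_span_id": format(sc.span_id, "016x"),
    })
```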
OpenTelemetry’s role in simplifying instrumentation
Before OpenTelemetry, monitoring and debugging distributed systems required multiple (often incompatible) tools and frameworks. The effort to instrument services was cumbersome and fragmented, especially when integrating different observability solutions for metrics, traces, and logs.
OpenTelemetry solves this problem by providing a single, unified framework for collecting telemetry data, drastically simplifying the instrumentation process. Instead of having to manually configure and manage separate solutions for tracing, logging, and monitoring, OpenTelemetry streamlines the process by providing a cohesive approach that works across multiple services and languages.
Major components of OpenTelemetry
- Vendor-agnostic API: A specification in multiple programming languages that provides standardized interfaces for emitting telemetry data, making it easy to instrument applications regardless of the vendor or underlying platform.
- SDK (software development kit): A set of libraries and tools that implement the vendor-agnostic API in various languages, allowing developers to easily integrate telemetry data collection into their code.
- Collector: A vendor-neutral service that acts as a proxy to receive, process, and export telemetry data. The Collector can accept data in various formats — such as OTLP, Jaeger, Prometheus, and others — and can send this data to multiple backend observability systems for further analysis.
- Exporters: Components that transmit telemetry data from the SDK or the Collector to backend systems such as Jaeger, Zipkin, Prometheus, or other vendor-specific observability platforms, enabling centralized monitoring and analysis (see the sketch after this list).
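As a rough sketch of how these components fit together, the snippet below wires the Python SDK to an OTLP exporter that ships spans to a Collector, assumed here to listen on the conventional OTLP/gRPC port 4317 on localhost (requires the opentelemetry-exporter-otlp package):

```python
# A hedged sketch: SDK + OTLP exporter shipping spans to a local Collector.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "user-service"})  # illustrative
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
# From here, every span created via trace.get_tracer(...) flows to the
# Collector, which can route it to Jaeger, Zipkin, or any configured backend.
```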
Interoperability and ecosystem growth
A key benefit of OpenTelemetry is its ability to integrate with a wide range of observability tools. Since it's designed to be vendor-agnostic, it allows organizations to use their preferred backends for analysis and visualization. Whether you’re using Prometheus for metrics, Jaeger for tracing, or Elasticsearch for logs, OpenTelemetry can export your data to any of these systems and more.
The OpenTelemetry project has seen rapid growth and adoption, with major cloud-native platforms like Kubernetes and cloud providers like AWS, Google Cloud, and Microsoft Azure offering support for OpenTelemetry-based instrumentation. As the ecosystem continues to expand, more integrations and out-of-the-box solutions are becoming available, further cementing OpenTelemetry’s place as the go-to observability framework for modern applications.
Industry adoption
Since its introduction, OpenTelemetry has quickly gained adoption across industries, from small startups to large enterprises. Companies building cloud-native applications recognize the value of having a standardized approach to observability, and OpenTelemetry offers just that.
By adopting OpenTelemetry, organizations can gain better insights into the performance and health of their microservices architectures, leading to faster troubleshooting, improved system reliability, and better user experiences.
Conclusion
As distributed systems continue to evolve, the need for robust observability tools is more critical than ever. OpenTelemetry stands at the forefront of this shift, offering a powerful, open-source framework that simplifies instrumentation and enhances interoperability across the entire observability stack. By adopting OpenTelemetry, organizations can gain deep insights into their applications, troubleshoot issues faster, and ensure systems run smoothly in production.
If you’re looking to future-proof your observability strategy, OpenTelemetry is the framework you need to explore.
About the Author
Madan Thangavelu is a Kong Champion and Sr. Director of Engineering at Uber, with over 15 years in the software industry and extensive experience across fintech, mobile security, and large-scale consumer applications. Madan founded Uber's API platform, which processes millions of concurrent trips worldwide. In his current role, he leads the development of the flagship Uber Rider app and oversees major business platforms including fulfillment, fares, and API frameworks. "As the engineering lead for API platforms at Uber, I enjoy sharing our learnings with the API community. Kong has a fantastic community that can benefit from such information."
Interested in becoming a Kong Champion or learning more about the program? Visit the Kong Champions page.