Engineering
August 22, 2024
9 min read

Using Service Mesh Within Your Kubernetes Environment

Kong

Container technologies are always evolving, and we're not talking Tupperware here. Over the past few years, the service mesh has emerged as a crucial component for managing complex, distributed systems. As organizations increasingly adopt Kubernetes to orchestrate their containerized applications, understanding how to effectively implement and use a service mesh within this environment becomes paramount. This guide will explore service mesh adoption in Kubernetes, its benefits, and practical steps for implementation.

Understanding service mesh in Kubernetes

In the evolution of these technologies, containers and container orchestration came first; the service mesh is one of the ideas that came after. Traditionally, when we think about load balancing, we think about north-south load balancing: a user somewhere sends traffic into a gateway, which might be an Ingress in Kubernetes, and it flows down to your pods.

What a service mesh addresses, however, is east-west load balancing: communication between different services within your cluster. For example, if you have a service S1 that needs to talk to another service S2, the service mesh handles how that traffic flows between the two.


Key features of service mesh

Service mesh brings several valuable features to the table. One of the primary features is the use of sidecars. Many service meshes are installed as sidecars, meaning that alongside your main application container in a pod, you have a mesh sidecar: a separate container within the same pod. The two containers talk to each other via localhost. This approach means that as an application developer, you don't have to think much about how to integrate the service mesh into your application. You simply add its container to your pod definition, and the mesh's features show up automatically.
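
To make this concrete, here's a minimal sketch of how sidecar injection is commonly enabled, assuming Istio as the mesh (the mechanism varies by product); the `production` namespace name is hypothetical:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production             # hypothetical namespace
  labels:
    istio-injection: enabled   # Istio's control plane injects an istio-proxy sidecar into new pods here
```

With the label in place, every pod created in the namespace automatically gets the proxy container, and your application reaches it over localhost without any code changes.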

Another key feature is service authorization. A service mesh provides a notion of authorization, allowing you to define which services may talk to each other. This is useful for establishing least privilege and preventing mistakes, such as a development service inadvertently sending too much traffic to a production instance.
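
As a sketch of what such a policy can look like, again assuming Istio (the workload labels and service account names are hypothetical), this allows only the frontend's identity to call the payments service:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-frontend     # hypothetical policy name
  namespace: production
spec:
  selector:
    matchLabels:
      app: payments                 # applies to the payments workload only
  action: ALLOW
  rules:
    - from:
        - source:
            # only requests authenticated as the frontend service account are allowed
            principals: ["cluster.local/ns/production/sa/frontend"]
```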

Service mesh also enables you to conduct experiments or canary deployments. For instance, you could route 99% of traffic to your production service and 1% to a new version for testing. This allows you to minimize the impact of changes before completing a full rollout.

Lastly, the service mesh proxy can collect a variety of metrics such as latency, HTTP errors, and user agents. These metrics can be pushed to monitoring systems like Azure Monitor or open-source projects like Prometheus, providing valuable insight into service behavior and performance.

Real-world scenario: Preventing outages

To illustrate the practical benefits of service mesh, let's consider a real-world scenario. A large e-commerce retailer experienced a significant outage in their payment system due to issues with their Kubernetes Ingress controller. This particular retailer offered a special payment method for their loyal users, allowing them to pay via invoice. Before offering this payment method, a risk assessment had to be completed to determine whether the user qualified.

During the outage, the payment option was completely lost, and users were unable to submit orders or make payments through the e-commerce platform. The root cause? The invoice payment method relied on a risk assessment service that was served through the Kubernetes Ingress controller. During the incident, the team observed many HTTP 5xx errors, particularly HTTP 502 Bad Gateway errors, indicating that the proxy and traffic handling were unresponsive or unavailable.

Further investigation revealed that the Ingress controller had become unavailable because Kubernetes killed it after an out-of-memory (OOM) event. This happened because another application in the shared environment sat behind the same Ingress controller. As load increased, the Ingress controller began queuing requests, consuming more and more threads and memory until it eventually hit its memory limit.

This scenario highlights the importance of proper resource management in Kubernetes and the potential benefits of using a service mesh to prevent such outages.

Implementing service mesh features

Let's delve deeper into some key features of service mesh that could have prevented or mitigated the issue in our real-world scenario:

  • Request Timeout: When we implement the service mesh and define a timeout threshold in our spec, any request taking longer than this threshold is automatically terminated, and an error is returned to the calling client, as sketched in the configuration after this list. This protects services from being impacted by other slow or unhealthy services. In our retailer's example, this feature would have prevented slow requests from building up a queue and ultimately killing the Ingress controller.
  • Rate Limit: Using a mechanism like an Envoy filter, we can cap the number of connections to a particular service. Any traffic above that limit during a specified timeframe is blocked. This is particularly useful for security use cases, such as mitigating DDoS attacks, and it helps protect services from unexpected load increases.
  • Circuit Breaker: This feature stops sending traffic to pods that have been deemed unhealthy. Once a pod becomes healthy again and can handle requests, the service mesh automatically starts to reroute traffic to it. This ensures that our load balancer doesn't oversaturate pods that are already overwhelmed.
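
Here's a minimal sketch of how these three guards might be expressed, assuming Istio as the mesh; the `risk-assessment` service name and the thresholds are hypothetical. The `VirtualService` enforces the request timeout, while the `DestinationRule` caps connections and pending requests (a coarse form of rate limiting; per-request rate limits would typically require an `EnvoyFilter`) and configures the circuit breaker through outlier detection:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: risk-assessment              # hypothetical service
spec:
  hosts:
    - risk-assessment
  http:
    - timeout: 2s                    # requests slower than this are terminated with an error
      route:
        - destination:
            host: risk-assessment
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: risk-assessment
spec:
  host: risk-assessment
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent connections to the service
      http:
        http1MaxPendingRequests: 50  # bound the request queue instead of letting it grow unchecked
    outlierDetection:                # circuit breaker: temporarily eject pods that keep failing
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```

In the retailer's incident, a bounded queue and an aggressive timeout alone would likely have kept slow risk-assessment calls from exhausting the Ingress controller's memory.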

These features work together to create a more resilient and stable system. The request timeout prevents slow services from impacting others, the rate limit protects against unexpected traffic spikes, and the circuit breaker ensures that traffic is only routed to healthy pods.

Additional service mesh capabilities

Beyond these core features, service mesh offers several other capabilities that complement microservices architecture:

  • Retry Logic: When a request from one service to another fails, the service mesh can automatically retry it (see the sketch after this list). This improves the overall reliability of your system by handling transient failures gracefully.
  • Authentication: A service mesh can manage authentication between different services and handle certificate management for client and server (mutual TLS) authentication, also sketched below. This centralizes and simplifies security management across your microservices.
  • Traffic Split: This feature lets you manage where and how traffic gets routed, controlling the distribution of requests across different endpoints. This is particularly useful for implementing canary releases or blue-green deployments; a concrete example appears in the canary release section below.
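
As a sketch of the first two capabilities, again assuming Istio and with illustrative names and values: a `VirtualService` declares the retry policy, and a `PeerAuthentication` resource turns on strict mutual TLS so the sidecars handle certificates on both ends:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders                       # hypothetical service
spec:
  hosts:
    - orders
  http:
    - retries:
        attempts: 3                  # retry transient failures up to three times
        perTryTimeout: 500ms
        retryOn: 5xx,connect-failure
      route:
        - destination:
            host: orders
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production              # hypothetical namespace
spec:
  mtls:
    mode: STRICT                     # sidecars require mutual TLS; the mesh issues and rotates the certificates
```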

Observability in service mesh

One of the significant advantages of using a service mesh is the enhanced observability it provides. Service meshes typically offer observability signals such as metrics, traces, and logs, which can be sent to your observability platform for analysis, alerting, and automation.

The health of the service mesh itself and the control plane is usually observed through provided Prometheus metrics. The service mesh activity related to the actual applications is observed through metrics, traces, and logs produced by the service mesh sidecar container proxies.
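
In Istio, for instance, which signals the sidecars emit can itself be configured declaratively through the mesh-wide `Telemetry` resource; a minimal sketch (provider names depend on your installation):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system            # mesh-wide defaults live in the mesh's root namespace
spec:
  metrics:
    - providers:
        - name: prometheus           # sidecars expose request metrics for Prometheus to scrape
  tracing:
    - randomSamplingPercentage: 10.0 # sample 10% of requests for distributed tracing
```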

Distributed tracing allows you to follow transactions end-to-end, from the entry point into the cluster through the Ingress, and then through the microservices architecture. For example, you can see how traffic first hits the Ingress Gateway, then is routed to the frontend microservice through the Ingress Service within the service mesh. As the transaction flows through the architecture, you can observe the egress and ingress of each respective service.

These traces provide rich metadata about each transaction, such as protocols used, HTTP response codes, and any error messages or context attributes. This level of detail allows you to inspect what is happening, how transactions are being routed, and how much time is being taken at each hop.

Implementing canary releases with service mesh

Service mesh makes it easy to implement canary release strategies when deploying new versions of your microservices. This allows you to limit the number of requests that hit the new version, enabling you to test and validate performance improvements without impacting all users and transactions.

For example, you might implement an 80/20 traffic split between the old and new versions of a service. In this scenario, you could observe a transaction making three calls to the Product Catalog service, where two hit the original version with a 1-second response time, while the third call to the new version has a much faster response time of mere milliseconds.
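
A sketch of that 80/20 split using Istio's traffic routing; the `product-catalog` host is taken from the example above, and the `v1`/`v2` subsets assume a matching `DestinationRule` that defines them:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: product-catalog
spec:
  hosts:
    - product-catalog
  http:
    - route:
        - destination:
            host: product-catalog
            subset: v1               # current version keeps 80% of traffic
          weight: 80
        - destination:
            host: product-catalog
            subset: v2               # canary version receives 20%
          weight: 20
```

Promoting the canary is then just a matter of shifting the weights until the new subset receives 100% of the traffic.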

Once you've validated the new release and its performance improvement, you can simply modify your traffic rules to redirect all traffic to the new version. Your observability data will then confirm the improved response time and increased throughput.

Deployment and configuration of service mesh

The configuration of the service mesh itself is performed through the control plane, typically done through the use of an operator. This allows you to control which namespaces in your cluster are utilizing the service mesh and how they're using it.

For more specific configuration of capabilities, you can use Custom Resource Definitions (CRDs) applied to the target namespace you're trying to control with the service mesh. As proxy containers are deployed into the pods via the control plane, all the rule sets that help manage your microservices architecture can be applied in real time.

Conclusion

Adopting a service mesh in your Kubernetes environment can significantly enhance your ability to manage, secure, and observe your microservices architecture. It allows you to utilize key microservice architecture capabilities in an automated fashion, alleviating your application developers from having to code these features into the microservices themselves.

Remember, managing resources in Kubernetes is extremely important, especially when dealing with large-scale implementations with many apps and workloads sharing resources on a cluster. By properly defining quotas, requests, and limits, you can mitigate the impact of heavy workloads and resource consumption on other containers running in that environment.
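
A minimal sketch of what that looks like on a container spec; the numbers are illustrative and should be sized from observed usage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ingress-controller                    # hypothetical pod
spec:
  containers:
    - name: controller
      image: example/ingress-controller:1.0   # hypothetical image
      resources:
        requests:        # what the scheduler reserves for this container
          cpu: 250m
          memory: 256Mi
        limits:          # hard ceilings; exceeding the memory limit triggers an OOM kill
          cpu: 500m
          memory: 512Mi
```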

Finally, while the service mesh provides crucial observability signals, these signals on their own are not enough. You need a powerful observability backend that can enrich, correlate, and analyze this data, preferably with the help of artificial intelligence. This combination of service mesh and advanced observability will set you up for success in your Kubernetes and microservices journey.

As you embark on your service mesh adoption journey, remember that it's a powerful tool that can help you build more reliable applications. By leveraging features like service authorization, canary deployments, and comprehensive metrics collection, you can create a more robust, secure, and observable microservices architecture. The key is to start small, understand your specific needs, and gradually expand your use of service mesh features as you become more comfortable with the technology.

Service mesh Kubernetes FAQs

Q: What is a service mesh in Kubernetes?

A: A service mesh in Kubernetes is a technology that manages communication between different services within a cluster, focusing on east-west load balancing. It handles traffic flow between services, providing features like sidecars, service authorization, and metrics collection.

Q: What are the key features of a service mesh?

A: Key features of a service mesh include sidecars for easy integration, service authorization for defining which services can communicate, canary deployments for testing new versions, and metrics collection for monitoring service behavior and performance.

Q: How can a service mesh prevent outages in Kubernetes?

A: A service mesh can prevent outages by implementing features like request timeouts to terminate long-running requests, rate limits to control traffic flow, and circuit breakers to limit traffic to unhealthy pods. These features work together to create a more resilient and stable system.

Q: What is the role of observability in service mesh?

A: Observability in service mesh provides enhanced insights through metrics, traces, and logs. It allows you to follow transactions end-to-end, observe how traffic is routed, and inspect response times and error messages, enabling better monitoring and troubleshooting of your microservices architecture.

Q: How does a service mesh facilitate canary releases?

A: A service mesh makes it easy to implement canary releases by allowing you to control traffic distribution between different versions of a service. For example, you can split traffic 80/20 between old and new versions, enabling you to test and validate performance improvements without impacting all users.

Q: How is a service mesh deployed and configured in Kubernetes?

A: A service mesh is typically deployed and configured through the control plane using an operator. Custom Resource Definitions (CRDs) are used for more specific configurations in target namespaces. The control plane deploys proxy containers into pods, applying rule sets in real time to manage the microservices architecture.