Kong Konnect Data Plane Node Autoscaling with Cluster Autoscaler on Amazon EKS 1.29
After getting our Konnect Data Planes vertically and horizontally scaled with VPA and HPA, it's time to explore the Kubernetes Node Autoscaler options. In this post, we start with the Cluster Autoscaler mechanism. (Part 4 in this series is dedicated to Karpenter.)
Cluster Autoscaler
The Kubernetes Cluster Autoscaler documentation describes it as a tool that automatically adjusts the size of the Kubernetes Cluster when one of the following conditions is true:
- There are pods that failed to run in the cluster due to insufficient resources.
- There are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.
The following Cluster Autoscaler diagram was taken from the AWS Best Practices Guides portal.
![](https://prd-mktg-konghq-com.imgix.net/images/2024/02/65c507d3-image1-6.png?auto=format&fit=max&w=2560)
Amazon EKS Cluster Autoscaler
The Amazon EKS Cluster Autoscaler implementation relies on the EC2 Auto Scaling Group (ASG) capability to manage NodeGroups. That is why we used the eksctl --asg-access parameter when we created our EKS Cluster. Check the eksctl autoscaling page to learn more.
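As a reminder, the cluster creation looked roughly like the sketch below. This is not the exact command from the earlier parts of the series: the NodeGroup name, instance type, and sizes are taken from the outputs shown later in this post, and the remaining flags are assumptions (the real setup also creates the httpbin and fortio NodeGroups).
eksctl create cluster --name kong35-eks129-autoscaling \
  --region us-west-1 \
  --version 1.29 \
  --nodegroup-name nodegroup-kong \
  --node-type t3.large \
  --nodes 1 --nodes-min 1 --nodes-max 10 \
  --asg-access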
In fact, you can list all ASGs your EKS Cluster has in place with the following command, which queries the MinSize, MaxSize, and DesiredCapacity values for each one.
aws autoscaling \
describe-auto-scaling-groups \
--region us-west-1 \
--query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='kong35-eks129-autoscaling']].[AutoScalingGroupName, MinSize, MaxSize, DesiredCapacity]" \
--output table
The expected output should look like this. As shown, each NodeGroup has its own ASG defined. Of course, we are interested in the kong NodeGroup.
--------------------------------------------------------------------------------
| DescribeAutoScalingGroups |
+--------------------------------------------------------------+----+-----+----+
| eks-nodegroup-fortio-f6c6a3b5-d248-111b-e890-cb5114697cae | 1 | 10 | 1 |
| eks-nodegroup-httpbin-72c6a3b5-c047-c19d-fde0-2646caa270df | 1 | 10 | 1 |
| eks-nodegroup-kong-3ac6a3c8-86de-d97c-42e2-0d788168cbbb | 1 | 10 | 1 |
+--------------------------------------------------------------+----+-----+----+
EKS Cluster Autoscaler and IRSA (IAM Roles for Service Accounts)
The EKS Cluster Autoscaler runs as a Kubernetes Deployment. As such, it's recommended to use IRSA (IAM Roles for Service Accounts) to associate the Service Account that the Cluster Autoscaler Deployment runs as with an IAM Role that is allowed to perform these functions.
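For reference, the end result of the IRSA wiring is simply a Service Account carrying an eks.amazonaws.com/role-arn annotation that points at the IAM Role, roughly like the sketch below (the account ID is a placeholder). We won't write it by hand; eksctl will create it for us later in this post.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<your_aws_account>:role/kong35-eks129-autoscaling-role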
IAM OIDC provider
IRSA requires the EKS Cluster to be associated with an IAM OpenID Connect Provider. This can be done with a specific eksctl command:
eksctl utils associate-iam-oidc-provider --cluster kong35-eks129-autoscaling --region=us-west-1 --approve
You can check the IAM OIDC provider with the following command. You should see your EKS Cluster OIDC Issuer endpoint listed as an ARN.
aws iam list-open-id-connect-providers
If you like, you can also check the OIDC Issuer endpoint with:
aws eks describe-cluster --name kong35-eks129-autoscaling --region us-west-1 | jq -r ".cluster.identity"
IAM Policy
IRSA is based on a Kubernetes Service Account and IAM Role pair. The Kubernetes Deployment refers to the Service Account, which has the IAM Role set as an annotation. Finally, the IAM Role allows the Deployment to access AWS services, including, in our case, the Auto Scaling Group (ASG) services.
Here is an IAM Policy example that allows the Role to access the ASG services:
aws iam create-policy \
--policy-name AmazonEKSClusterAutoscalerPolicy \
--policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": ["*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeImages",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
      ],
      "Resource": ["*"]
    }
  ]
}'
Service Account and IAM Role
With the IAM Policy created, we can run another eksctl command to create both the Kubernetes Service Account and the IAM Role:
eksctl create iamserviceaccount \
--name cluster-autoscaler \
--namespace kube-system \
--cluster kong35-eks129-autoscaling \
--region us-west-1 \
--approve \
--role-name kong35-eks129-autoscaling-role \
--override-existing-serviceaccounts \
--attach-policy-arn $(aws iam list-policies --query 'Policies[?PolicyName==`AmazonEKSClusterAutoscalerPolicy`].Arn' --output text)
If you check the Service Account, you will see it has the required annotation referring to the Role created by the iamserviceaccount command:
kubectl describe sa cluster-autoscaler -n kube-system
You can also check the Role:
aws iam get-role --role-name kong35-eks129-autoscaling-role
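The most interesting part of the output is the trust relationship (AssumeRolePolicyDocument). For an IRSA Role it should look roughly like this, with the cluster's OIDC provider as the federated principal and a condition scoping the Role to the cluster-autoscaler Service Account (the account ID and OIDC provider ID below are placeholders):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<your_aws_account>:oidc-provider/oidc.eks.us-west-1.amazonaws.com/id/<oidc_id>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-1.amazonaws.com/id/<oidc_id>:sub": "system:serviceaccount:kube-system:cluster-autoscaler"
        }
      }
    }
  ]
}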
As expected, the Role has our Policy attached:
aws iam list-attached-role-policies --role-name kong35-eks129-autoscaling-role
Deploy Cluster Autoscaler with Helm
Now we are ready to install Cluster Autoscaler in our EKS Cluster. This can be done using its Helm Chart. Note that the command refers to the Service Account and IAM Role created previously.
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
-n kube-system \
--set autoDiscovery.clusterName=kong35-eks129-autoscaling \
--set cloudProvider=aws \
--set awsRegion=us-west-1 \
--set image.tag=v1.29.0 \
--set rbac.serviceAccount.create=false \
--set rbac.serviceAccount.name=cluster-autoscaler \
--set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::<your_aws_account>:role/kong35-eks129-autoscaling-role
You can check the Cluster Autoscaler log files to see how the installation went:
kubectl logs -f -l app.kubernetes.io/name=aws-cluster-autoscaler -n kube-system
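Beyond the logs, Cluster Autoscaler also maintains a status ConfigMap in kube-system, where it reports the NodeGroups it has discovered and its latest scale-up and scale-down activity. Assuming the default ConfigMap name is in use, you can inspect it with:
kubectl describe configmap cluster-autoscaler-status -n kube-system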
Check Cluster Autoscaler at work
With the existing Load Generator still running (if you don't have it, please start one), you should see a new status for both the ASG and the Konnect Data Plane Kubernetes Deployment:
aws autoscaling \
describe-auto-scaling-groups \
--region us-west-1 \
--query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='kong35-eks129-autoscaling']].[AutoScalingGroupName, MinSize, MaxSize, DesiredCapacity]" \
--output table
--------------------------------------------------------------------------------
| DescribeAutoScalingGroups |
+--------------------------------------------------------------+----+-----+----+
| eks-nodegroup-fortio-f6c6a3b5-d248-111b-e890-cb5114697cae | 1 | 10 | 1 |
| eks-nodegroup-httpbin-72c6a3b5-c047-c19d-fde0-2646caa270df | 1 | 10 | 1 |
| eks-nodegroup-kong-3ac6a3c8-86de-d97c-42e2-0d788168cbbb | 1 | 10 | 2 |
+--------------------------------------------------------------+----+-----+----+
The ASG has requested a new Node instance:
% kubectl top node --selector='eks.amazonaws.com/nodegroup=nodegroup-kong'
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-192-168-11-188.us-west-1.compute.internal 2000m 103% 6641Mi 93%
ip-192-168-53-177.us-west-1.compute.internal 54m 2% 1256Mi 17%
And then, the Pending Pods, after getting scheduled, are finally running:
% kubectl get pod -n kong
NAME READY STATUS RESTARTS AGE
kong-kong-789ffd6f86-bmgh8 1/1 Running 0 15m
kong-kong-789ffd6f86-fcnts 1/1 Running 0 15m
kong-kong-789ffd6f86-k7cnm 1/1 Running 0 16m
kong-kong-789ffd6f86-nsqnn 1/1 Running 0 5m14s
kong-kong-789ffd6f86-sf65t 1/1 Running 0 7m10s
kong-kong-789ffd6f86-tv96x 1/1 Running 0 22m
kong-kong-789ffd6f86-twzd8 1/1 Running 0 16m
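If you want to see the trigger itself, Cluster Autoscaler records an event on the Pods that caused the scale-up. Assuming the default TriggeredScaleUp event reason, you can filter for it with:
kubectl get events -n kong --field-selector reason=TriggeredScaleUp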
Submit the Konnect Data Plane to an even higher throughput
So, what happens if we submit our Konnect Data Plane Deployment to a longer and higher-throughput load? For instance, we can change the HPA policy like this, allowing up to 20 replicas to run:
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kong-hpa
  namespace: kong
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kong-kong
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
EOF
Also, we can submit a heavier load test with:
kubectl delete pod fortio
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fortio
  labels:
    app: fortio
spec:
  containers:
  - name: fortio
    image: fortio/fortio
    args: ["load", "-c", "800", "-qps", "5000", "-t", "60m", "-allow-initial-errors", "http://kong-kong-proxy.kong.svc.cluster.local:80/route1/get"]
  nodeSelector:
    nodegroupname: fortio
EOF
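While the test runs, a simple way to follow the scale-up is to watch the kong NodeGroup, reusing the same label selector we used with kubectl top node:
kubectl get nodes --selector='eks.amazonaws.com/nodegroup=nodegroup-kong' --watch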
The expected result is to have extra Nodes created along the way. Here are the HPA, ASG, and Kubernetes Node outputs:
% kubectl get hpa -n kong
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
kong-hpa Deployment/kong-kong 73%/75% 1 20 14 35m
aws autoscaling \
describe-auto-scaling-groups \
--region us-west-1 \
--query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='kong35-eks129-autoscaling']].[AutoScalingGroupName, MinSize, MaxSize, DesiredCapacity]" \
--output table
--------------------------------------------------------------------------------
| DescribeAutoScalingGroups |
+--------------------------------------------------------------+----+-----+----+
| eks-nodegroup-fortio-f6c6a3b5-d248-111b-e890-cb5114697cae | 1 | 10 | 1 |
| eks-nodegroup-httpbin-72c6a3b5-c047-c19d-fde0-2646caa270df | 1 | 10 | 1 |
| eks-nodegroup-kong-3ac6a3c8-86de-d97c-42e2-0d788168cbbb | 1 | 10 | 4 |
+--------------------------------------------------------------+----+-----+----+
% kubectl top node --selector='eks.amazonaws.com/nodegroup=nodegroup-kong'
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-192-168-11-188.us-west-1.compute.internal 1250m 64% 5870Mi 82%
ip-192-168-28-192.us-west-1.compute.internal 1254m 64% 6366Mi 89%
ip-192-168-48-1.us-west-1.compute.internal 839m 43% 670Mi 9%
ip-192-168-53-177.us-west-1.compute.internal 1212m 62% 6268Mi 88%
The problem is that all Nodes are based on the same original t3.large Instance Type. You can check them out with:
% aws autoscaling describe-auto-scaling-groups --region us-west-1 | jq '.AutoScalingGroups[] | select (.Tags[0].Value == "kong35-eks129-autoscaling") | { AutoScalingGroupName }' | jq -r '.AutoScalingGroupName'
eks-nodegroup-fortio-f6c6a3b5-d248-111b-e890-cb5114697cae
eks-nodegroup-httpbin-72c6a3b5-c047-c19d-fde0-2646caa270df
eks-nodegroup-kong-3ac6a3c8-86de-d97c-42e2-0d788168cbbb
% aws autoscaling describe-auto-scaling-groups --auto-scaling-group-name eks-nodegroup-kong-3ac6a3c8-86de-d97c-42e2-0d788168cbbb --region us-west-1 | jq ".AutoScalingGroups[].Instances[].InstanceType"
"t3.large"
"t3.large"
"t3.large"
"t3.large"
We could address that by deploying multiple instances of Cluster Autoscaler, each configured to work with a specific set of NodeGroups. However, that's not recommended, as Cluster Autoscaler was not designed for this kind of configuration, and it could lead to a situation where multiple Autoscalers would attempt to schedule a Pod. Please check the EKS Best Practices Guide for Cluster Autoscaler provided by AWS.
That's one of the main reasons to switch to a more flexible and smarter cluster autoscaler such as Karpenter. Let's check it out in the next (and final) part of this series, Kong Konnect Data Plane Node Autoscaling with Karpenter on Amazon EKS 1.29.