By on December 19, 2017

How to Design a Scalable Rate Limiting Algorithm

Rate limiting protects your APIs from overuse by limiting how often each user can call the API. This protects them from inadvertent or malicious overuse. Without rate limiting, each user may request as often as they like, which can lead to “spikes” of requests that starve other consumers. After rate limiting is enabled, they are limited to a fixed number of requests per second.


In the example chart, you can see how rate limiting blocks requests over time. The API was initially receiving 4 requests per minute shown in green. When rate limiting was enabled at 12:02, additional requests shown in red are denied.

Rate limiting is very important for public APIs where you want to maintain a good quality of service for every consumer, even when some users take more than their fair share. Computationally-intensive endpoints are particularly in need of rate limiting – especially when served by auto-scaling, or by pay-by-the-computation services like AWS Lambda and OpenWhisk. You also may want to rate limit APIs that serve sensitive data, because this could limit the data exposed if an attacker gains access in some unforeseen event.

There are actually many different ways to enable rate limiting, and we will explore the pros and cons of different rate limiting algorithms. We will also explore the issues that come up when scaling across a cluster. Lastly, we’ll show you an example of how to quickly set up rate limiting using Kong, which is the most popular open-source API gateway.


Rate Limiting Algorithms

There are various algorithms for rate limiting, each with their own benefits and drawbacks. Let’s review each of them so you can pick the best one for your needs.


Leaky Bucket

Leaky bucket (closely related to token bucket) is an algorithm that provides a simple, intuitive approach to rate limiting via a queue which you can think of as a bucket holding the requests. When a request is registered, it is appended to the end of the queue. At a regular interval, the first item on the queue is processed. This is also known as a first in first out (FIFO) queue. If the queue is full, then additional requests are discarded (or leaked).



The advantage of this algorithm is that it smooths out bursts of requests and processes them at an approximately average rate. It’s also easy to implement on a single server or load balancer, and is memory efficient for each user given the limited queue size.

However, a burst of traffic can fill up the queue with old requests and starve more recent requests from being processed. It also provides no guarantee that requests get processed in a fixed amount of time. Additionally, if you load balance servers for fault tolerance or increased throughput, you must use a policy to coordinate and enforce the limit between them. We will come back to challenges of distributed environments later.


Fixed Window

In a fixed window algorithm, a window size of n seconds (typically using human-friendly values, such as 60 or 3600 seconds) is used to track the rate. Each incoming request increments the counter for the window. If the counter exceeds a threshold, the request is discarded. The windows are typically defined by the floor of the current timestamp, so 12:00:03 with a 60 second window length, would be in the 12:00:00 window.




The advantage of this algorithm is that it ensures more recent requests gets processed without being starved by old requests. However, a single burst of traffic that occurs near the boundary of a window can result in twice the rate of requests being processed, because it will allow requests for both the current and next windows within a short time. Additionally, if many consumers wait for a reset window, for example at the top of the hour, then they may stampede your API at the same time.


Sliding Log

Sliding Log rate limiting involves tracking a time stamped log for each consumer’s request. These logs are usually stored in a hash set or table that is sorted by time. Logs with timestamps beyond a threshold are discarded. When a new request comes in, we calculate the sum of logs to determine the request rate. If the request would exceed the threshold rate, then it is held.



The advantage of this algorithm is that it does not suffer from the boundary conditions of fixed windows. The rate limit will be enforced precisely. Also, because the sliding log is tracked for each consumer, you don’t have the stampede effect that challenges fixed windows. However, it can be very expensive to store an unlimited number of logs for every request. It’s also expensive to compute because each request requires calculating a summation over the consumer’s prior requests, potentially across a cluster of servers. As a result, it does not scale well to handle large bursts of traffic or denial of service attacks.


Sliding Window

This is a hybrid approach that combines the low processing cost of the fixed window algorithm, and the improved boundary conditions of the sliding log. Like the fixed window algorithm, we track a counter for each fixed window. Next, we account for a weighted value of the previous window’s request rate based on the current timestamp to smooth out bursts of traffic. For example, if the current window is 25% through, then we weight the previous window’s count by 75%. The relatively small number of data points needed to track per key allows us to scale and distribute across large clusters.



We recommend the sliding window approach because it gives the flexibility to scale rate limiting with good performance. The rate windows are an intuitive way she to present rate limit data to API consumers. It also avoids the starvation problem of leaky bucket, and the bursting problems of fixed window implementations.


Rate Limiting in Distributed Systems


Synchronization Policies

If you want to enforce a global rate limit when you are using a cluster of multiple nodes, you must set up a policy to enforce it. If each node were to track its own rate limit, then a consumer could exceed a global rate limit when requests are sent to different nodes. In fact, the greater the number of nodes, the more likely the user will be able to exceed the global limit.

The simplest way to enforce the limit is to set up sticky sessions in your load balancer so that each consumer gets sent to exactly one node. The disadvantages include a lack of fault tolerance and scaling problems when nodes get overloaded.

A better solution that allows more flexible load-balancing rules is to use a centralized data store such as Redis or Cassandra. This will store the counts for each window and consumer. The two main problems with this approach are increased latency making requests to the data store, and race conditions, which we will discuss next.


Race Conditions

One of the largest problems with a centralized data store is the potential for race conditions in high concurrency request patterns. This happens when you use a naïve “get-then-set” approach, wherein you retrieve the current rate limit counter, increment it, and then push it back to the datastore. The problem with this model is that in the time it takes to perform a full cycle of read-increment-store, additional requests can come through, each attempting to store the increment counter with an invalid (lower) counter value. This allows a consumer sending a very high rate of requests to bypass rate limiting controls.



One way to avoid this problem is to put a “lock” around the key in question, preventing any other processes from accessing or writing to the counter. This would quickly become a major performance bottleneck, and does not scale well, particularly when using remote servers like Redis as the backing datastore.

A better approach is to use a “set-then-get” mindset, relying on atomic operators that implement locks in a very performant fashion, allowing you to quickly increment and check counter values without letting the atomic operations get in the way.


Optimizing for Performance

The other disadvantage of using a centralized data store is increased latency when checking on the rate limit counters. Unfortunately, even checking a fast data store like Redis would result in milliseconds of additional latency for every request.

In order to make these rate limit determinations with minimal latency, it’s necessary to make checks locally in memory. This can be done by relaxing the rate check conditions and using an eventually consistent model. For example, each node can create a data sync cycle that will synchronize with the centralized data store. Each node periodically pushes a counter increment for each consumer and window it saw to the datastore, which will atomically update the values. The node can then retrieve the updated values to update it’s in-memory version. This cycle of converge → diverge → reconverge among nodes in the cluster is eventually consistent.



The periodic rate at which nodes converge should be configurable. Shorter sync intervals will result in less divergence of data points when traffic is spread across multiple nodes in the cluster (e.g., when sitting behind a round robin balancer), whereas longer sync intervals put less read/write pressure on the datastore, and less overhead on each node to fetch new synced values.


Quickly Set Up Rate Limiting with Kong


Kong is an open source API gateway that makes it very easy to build scalable services with rate limiting. It’s used by over 300,000 active instances globally. It scales perfectly from single Kong nodes to massive, globe-spanning Kong clusters.

Kong sits in front of your APIs and is the main entry-point to your upstream APIs. While processing the request and the response, Kong will execute any plugin that you have decided to add to the API.



Kong’s rate limiting plug-in is highly configurable. It offers flexibility to define multiple rate limit windows and rates for each API and consumer. It includes support for local memory, Redis, Postgres, and Cassandra backing datastores. It also offers a variety of data synchronization options including synchronous and eventually consistent models.

You can quickly install Kong on one of your dev machines to test it out. My favorite way to get started is to use the AWS cloud formation template since I get a pre-configured dev machine in just a few clicks. Just choose one of the HVM options, and set your instance sizes to use t2.micro as these are affordable for testing. Then ssh into a command line on your new instance for the next step.


Adding an API on Kong

The next step is adding an API on Kong using Kong’s admin API. We will use httpbin as our example, which is a free testing service for APIs. The get URL will mirror back my request data as JSON. We also assume Kong is running on the local system at the default ports.

Now Kong is aware that every request sent to “/test” should be proxied to httpbin. We can make the following request to Kong on its proxy port to test it:

It’s alive! The request has been received by Kong and proxied to httpbin, which has mirrored back the headers for my request and my origin IP address.

Adding Basic Rate-Limiting

Let’s go ahead and protect it from an excessive number of requests by adding the rate-limiting functionality using the community edition Rate-Limiting plugin, with a limit of 5 requests per minute from every consumer:

If we now make more than 5 requests, Kong will respond with the following error message:

Looking good! We have added an API on Kong, and we added rate-limiting in just two HTTP requests to Kong’s admin API.

It defaults to rate limiting by IP address using fixed windows, and synchronizes across all nodes in your cluster using your default datastore. For other options, including rate limiting per consumer or using another datastore like Redis, please see the documentation.


Better Performance with Kong Enterprise Edition


The Enterprise edition of rate limiting adds support for the sliding window algorithm for better control and performance. The sliding window prevents your API from being overloaded near window boundaries, as explained in the sections above. For low latency, it uses an in-memory table of the counters and can synchronize across the cluster using asynchronous or synchronous updates. This gives the latency of local thresholds, and is scalable across your entire cluster.

The first step is to install the Enterprise edition of Kong. Then, you can configure the rate limit, the window size in seconds, and how often to sync the counter values. It’s really easy to use, and you can get this powerful control with a simple API call:

The enterprise edition also adds support for Redis Sentinel, which makes Redis highly available and more fault tolerant. You can read more in the Enterprise rate limiting plugin documentation.

Other features include an admin GUI, more security features like role based access control, analytics, and professional support. If you’re interested in learning more about the Enterprise edition, just contact Kong’s sales team to request a demo.


Share Post: