What is an API outage (and how do you fix one)?
An Application Programming Interface (API) is a software endpoint that allows communication between two separate services. APIs are typically accessed by humans, applications, or even other APIs using the HTTP protocol.
But what happens when there's an API outage and those APIs go offline? For end users, an API outage can be an annoyance. But for businesses, an API outage can be detrimental to revenue, customer retention, and overall brand.
In this post, we'll look at the common reasons for API outages, the effects of such outages, and what steps to take to address them. We'll also introduce Kong Gateway and see how it can ensure API availability.
What is an API outage?
When an endpoint becomes unavailable, indicated by errors like "site does not exist", it's called an API outage.
An API outage usually means a service disruption. Depending on the API's functionality, this disruption can be a minor inconvenience or a business-disrupting catastrophe.
For example, an API outage in Discord may mean the entire service is unavailable. In contrast, an outage of an API for looking up postcodes for an online registration form may only partially affect the application.
How API outages impact businesses and clients
When an API outage happens, it affects several stakeholders. Businesses can lose revenue if customers cannot access their products or services. It can also negatively impact the business's reputation. If a company can't (or doesn't) prevent such outages, its customers will look for other providers.
An outage of public service APIs means users can't access one or more services offered by local or central governments. Similarly, critical systems like patient medical history or automated bank transfers can experience significant delays when their APIs are inaccessible.
API downtime can have a severe financial impact as well. For example, according to Fortune, a global outage in 2021 cost Meta (formerly Facebook) almost $100 million in revenue.
What causes an API outage?
Let's consider some of the most common reasons for API outages.
Infrastructure failure
When resource usage exceeds capacity, servers in an API backend may run slowly or become unresponsive. Individual nodes in a cluster can stop working due to local hardware failure, operating system crashes, application crashes, or loss of network connectivity to the rest of the cluster. When a node fails, clusters without failover capabilities can take the API endpoint offline.
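To illustrate what failover means here, the following is a minimal sketch that assumes a cluster is just a list of node names and a health check is a function returning True or False. The names `pick_healthy_node`, `node-a`, and `node-b` are hypothetical and stand in for whatever a real cluster manager provides.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_node(
    nodes: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first node that passes the health check, or None."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Hypothetical cluster state: node-a has failed, node-b is still serving.
status = {"node-a": False, "node-b": True}
active = pick_healthy_node(["node-a", "node-b"], lambda n: status[n])
print(active)  # node-b
```

Without the fallback to the second node, the failure of node-a alone would take the endpoint offline.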
Container failure
A microservice-based application consists of many loosely coupled, independent pieces of software or services working together to deliver the overall functionality. Typically, microservice APIs are hosted on containers. Containers provide a complete runtime environment for the code hosted on them.
Container orchestration is the layer that streamlines container deployment, management, scaling, and networking. Microservice APIs running on containers without proper orchestration may fail when the containers crash.
A widely adopted container orchestration technology is Kubernetes. APIs can experience outages when pods in a Kubernetes cluster crash and don't come up in another node. Kubernetes pods can crash for many reasons, including the following:
incorrect configuration
out-of-memory events
readiness probe failing
liveness probe check failing
application crashing
Network failure
Networks without redundant paths, connections, subnets, or devices can experience outages. The network architecture for APIs, therefore, should include failover zones, routing mechanisms, and load balancers.
High availability at a network level means the loss of a subnet in one part of the network does not disrupt the API. Client requests are seamlessly routed to a standby subnet with the same service running in it.
Without proper routing configuration, API gateways may not be able to send client requests to the appropriate API instance. An API gateway must employ either client-side or server-side discovery patterns to route requests to available service instances.
APIs may experience disruptions if they are not load-balanced. Load balancing ensures that when multiple instances of the same service are running, client requests are directed to the least busy instance. Without load balancing, one or more instances of a service can become overloaded while others remain idle.
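One common way to find the "least busy" instance is to count in-flight requests per instance, often called least-connections balancing. The sketch below assumes that approach; the class name and the instance names `api-1` and `api-2` are made up for illustration, and a real load balancer would also handle health checks and concurrency.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LeastConnectionsBalancer:
    """Tracks in-flight requests per instance and picks the least busy one."""
    active: Dict[str, int] = field(default_factory=dict)

    def add_instance(self, name: str) -> None:
        self.active.setdefault(name, 0)

    def acquire(self) -> str:
        """Choose an instance for a new request and count it as in flight."""
        if not self.active:
            raise RuntimeError("no instances registered")
        # min() returns the first instance with the fewest in-flight requests.
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name: str) -> None:
        """Mark one request on the given instance as finished."""
        if self.active.get(name, 0) > 0:
            self.active[name] -= 1


balancer = LeastConnectionsBalancer()
for instance in ("api-1", "api-2"):
    balancer.add_instance(instance)

first = balancer.acquire()   # api-1 (both idle, first registered wins)
second = balancer.acquire()  # api-2 (api-1 now has one in-flight request)
balancer.release(first)
third = balancer.acquire()   # api-1 again, since it is idle
```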
Unrestricted client access
An API without any client request rate limit can quickly become overwhelmed when a significantly high number of requests hits the endpoint within a short time. Without restricting client request rates, malicious attackers can take advantage and launch DDoS attacks against an API.
Similarly, too many queries from the same client can cause API endpoints to become unresponsive. Usually, APIs return HTTP error 429 when a client makes too many concurrent queries. Such throttling restrictions apply to all concurrent connections made by the API client; therefore, a 429 response to one method call would result in a delay for all other method calls.
Without any limit on the response rate or the client payload size, API endpoints can become saturated. Attackers can take advantage of these vulnerabilities and send malformed payloads to saturate the service. Using too many concurrent connections can spike the rate of response, adversely affecting the API's performance.
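A request rate limit can be sketched with a token bucket, which is one common algorithm and not necessarily what any particular gateway uses. In this illustrative version, `TokenBucket` and `status_for` are hypothetical names, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def status_for(bucket: TokenBucket) -> int:
    """Map the limiter decision to an HTTP status code."""
    return 200 if bucket.allow() else 429


# One bucket per client: 5 requests per second, bursts of up to 10.
bucket = TokenBucket(rate=5.0, capacity=10)
print(status_for(bucket))  # 200
```

Keeping one bucket per client (for example, keyed by API key) means a single noisy client receives 429 responses without starving everyone else.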
Application failures
Adding new application features or upgrading existing ones in a production environment without testing can introduce bugs or harmful vulnerabilities in third-party libraries, ultimately breaking the API. Sometimes, application misconfiguration can cause outages, too. Other sources of application failure include unhandled exceptions caused by edge cases or invalid inputs, unsupported runtime versions, or third-party component failures.
Security breaches
Cyber attackers can use insecure API endpoints to send dangerous payloads like SQL injections. These payloads can expose or corrupt backend data, eventually breaking the API.
Attackers can also take advantage of poor rate-limiting to initiate DDoS attacks.
An API with poor or missing authentication and authorization can lead to malicious actors gaining access to backend servers, data, and other resources. These attackers can ultimately make the system unavailable.
How to avoid API outages
You can adopt several measures to ensure APIs are always available or at least to minimize the risk of outages.
Use containers and orchestration technologies to protect against hardware failures, rather than running your API code directly on physical or virtual machines. Managed container orchestration services like Amazon EKS or Azure AKS can help in such cases. You can also use service meshes to safeguard against network outages.
Protect your API endpoints using SSL/TLS so that any data sent or received remains encrypted.
Implement a robust authentication and authorization mechanism. These authentication systems typically use OAuth 2.0 for delegated identity checks.
Scan servers for vulnerabilities regularly and patch them with the latest codebase or operating system version. This can reduce the risk of bugs introduced by third-party libraries.
Adopt a canary deployment strategy for rolling out new service features. With canary deployment, only a few instances of an API are exposed to the changed code. If the new code encounters problems, it's rolled back, and the number of impacted clients remains small.
Continuously monitor the status of your API endpoints. This is done most effectively through Real User Monitoring (RUM) or Synthetic Monitoring. IT operations teams can also use APM platforms to get insight into every layer of the API infrastructure. The monitoring should also allow for automatic alerting and notification.
Host APIs behind an API gateway to regulate and protect access to endpoints. A good API gateway can implement client request rate-limiting, request size limiting, load balancing, routing, retries, fallbacks, security, and high availability.
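The canary strategy in the list above can be sketched as a routing decision. This version assumes traffic is split by hashing a client identifier, which keeps each client on the same version between requests; the function name `route_version` and the labels `canary` and `stable` are hypothetical.

```python
import hashlib


def route_version(client_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent` percent of clients to the canary release."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the client ID so the same client always lands in the same bucket.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


# Roll the new code out to about 10% of clients.
clients = [f"client-{i}" for i in range(1000)]
canary_count = sum(route_version(c, 10) == "canary" for c in clients)
print(f"{canary_count} of {len(clients)} clients see the canary")
```

Rolling back is then a matter of setting the percentage to 0, which returns every client to the stable version.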
Despite all your best efforts, APIs might still encounter outages. A well-designed API needs to return a service status code so client applications can handle errors gracefully.
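On the client side, handling errors gracefully often means retrying transient failures with a growing delay and giving up on everything else. The sketch below assumes the set of retryable status codes shown (429, 502, 503, 504) and a hypothetical `send` callable that returns a status code and body; a real client would wrap its HTTP library of choice.

```python
import time
from typing import Callable, Tuple

# Assumption: these statuses indicate a transient condition worth retrying.
RETRYABLE = {429, 502, 503, 504}


def call_with_retries(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send`, retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        # Wait 0.5s, 1s, 2s, ... before the next attempt.
        sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


# Simulated API that recovers on the third call; delays are recorded, not slept.
responses = iter([(503, ""), (429, ""), (200, "ok")])
delays = []
result = call_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)  # (200, 'ok') [0.5, 1.0]
```

Non-retryable responses such as 404 are returned immediately, so the caller can show a meaningful error instead of waiting.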
Preventing API outages with an API gateway
Kong Gateway, the world's fastest and most adopted API gateway, provides an abstraction layer between clients and upstream API services. The API gateway architecture gives you the freedom to upgrade or modify services without affecting clients.
Kong helps overcome API outages with technical features that address many of the issues covered above. It also offers CX support for building deployments that follow best practices for resilience, high availability, and security, along with supporting processes and guidelines. In addition, it provides design-level governance (using Insomnia) for ensuring security and plugin/policy enforcement in a fully automated fashion using APIOps.
Plugins to address API outage issues
Kong has a rich ecosystem of plugins that can address API outage issues. Some of these plugins include:
Canary release plugins enable you to roll out changes to upstream services to a portion of users. This reduces the risk of breaking the original application for all users.
Authentication and security plugins allow you to implement an authentication service for your APIs, and thereby, secure them.
Proxy caching plugins let you create a reverse proxy cache that serves commonly requested responses. This can also keep the number of unnecessary calls made to the API endpoint low.
Rate limiting plugins can limit the number of client requests made per interval period, reducing the possibility of DDoS attacks. Similarly, request size limiting plugins can block malformed requests or requests with large payloads.
Response rate limiting plugins allow you to limit the number of responses sent by the API per minute or second.
The Route By Header plugin routes client requests to the appropriate endpoint based on the request header. This can greatly improve the API's performance.
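To show the idea behind proxy caching from the list above, here is a minimal time-to-live cache. It is a conceptual sketch only and does not reflect how Kong's plugins are implemented; `TTLCache`, `get_or_fetch`, and the `/postcodes/...` path are hypothetical names, and the clock is injectable for testing.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serves a stored response until it is older than `ttl_seconds`."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[str], str]) -> str:
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]  # Cache hit: the upstream API is not called.
        value = fetch(key)    # Cache miss or expired entry: call upstream.
        self.entries[key] = (now, value)
        return value


upstream_calls = []


def fetch_from_upstream(path: str) -> str:
    """Stand-in for a real upstream request; records each call."""
    upstream_calls.append(path)
    return f"response for {path}"


cache = TTLCache(ttl_seconds=30)
cache.get_or_fetch("/postcodes/2000", fetch_from_upstream)
cache.get_or_fetch("/postcodes/2000", fetch_from_upstream)
print(len(upstream_calls))  # 1
```

The second request is answered from the cache, which is how a caching layer reduces load on the upstream service.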
Conclusion
API outages can cause severe problems for businesses, particularly when critical services are involved. In this article, we've covered many of the common reasons behind API outages and how to address them.
No API can be guaranteed to have 100% uptime and no outages whatsoever. However, with well-designed APIs that have high availability and are protected behind a state-of-the-art API gateway like Kong, outages will be kept to an absolute minimum.
To find out more, contact us for a personalized demonstration.