Go Global or Go Home: Monitoring Your Platform From Multiple Locations
Many businesses today aim to scale up and have their product used globally. This brings new challenges for the software platform: can customers access and use the product from different parts of the world? A critical way of tackling this is to build an effective strategy for monitoring availability from different areas of the globe and alerting on failures.

Setting up production availability monitoring in multiple regions might seem pretty straightforward if we’re looking at it strictly from a tooling perspective: Postman, DataDog, New Relic, AWS, Uptrends and Cloudflare (and the list can go on) all offer solutions that are easy to set up, manage and maintain. However, the real challenges lie behind the use of the tool:

- How do we select the regions and locations to monitor?
- Do we want to monitor a specific location (e.g., a city) or broad areas (e.g., countries)?
- What do we consider a failure (e.g., typical HTTP error codes, certain response-time thresholds)?
- Do all the selected areas have the same importance to the business?
- Do all the monitored areas trigger alerts with the same priority?
- How often do we want our platform to be monitored?
- Who owns the actions to fix a failure?
- How do I flag these monitors so they don’t pollute company data?
- How do we design the monitors so that a single failure doesn’t trigger a cascade of alerts?

In this session, I will answer all of the above.
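
To make these questions concrete, here is a minimal Python sketch of how the answers could end up in a monitor definition. It is an illustration under stated assumptions, not the setup of any tool named above: `RegionMonitor`, `CheckResult`, `failure_reason`, `build_alerts`, the region names, the thresholds and the `X-Synthetic-Monitor` header are all hypothetical, and the sketch expects exactly one result per monitored region.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical header a probe could send so that synthetic traffic can be
# filtered out of business analytics. The name is an assumption, not a standard.
SYNTHETIC_HEADERS = {"X-Synthetic-Monitor": "true"}


@dataclass(frozen=True)
class RegionMonitor:
    region: str          # broad area or specific city, e.g. "eu-west"
    priority: int        # 1 = page on-call, 2 = ticket, 3 = log only
    interval_s: int      # how often the check runs
    max_latency_ms: int  # response-time threshold for this region
    owner: str           # team that acts on a failure


@dataclass(frozen=True)
class CheckResult:
    region: str
    status_code: int
    latency_ms: float


def failure_reason(monitor: RegionMonitor, result: CheckResult) -> Optional[str]:
    """Return why a check counts as a failure, or None if it passed."""
    if result.status_code >= 400:
        return f"HTTP {result.status_code}"
    if result.latency_ms > monitor.max_latency_ms:
        return f"latency {result.latency_ms:.0f} ms exceeds {monitor.max_latency_ms} ms"
    return None


def build_alerts(monitors: List[RegionMonitor], results: List[CheckResult]) -> List[str]:
    """Turn one round of check results into alerts without cascading."""
    by_region = {m.region: m for m in monitors}
    failures = []
    for result in results:
        monitor = by_region[result.region]
        reason = failure_reason(monitor, result)
        if reason is not None:
            failures.append((monitor, reason))
    if not failures:
        return []
    # If every region fails at once, assume a single global outage and
    # send one alert at the highest priority instead of one per region.
    if len(monitors) > 1 and len(failures) == len(monitors):
        top = min(m.priority for m, _ in failures)
        return [f"P{top} GLOBAL: all {len(monitors)} regions failing"]
    failures.sort(key=lambda item: item[0].priority)
    return [f"P{m.priority} {m.region} ({m.owner}): {reason}" for m, reason in failures]


monitors = [
    RegionMonitor("eu-west", 1, 60, 800, "platform"),
    RegionMonitor("ap-south", 2, 300, 1500, "platform"),
    RegionMonitor("sa-east", 3, 300, 2000, "support"),
]
results = [
    CheckResult("eu-west", 200, 120.0),
    CheckResult("ap-south", 503, 90.0),
    CheckResult("sa-east", 200, 2500.0),
]
for alert in build_alerts(monitors, results):
    print(alert)
```

Each field corresponds to one of the questions: `priority` and `owner` capture business importance and ownership, `interval_s` the check frequency, `failure_reason` the definition of a failure, and the global-outage branch in `build_alerts` is one simple way to avoid a cascade of per-region alerts.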