How to Build a Cloud Infrastructure That Prevents Outages

Businesses thrive on the availability of functional cloud infrastructures that allow them to deliver a variety of cloud services to customers and end-users in a continuous, uninterrupted manner.
For modern organizations with robust data centers, even a single cloud outage translates into significant financial losses.
Yet, cloud infrastructure outages continue to affect the business continuity of multiple organizations, including both cloud vendors and their biggest clients. Earlier this year, we witnessed a series of cloud outages that hit Amazon, Twitch, Reddit, eBay, PayPal, as well as several news and government websites.
The problem was traced back to Fastly, a cloud platform that helps developers extend their cloud infrastructures. While the cloud outage was brief, it affected the flow of millions of dollars in revenue and added to the disaster recovery bills. However, the cost isn’t just financial - cloud outages frustrate end users, affect staff productivity, increase the risk of data loss and damage your brand’s reputation.
In this episode of Disruption Talks, we invited Radosław Kubryń, Cloud Engineering Manager at Netguru, to explain what happened. Radosław gives his take on what went wrong at Fastly and the lessons we can all learn from it to build cloud infrastructures that prevent outages.
What is a web outage?
A cloud outage simply refers to the period of time when a website or cloud service becomes completely inaccessible to users or is unable to perform as intended. An outage can be planned, e.g. for maintenance, but it can also be unplanned. The latter can incur significant financial losses, especially if the mission-critical applications aren’t brought back to operation quickly.
A 2019 Forrester survey found that companies experience a planned or unplanned cloud outage at least once every quarter. Unplanned downtime occurs more often than planned downtime - almost half of respondents indicated they plan an outage once every three months, but experience an unplanned outage bimonthly. What is more, the survey found that unplanned downtime costs 35% more per minute.
It’s impossible to precisely measure the frequency of outages, their average duration, and their true cost because data is scarce. Statista reports that in 2020 an hour of downtime cost enterprises anywhere between $10k and $5m. A quarter of respondents estimated the hourly cost at $300-400k.
Another study by Dun & Bradstreet found that almost 60% of Fortune 500 companies experience 1.6 hours of downtime per week. That can translate to as much as $46 million per year! Finally, an earlier study by Ponemon found that the maximum cost of a data center outage in 2016 was almost $2.5m and showed that these costs tend to rise year on year.
Multiple factors can contribute to an unforeseen cloud outage. The most common causes include:
- A power outage, which can occur due to a natural disaster or a cooling system failure
- Technical, infrastructure issues, including component failures
- Human error
- Cybersecurity attacks, such as DDoS
- Networking issues
- Failed backups
- Software bugs
According to Radosław Kubryń, the key to mitigating all these risks is a stable infrastructure. He discusses the topic in detail in this Disruption Talks live stream. Read on to find out what it takes to build one.
Filip Sobiecki: Could you explain what Fastly is and why we couldn’t access all those web pages?
Radosław Kubryń: Fastly is a content delivery network, and it’s one of the largest on the internet, along with Akamai and Amazon’s CloudFront.
All of these operators started with the same principle – the internet is faster and more stable if the user can connect to a server that’s physically close to them.
With Fastly, it’s believed that one of the Fastly developers made a mistake in the code and deployed that into production. So, when they updated the settings, it triggered the flaw, which ultimately locked down around 85% of the company’s network.
What was the business impact of the web outage?
I think we can compare it to banks and their infrastructure. If the infrastructure goes down and stays that way for hours, then we potentially lose a huge amount of money. In this case, the estimate is that Amazon may have lost around $7,000 every second it was down, and the outage was around two hours.
That’s why we need to think about a cloud insurance policy. It’s like if you buy a car, get insurance, and then have an accident, it saves you from having to pay out from your own wallet. If you don’t have insurance, you end up losing a lot of money. It’s the same with cloud infrastructure. You need to get some kind of insurance policy.
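To put the figures quoted in this answer into perspective, here is a minimal back-of-the-envelope sketch. It simply multiplies the estimated loss per second by the estimated duration; both inputs are the rough estimates mentioned above, not audited numbers, and the function name is our own.

```python
# Rough estimates quoted in the interview, not audited figures.
LOSS_PER_SECOND_USD = 7_000
OUTAGE_HOURS = 2

SECONDS_PER_HOUR = 3_600


def outage_cost(loss_per_second: float, duration_hours: float) -> float:
    """Return the total revenue lost during an outage of the given length."""
    return loss_per_second * duration_hours * SECONDS_PER_HOUR


total = outage_cost(LOSS_PER_SECOND_USD, OUTAGE_HOURS)
# 7,000 USD/s * 7,200 s = 50,400,000 USD
print(f"Estimated loss: ${total:,.0f}")
```

Even if the real duration or per-second figure were only half as large, the total would still run into the tens of millions of dollars.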
What can a company do to protect itself?
Let’s start with how cloud infrastructure is built because it’s important to understand that first. The best comparison is to think of Lego bricks.
Cloud infrastructure is built using different bricks, and you can add additional bricks to strengthen your building, just like Lego bricks.
The key to a good structure is having a highly available system with a carefully designed mechanism for load balancing. This distributes client requests and also specifies the failover process in the event of a node failure. The benefit is that it can quickly switch to another one for a backup if there’s an outage.
You need to create a good foundation in your building. That’s your infrastructure. The different levels on top of that are the applications and software.
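The load balancing and failover idea described in this answer can be illustrated with a small sketch. It assumes a simple round-robin strategy and a manually set health flag; the `Node` and `RoundRobinBalancer` names are hypothetical, and a production load balancer would run its own health checks and handle far more edge cases.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Node:
    """A backend server; `healthy` would normally be set by a health check."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Distributes requests in turn and skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._ring = cycle(self._nodes)

    def pick(self) -> Node:
        # Try each node at most once per request, then give up.
        for _ in range(len(self._nodes)):
            node = next(self._ring)
            if node.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


nodes = [Node("a"), Node("b"), Node("c")]
balancer = RoundRobinBalancer(nodes)

print([balancer.pick().name for _ in range(3)])  # ['a', 'b', 'c']

# Simulate a node failure: traffic fails over to the remaining nodes.
nodes[1].healthy = False
print([balancer.pick().name for _ in range(4)])  # ['a', 'c', 'a', 'c']
```

The point of the sketch is that clients never need to know a node has failed; the balancer routes around it as long as at least one healthy node remains.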
While a company may not be able to protect itself 100%, what other steps can be taken to help prepare itself?
Yes, you cannot protect yourself 100%, but you can get to about 99%. So, then there’s only a 1% risk of an outage.
I think what we need to do is more plumbing, less firefighting. We need to look for gaps in our infrastructure, in our code and fix them before they become a problem.
In my article, I mentioned that we need to create a high availability infrastructure. That’s the first step. The second step is testing. We need to test our infrastructure and look for gaps. We cannot be sure that our infrastructure is always good because it’s always changing. We need internal and external tests to be sure that any DDoS attack or ransomware cyber-attack cannot cause an outage.
The third step is to use backups. If we have backups, we can be sure that if something happens to our infrastructure or software, we can easily restore it. You need to test those backups as well, though.
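As one illustration of what testing a backup can mean in practice, the sketch below copies a file and confirms that the copy is byte-for-byte identical by comparing SHA-256 checksums. The file names and the `backup_and_verify` helper are hypothetical, and a checksum comparison is only a first step; a fuller test would restore the backup into a separate environment and check that the application actually runs on it.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> bool:
    """Copy `source` into `backup_dir` and check that the copy matches."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / source.name
    shutil.copy2(source, backup)
    return sha256_of(source) == sha256_of(backup)


# Demo with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.db"
    source.write_bytes(b"important records")
    verified = backup_and_verify(source, Path(tmp) / "backups")
    print("backup verified:", verified)
```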
What are the short-term steps that I can take today or in the next month or two?
At the moment, you probably need to check your infrastructure and applications to see if there are potential gaps. To make sure that everything looks good, you will want to do some kind of an audit of security and performance. You’ll also want to repeat this and check again and again.
Netguru provides clients with cloud audits. What would you say are the benefits of such a service?
As DevOps cloud engineers, we check your infrastructure and software, giving you a full report about any potential gaps, performance tests, and so on. In that report, you can find lots of useful information and a summary about your infrastructure and what steps you need to implement to ensure it’s fully secure.
Preventing cloud outages starts with a solid infrastructure
Cloud vendors are required to provide continuous cloud services at all times. Neither cloud providers nor their customers can afford any cloud downtime; these outages simply cost too much. Yet, cloud infrastructure outages still occur even among the biggest cloud providers, including Google Cloud, Microsoft Azure, and AWS, disrupting business continuity and frustrating end users globally.
Poor infrastructure is the root cause of most unplanned cloud outages, but it can be rectified by investing in more resilient data centers. Every business should be proactive about these endeavors and they should go beyond sifting through service level agreements to find a promise of 99.999% availability.
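To see what an availability promise means in practice, it helps to convert the percentage into allowed downtime. The sketch below does that conversion for a 365-day year; the function name and the list of targets are our own choices for illustration, and real service level agreements may measure availability over a different period or exclude planned maintenance.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.1f} minutes of downtime per year")
```

At 99% availability a service can be down for roughly 87.6 hours a year, while 99.999% leaves only about 5.3 minutes, which is why a figure in a contract is no substitute for infrastructure that can actually deliver it.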
By working closely with their cloud service provider, businesses should ensure the complex system they build has a strategy for eliminating cloud downtime and ensuring the highest availability. Stable cloud infrastructure is crucial - it’s the indispensable foundation for the desired, ongoing performance of software and applications.
Data centers must be prepared to deal with a potential power outage, component failures, or networking issues by ensuring appropriate backup processes as well as disaster recovery procedures. If you are uncertain whether your infrastructure has appropriate mechanisms in place to deal with such issues, you can contact your IT provider to schedule an audit of your entire cloud infrastructure.
This discussion is part of our Disruption Talks recordings, where we invite experts to share their insights on winning innovation strategies, the next generation of disruptors, and scaling digital products.