How to Build a Cloud Infrastructure That Prevents Outages

Filip Sobiecki

Jun 29, 2021 • 6 min read

In early June 2021, a series of web outages hit some of the biggest websites, including Amazon, Twitch, Reddit, eBay, PayPal, and several news and government websites.

While the outage was brief, it disrupted millions of dollars in revenue for some of the largest online marketplaces.

After investigations, the problem was traced back to Fastly, a cloud platform that helps companies and developers extend their cloud infrastructures.

In this episode of Disruption Talks, we did some digging and invited Radosław Kubryń, Cloud Engineering Manager at Netguru, to explain what happened. Radosław gives his take on what went wrong at Fastly and the important lessons we can all learn from it.

What is a web outage?

A web outage is simply when a website becomes completely inaccessible or is unable to perform as intended. For websites that rely on eCommerce sales, this can have a huge effect on revenue and future business growth, as was the case with the Fastly outages.

Several things can cause a web outage, but the key to avoiding it is a stable infrastructure. To explore this in more detail, Radosław Kubryń gave some solid tips and advice on how to prevent web outages in his recent article, How To Bulletproof Your Company For Web Outage In Three Steps.

He discusses it further in this Disruption Talks live stream. You can read the highlights below or watch the full episode to hear his thoughts.

Filip Sobiecki: Could you explain what Fastly is and why we couldn’t access all those web pages?

Radosław Kubryń: Fastly is a content delivery network, and it’s one of the largest on the internet, along with Akamai and Amazon’s CloudFront.

All of these operators started with the same principle – the internet is faster and more stable if the user can connect to a server that’s physically close to them.

In Fastly’s case, it’s believed that one of its developers made a mistake in the code and deployed it to production. So, when they updated the settings, it triggered the flaw, which ultimately locked down around 85% of the company’s network.

What was the business impact of the web outage?

I think we can compare it to banks and their infrastructure. If the infrastructure goes down and stays that way for hours, then we potentially lose a huge amount of money. In this case, the estimate is that Amazon may have lost around $7,000 for every second it was down, and the outage lasted around two hours.
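To put those two figures together, here is a minimal back-of-the-envelope sketch. It only multiplies the numbers quoted above; both are rough estimates from the interview, and the function name is made up for illustration.

```python
# Figures quoted in the interview; both are rough estimates.
LOSS_PER_SECOND_USD = 7_000
OUTAGE_HOURS = 2


def estimated_loss(loss_per_second: float, hours: float) -> float:
    """Multiply a per-second loss rate by the outage duration in seconds."""
    return loss_per_second * hours * 3600


total = estimated_loss(LOSS_PER_SECOND_USD, OUTAGE_HOURS)
print(f"Estimated loss: ${total:,.0f}")  # prints "Estimated loss: $50,400,000"
```

Under these assumptions, the total comes to roughly $50 million, which is the scale of loss the "insurance policy" idea is meant to guard against.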

That’s why we need to think about a cloud insurance policy. It’s like buying a car: if you get insurance and then have an accident, it saves you from having to pay out of your own pocket. If you don’t have insurance, you end up losing a lot of money. It’s the same with cloud infrastructure. You need to get some kind of insurance policy.

What can a company do to protect itself?

Let’s start with how cloud infrastructure is built because it’s important to understand that first. The best comparison is to think of Lego bricks.

Cloud infrastructure is built using different bricks, and you can add additional bricks to strengthen your building, just like Lego bricks.

The key to a good structure is having a highly available system with a carefully designed mechanism for load balancing. This distributes client requests and also specifies the failover process in the event of a node failure. The benefit is that the system can quickly switch to a backup node if there’s an outage.
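The load balancing and failover idea described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular cloud provider implements it: the class, the node names, and the simulated health check are all hypothetical, and a real health check would probe each node over the network.

```python
from itertools import cycle
from typing import Callable, Iterable


class NoHealthyNodeError(RuntimeError):
    """Raised when every node fails its health check."""


class RoundRobinBalancer:
    """Hypothetical load balancer: rotates through nodes, skipping unhealthy ones."""

    def __init__(self, nodes: Iterable[str], is_healthy: Callable[[str], bool]):
        self._nodes = list(nodes)
        self._is_healthy = is_healthy
        self._rotation = cycle(self._nodes)

    def pick_node(self) -> str:
        # Try each node at most once per request, in round-robin order.
        for _ in range(len(self._nodes)):
            node = next(self._rotation)
            if self._is_healthy(node):
                return node
        raise NoHealthyNodeError("no healthy node available")


# Simulated health state: node-b is down, so its traffic fails over to the others.
down = {"node-b"}
balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"], lambda n: n not in down)

print([balancer.pick_node() for _ in range(4)])
# prints ['node-a', 'node-c', 'node-a', 'node-c']
```

Requests keep being served while node-b is down. An error is raised only when no node passes its health check, which is the situation a highly available design tries to make as unlikely as possible.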

You need to create a good foundation in your building. That’s your infrastructure. The different levels on top of that are the applications and software.

While a company may not be able to protect itself 100%, what other steps can be taken to help prepare itself?

Yes, you cannot protect yourself 100%, but you can get to about 99%. So, then there’s only a 1% risk of an outage.

I think what we need to do is more plumbing, less firefighting. We need to look for gaps in our infrastructure and in our code, and fix them before they become a problem.

In my article, I mentioned that we need to create a high availability infrastructure. That’s the first step. The second step is testing. We need to test our infrastructure and look for gaps. We cannot be sure that our infrastructure is always good because it’s always changing. We need internal and external tests to be sure that any DDoS attack or ransomware cyber-attack cannot cause an outage.

The third step is to use backups. If we have backups, we can be sure that if something happens to our infrastructure or software, we can easily restore it. You need to test those backups as well, though.
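As a rough illustration of what testing a backup can mean, the sketch below copies a file to a backup location, restores it to a scratch directory, and checks that the restored copy matches the original. It assumes a single local file and uses hypothetical function and file names; real backups of databases or whole environments need their own restore drills.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> bool:
    """Copy `source` into `backup_dir`, restore it to a scratch location,
    and confirm the restored copy matches the original byte for byte."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = Path(shutil.copy2(source, backup_dir / source.name))
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(shutil.copy2(backup, Path(scratch) / source.name))
        return sha256_of(restored) == sha256_of(source)


# Demo with a throwaway file; the name and contents are made up.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "settings.conf"
    original.write_text("replicas = 3\n")
    print(backup_and_verify(original, Path(workdir) / "backups"))  # prints True
```

The point of the restore step is that a backup counts as tested only once it has actually been restored and compared against the original.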

What are the short-term steps that I can take today or in the next month or two?

At the moment, you probably need to check your infrastructure and applications to see if there are potential gaps. To make sure that everything looks good, you will want to do some kind of an audit of security and performance. You’ll also want to repeat this and check again and again.

Netguru provides clients with cloud audits. What would you say are the benefits of such a service?

As DevOps cloud engineers, we check your infrastructure and software, giving you a full report about any potential gaps, performance tests, and so on. In that report, you can find lots of useful information and a summary about your infrastructure and what steps you need to implement to ensure it’s fully secure.

This discussion is part of our Disruption Talks recordings, where we invite experts to share their insights on winning innovation strategies, the next generation of disruptors, and scaling digital products. To get unlimited access to this interview and many more, sign up here: www.netguru.com/disruption/talks
