How We Used the Incident Management Process to Minimize the Impact of the HubSpot Outage
Within a business, many things can go wrong unexpectedly, and you will often have little or no control over them. The more complex your business is, the more providers you probably rely on. This means that your website could go down, your communication channel might stop working, or you may experience a data leak.
You can’t be 100-percent ready for every possible scenario. However, you can develop a process for mitigating the negative impact of any unexpected event that comes your way. We’ve just battle-tested our processes.
Last Thursday, HubSpot experienced an outage that affected many websites hosted on their servers, including the netguru.co domain. Having our website down for a few hours wasn’t the only challenge we faced that day. The default message that visitors saw when entering our site said that our account had expired. Our first reaction was to check the last paid invoice, but everything was fine on the financial side of things. Such a false message wasn’t something we wanted to communicate to our audience. The key was to react quickly to this unusual problem and mitigate its detrimental impact. Here’s how we did it.
We learned that the Netguru website was down within one minute of it happening. It took us only seven more minutes to identify the root of all the trouble, which, as it turned out, lay on our provider’s side. At that point, there was not much we could do to bring the website back. The HubSpot engineering team began working on resolving the problem right after their internal monitoring systems alerted them to the outages within the content system. We weren’t sure how long that could take, so we needed to take steps to change the message displayed during the downtime.
Solution and Mitigation
After a quick status meeting that involved people from a number of teams (Site Reliability Engineering, Quality Assurance, Project Management, and Marketing), we decided to redirect the traffic to a static landing page so that visitors would no longer see the misleading HubSpot message.
The marketing team promptly wrote the copy that we later put on the static page.
The engineering team prepared Heroku servers, and, 73 minutes after we spotted the problem, the “we apologize” page was up and running.
We set the page’s HTTP response code to 503 instead of 200 to prevent the outage from affecting our SEO ranking: a 503 tells search crawlers the site is temporarily unavailable, so they don’t index the apology page as our real content.
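The static-page setup described above can be sketched as a tiny server. This is a minimal illustration, not our actual Heroku configuration; the page copy and the Retry-After value are hypothetical. It shows the key detail: serving the maintenance page with a 503 status rather than a 200.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical maintenance copy; in our case the marketing team
# wrote the actual text for the static landing page.
MAINTENANCE_HTML = b"""<html><body>
<h1>We apologize &mdash; we'll be back shortly</h1>
<p>Our website is temporarily unavailable. Please check back soon.</p>
</body></html>"""

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 Service Unavailable signals a temporary condition to crawlers,
        # so the outage page is not indexed in place of the real site.
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Retry-After", "3600")  # hint: try again in an hour
        self.send_header("Content-Length", str(len(MAINTENANCE_HTML)))
        self.end_headers()
        self.wfile.write(MAINTENANCE_HTML)

# To serve during an outage (blocking call):
# HTTPServer(("0.0.0.0", 8000), MaintenanceHandler).serve_forever()
```

Any reverse proxy or PaaS can achieve the same effect; the essential point is returning 503 with a Retry-After header instead of a 200.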
We also added a Drift chat widget to the page so that our clients and other visitors could easily reach out and get any help or advice they needed. We set up our temporary solution in less than two hours, whereas we had to wait another two hours for HubSpot to get their system up and running.
Once HubSpot told us they had resolved the issue and our website should be up, we waited another 1.5 hours before moving from the static page back to our domain. Firstly, we still wanted to test whether everything really worked fine, and secondly, we didn’t want a sudden traffic spike to cause another outage. In less than five hours, the Netguru domain was back up and running perfectly fine.
Although it was an unusual situation for us, our team was able to react quickly and work together to find the optimal solution. Monitoring and alerting systems performed as expected. We were aware of the problem the moment it started. Our internal status page kept everyone in the loop about the ongoing progress in resolving the issue. Finally, gathering experts from different departments for a meeting helped us quickly develop our response to the problem. There was no sign of panic or uncertainty about our plan.
Key Takeaways for the Future
We believe that we handled the issue well. The root of the problem was on the provider’s side, so our ability to fix it was limited. Still, we managed to mitigate the negative impact of the outage and prevented our clients and prospects from seeing an unfortunate message that looked unprofessional and didn’t evoke positive connotations. That said, there are still some areas where we think we could improve in the future, so it will take even less time to respond:
creating a clear set of rules for when to move to a static page;
creating a playbook with the details on how to switch the domain to a static page and how to roll back;
ensuring that our on-call engineers have the proper credentials to execute the playbook without escalating.
This incident was a good test of whether our team can react to unexpected problems with our website. Even though there is still some room for improvement, we believe that we tackled the issue successfully.