Reliability, performance and security are huge factors to consider when developing an online product.
They can determine the success or failure of your app – what good is all that development work if the app just gets knocked offline? As challenges grow in complexity, software development brings new demands.
What is an SRE and what value do they bring to a software development project?
The demands include reliability and availability – the two can be covered by a person or team who configure self-healing systems and resolve warnings before they become fires.
But there is a certain type of engineer who, using SRE methodology and DevOps best practices, can predict these problems and stop them from happening: a Site Reliability Engineer.
What is a Site Reliability Engineer?
A ‘Site Reliability Engineer’, or SRE, is an engineer or developer who thinks ahead about automation, infrastructure and application scaling, monitoring, security and disaster recovery for your application or website.
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function.”
– Ben Treynor Sloss, Vice President, Google Engineering, Founder of Google SRE
They ensure that the software and hardware that make your app accessible to the world perform at their best while being secure, reliable and ready to scale during peak traffic times. SRE is a relatively new role, introduced by Google in response to the unique challenges the company faced because of its incredible scale and a lack of business perspective among engineers (e.g. network, software development).
In a world of changing technology, new demands, growing complexity, and insane amounts of data, dedicated people are required to keep the show running and to prepare for the tough times.
This is where the SRE motto “hope is not a strategy” comes into play.
An SRE’s job is to plan and strategize before a crisis happens – they do everything to avoid it and, should the worst happen, to minimize its possible impact on your users and revenue.
What is a Site Reliability Engineer's job?
In a nutshell, a Site Reliability Engineer predicts problems and resolves them before they happen.
Imagine that you’re running an e-commerce website whose traffic increases tenfold before Christmas, Black Friday and Valentine’s Day – an SRE ensures that your servers and the applications running on them will scale appropriately so that customers do not experience slowdowns or dropped connections, and you do not lose out on revenue. An SRE also designs and implements recovery and standard operating procedures (SOPs) for when the unexpected actually happens.
The same goes for businesses like flight booking, event ticket sales, news websites and any other businesses whose Internet traffic fluctuates significantly.
More extreme examples are mission-critical systems – such as medical, flight-control or banking software. Outages in these industries are completely unacceptable and could carry very serious consequences – it’s an SRE’s job to make sure these problems don’t happen.
The security aspect of reliability is often neglected, or handled without taking into account the business perspective or how likely a given scenario actually is for the particular system or company. An SRE knows how to properly implement security procedures and what actions should be taken during an attack and after such an incident.
Of course, you don’t have to be running NORAD missile control to worry about site availability.
Very often a website’s deployment process doesn’t take rollbacks into account or is coded poorly – but it’s important to keep serving customers even during deployment. Sometimes the service architecture can’t keep up with the business requirements.
Whatever the case may be, site traffic is also an important factor – some websites receive little traffic throughout the year and see a 100x spike at the end of it.
This often means that systems are sized for this ‘peak’, and you end up overspending for the rest of the year. Or you prepare for the median, only to get hit with high volume all of a sudden.
In both cases, an SRE can help by tailoring your systems to react appropriately and intuitively.
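The trade-off between sizing for the peak and scaling with demand can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name `yearly_cost_cents`, the instance counts, and the hourly price are all assumptions invented for the example, not figures from any real provider.

```python
from typing import Sequence

HOURS_PER_MONTH = 730  # rough average, assumed for simplicity


def yearly_cost_cents(instances_per_month: Sequence[int],
                      hourly_rate_cents: int) -> int:
    """Total yearly cost in cents for the given monthly instance counts."""
    return sum(n * HOURS_PER_MONTH * hourly_rate_cents
               for n in instances_per_month)


# Hypothetical demand: 2 instances for eleven months, then a 100x spike.
demand = [2] * 11 + [200]
rate_cents = 10  # assumed price: $0.10 per instance-hour

# Option 1: size for the peak and keep that capacity all year.
static_peak = yearly_cost_cents([max(demand)] * 12, rate_cents)

# Option 2: run only what each month actually needs.
elastic = yearly_cost_cents(demand, rate_cents)

print(f"Sized for peak all year: ${static_peak / 100:,.2f}")
print(f"Scaled with demand:      ${elastic / 100:,.2f}")
```

Under these made-up numbers, the always-at-peak setup costs more than ten times as much as the one that follows demand. Real pricing and scaling delays will change the exact ratio, but the shape of the problem stays the same.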
How do they do it?
The job of an SRE is organized around three phases: planning, launch, and maintenance.
Even before a project starts, an SRE prepares for the future by:
Gathering business requirements
Analysing risk factors
Calculating possible traffic surges
Estimating budget requirements for handling potential breakdowns
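The traffic-surge calculation in the list above can be sketched in a few lines. This is a minimal illustration rather than a real capacity-planning tool: the function `instances_needed`, the per-instance capacity, and the headroom factor are hypothetical values chosen for the example.

```python
import math


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Return how many instances are needed to serve the given load.

    All parameters are hypothetical: capacity_per_instance is how many
    requests per second one instance can handle, and headroom is the
    fraction of each instance kept free for sudden bursts.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_capacity = capacity_per_instance * (1 - headroom)
    # Always keep at least one instance running, even with no traffic.
    return max(1, math.ceil(requests_per_second / usable_capacity))


normal_load = 200.0            # assumed everyday traffic, requests/second
peak_load = normal_load * 10   # a tenfold seasonal increase

print("Normal:", instances_needed(normal_load, capacity_per_instance=100.0))
print("Peak:  ", instances_needed(peak_load, capacity_per_instance=100.0))
```

Keeping some headroom means the estimate errs on the side of too many instances rather than too few, which is usually the cheaper mistake during a sales peak.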
When the project launches, an SRE will not only contribute by designing a resilient infrastructure that meets all your business and reliability requirements – they will also show your development team how to implement it correctly.
An SRE’s job never really ends – when your site goes live, they will continue to monitor the scalability of the app 24/7/365 and suggest solutions if changes are needed.
For example, if your traffic grows at an unexpectedly high rate, a new infrastructure strategy is required. Site Reliability Engineers also take care of automation and backups – making sure your developers can concentrate on creative work and that no data is lost during an outage.
What skills will an SRE bring to your team?
You can think of an SRE like a trained hostage negotiator walking into a crisis situation – they not only have the hard skills to tackle the problem but can remain cool throughout.
This is important because during a crisis, blame rarely helps – the focus should remain on the problem and finding the most suitable solution, fast. SREs need a very strong mix of hard coding skills and infrastructure knowledge.
When everything is falling apart – you want someone who can connect the dots and troubleshoot the problem. That’s why an SRE’s role is a combination of developer, DevOps engineer, and systems administrator – they focus on:
Infrastructure
Business
Operations
Scalability
Reliability
Why should I hire an SRE?
Here are four reasons why you should hire a Site Reliability Engineer:
Minimize site downtime – crucial in a world where customers expect 100% availability with online services.
Estimate and mitigate risks – some problems can’t be avoided but planning can stop disaster from striking.
Faster development – SREs bring automation and accurate resource sharing, letting your developers work quicker.
Money saved – pay only for what you use, increase sales at peak service times, and avoid the revenue lost to service downtime.
Imagine you have a site where every second of downtime costs you $10,000 and you serve millions of customers – you need someone that can help you serve each client at any time, without fail.
An SRE’s job is to make sure your site is online at all times.
In fact, your site will be so resilient that you could literally run around the datacenter pushing power buttons and the system would work around it. Great preparation for “bring your kids to work day”!
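To see why every fraction of a percent of availability matters, the $10,000-per-second figure above can be turned into a yearly estimate. The sketch below is a rough illustration only: the availability levels and the helper `yearly_downtime_seconds` are assumptions chosen for the example, and it treats downtime cost as perfectly linear.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year
COST_PER_SECOND = 10_000               # dollars, the figure used in the text


def yearly_downtime_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year implied by an availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


# Hypothetical availability levels to compare.
for availability in (99.0, 99.9, 99.99):
    downtime = yearly_downtime_seconds(availability)
    cost = downtime * COST_PER_SECOND
    print(f"{availability}% available: about {downtime:,.0f} s of downtime, "
          f"roughly ${cost:,.0f} lost per year")
```

Even at 99.9% availability, the downtime adds up to several hours a year, which at this price per second is a very large bill.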
Is Site Reliability really a big deal?
Yes. We will keep the answer short on that one.
With sites like Netflix serving 86 million customers over 1000 devices, the need for site reliability has never been greater – and all the trends show 2019 is going to be another year plagued with large-scale attacks. Google currently has over 2000 SREs on board and is continuously hiring.
If you’re serious about scaling, automation, and reliability – SRE practices are the way to achieve them.
The challenges faced by a Site Reliability Engineer
Being an SRE is not an easy job, especially if done well. Modern software projects are growing exponentially more complex, so there is always more to learn and take care of. An SRE always needs to be two steps ahead – they need to be well-versed not just in what’s going on now, but in what’s going to happen in the future.
Keeping up with the progress of technology and the demands of the business at the same time is a difficult task, which is why SREs are extremely high-value team members.
What does the future look like for SREs?
With the demand for and complexity of online applications rising, the need for SREs will continue to grow.
Site reliability engineers face a unique set of challenges – not only are they tasked with keeping the current infrastructure up to date, they also need to stay two steps ahead of the game. This makes them highly valuable team members, as they keep a finger on the pulse of technology changes.
More and more companies are developing web applications and mobile apps, while the darknet continues to grow in popularity. These two factors contribute to a rising demand for site reliability engineers.
Summary
A Site Reliability Engineer is a member of your team who takes care of infrastructure, scaling, automation, tooling and backups, and who reduces risk while enabling faster and safer deployments – all of this goes on before, during, and after the launch of your project.
An SRE makes sure that everything runs smoothly and, when it doesn’t, provides future-proof solutions.
An SRE can add a lot of value to your project and save you money in the long term by preempting problems within your application or infrastructure.