The aim of this article is to show how and why we decided to test our DRP (Disaster Recovery Plan) under actual conditions… inside our production environment.
What is a DRP?
Toucan Toco’s solutions are historically SaaS solutions hosted on bare metal servers.
Of course, as a SaaS vendor we follow the usual best practices for our infrastructure:
- everything is deployed, configured and set up with our Ansible shipping scripts
- all the infrastructure is monitored by our own agents (with an ELK stack, beats, Elastalert) and external services (like StatusCake tests automatically created by our Ansible module)
- each customer’s stack and data are backed up, encrypted with a private GPG key (a different one for each customer) and exported to a dedicated storage server in another datacenter
- we have failovers and load balancing on our main services
- and finally we use tools and scripts to manage and restore our customers’ stacks and data
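To make the backup step concrete, here is a minimal Python sketch of the per-customer flow: archive, encrypt with the customer’s dedicated key, export to the remote storage server. The paths, key ID and hostname are made up for illustration; our real scripts differ.

```python
import shlex

def build_backup_commands(customer, gpg_key_id, storage_host):
    """Build the shell commands that archive, encrypt and ship one
    customer's data; returned as strings so the plan can be reviewed
    (or logged) before execution."""
    archive = f"/tmp/{customer}.tar.gz"
    encrypted = archive + ".gpg"
    return [
        # 1. archive the customer's stack and data
        f"tar czf {archive} /srv/{customer}",
        # 2. encrypt with the customer's dedicated public key
        f"gpg --encrypt --recipient {shlex.quote(gpg_key_id)} --output {encrypted} {archive}",
        # 3. export to the storage server in another datacenter
        f"rsync -a {encrypted} backup@{storage_host}:/backups/{customer}/",
    ]

commands = build_backup_commands("acme", "backup-acme@toucantoco.com", "storage.dc2.example")
for cmd in commands:
    print(cmd)
```

Keeping the commands as data rather than running them inline makes the plan easy to dry-run and audit, one key per customer.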
We are confident all these things work. As time goes on they become more and more robust, because we use them every day. However, we wondered: would we be ready if our server provider’s datacenter exploded?
Even if the probability is pretty low, we need to be ready and know how to react.
Because we are a SaaS solution with customers relying on our product on a daily basis, and because we are committed to an SLA, knowing how to handle the worst is simply mandatory.
The disaster recovery plan (also known as DRP) embodies your operational team’s capacity to manage major issues.
At our scale, having a fully replicated infrastructure over different data centers did not make sense.
We have a practical mindset. If a natural disaster were to happen, we wouldn’t need two functional, identical and up-to-date houses if we were able to rebuild a brand new one in less than 2 minutes.
And since we already had all the tools to restore our clients’ stacks and services (backup, snapshot and migration), it was easy to connect the dots and imagine a DRP.
For a small team like ours, it was an interesting quick win.
This summer we wrote our complete disaster recovery plan. Our DRP is a set of docs, procedures, methods and scripts to recover Toucan Toco’s business after a disaster hits our datacenter.
At the same time, we also bought some spare servers in a different datacenter. They are up and production-ready, and they will be our fallback in case of a major issue.
Why should we test a DRP?
Isn’t it obvious? :D
You need to test it before a disaster happens.
Otherwise it’s just a theory, and we don’t buy into the famous saying: “Don’t worry, it should work.”
If you’re reading this post, you’re probably an IT person and you know Murphy’s Law very well: if something can go wrong… it will.
So you need to test your plan for several reasons:
- to make sure it works ;)
- to adjust it
- to make the team confident about it and avoid the “oopsy! do you know what we need to do if X totally crashes?” symptom
- because it’s part of our tech culture: we want to test everything!
That’s not all.
Inspired by Netflix’s chaos monkey tests and approach, we strongly believe this kind of test should be run for real, in a production context.
It’s like learning how to swim: you can learn the moves outside of the pool, but if you want to prove to yourself you can swim, you need to do it at the deep end of the pool.
How to test a DRP?
Once our DRP was ready, we planned a crash test in our production environment. Only a few people were in on the secret; the rest of the team only knew something was planned, without any details.
Why? To reproduce the context, the “surprise” and the stress of the situation.
However, we targeted a part of the infrastructure with no direct business impact. It was the very first time and we are not totally crazy :P.
We wanted to avoid the “emergency building drill pattern”: those drills are never done the right way. The alarm rings, people grab everything (phones, bags, laptops…), walk out casually while chatting with their colleagues, and finally leave the building… because they know it’s a test.
These drills are not relevant because they are not run under actual conditions, so when a real emergency occurs it’s a huge mess. People are not always hurt or killed by the fire itself, but by panic and stress.
Back to our subject. The main goals of the tests are:
- to confirm our monitoring, scripts and procedures work
- to check how the team reacts, communicates and follows the instructions and procedures during the incident
- to validate that the tech team is able to face anything without my help. I’m the only official SRE/SysAdmin/DevOps/DevOops! (call me whatever you prefer) on the team, and that should not be a problem for Toucan Toco
- to say “we’re ready!” and to inspire trust in our clients
Simulating a real disaster case was pretty simple.
We decided to just drop all the network traffic from our partners, demos and trainings infrastructures.
Let’s see how we’ve done it.
Release the kraken!
Everything is burning and I’m watching the fire!
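To give an idea of the “disaster”, cutting an environment off the network can be sketched like this. The iptables approach and host names are illustrative assumptions, not necessarily what we actually ran:

```python
def build_blackout_plan(hosts):
    """For each targeted host, return the commands to run (e.g. over an
    out-of-band console) to drop all traffic, simulating a dead datacenter."""
    return {
        host: [
            "iptables -I INPUT -j DROP",   # silently drop everything inbound
            "iptables -I OUTPUT -j DROP",  # and everything outbound
        ]
        for host in hosts
    }

# targeted, non business-critical machines (names are made up)
plan = build_blackout_plan(["partners-1", "demos-1", "trainings-1"])
for host, rules in plan.items():
    print(host, rules)
```

Dropping packets (rather than rejecting them) makes the servers look truly gone, which is closer to an exploded datacenter than a clean shutdown.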
As previously explained, the aim was also to validate that the tech team is able to manage the situation without me. So during the incident, I just took notes and checked the following points:
- how much time it took the team to get back to a normal status
- how well the team understood the issue
- how well the monitoring tools, dashboards and logs were used
- how well the team followed the procedures
- how the communication went between the tech team and the rest of Toucan Toco’s teams
- how they validated everything was OK and back to normal
During the simulation, I asked questions to challenge the team’s decisions and choices, and to make them doubt themselves a little… ^^ (otherwise it’s not funny)
Finally, after 19 minutes of downtime, the whole partners, demos and trainings infrastructures were reinstalled on from-scratch servers in another datacenter and restored from the latest backups, using only scripts, with no manual steps and without me.
GG Team!
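In outline, a fully scripted restoration like the one above can be modeled as an ordered plan. The step names below are illustrative, not our actual script names:

```python
def build_restore_plan(customers, new_host):
    """Ordered, fully scriptable steps to rebuild stacks from scratch
    on fresh servers in another datacenter, using only the backups."""
    steps = [f"provision {new_host} with the deployment scripts"]
    for customer in customers:
        steps += [
            f"fetch the latest encrypted backup of {customer} from the storage server",
            f"decrypt the backup of {customer} with its dedicated GPG key",
            f"deploy the stack of {customer} on {new_host} and load the restored data",
        ]
    # a restoration is only done once monitoring agrees
    steps.append("run the monitoring checks to validate everything is back to normal")
    return steps

plan = build_restore_plan(["partners", "demos", "trainings"], "fallback-dc2")
for step in plan:
    print(step)
```

Because every step is a script, the plan is the same whether you restore one environment or the whole infrastructure; only the list of customers and the target host change.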
For a first test, we were pretty satisfied by the result.
Even if it was only a part of the infrastructure, the procedures and the scripts stay the same.
Knowing what we know, if we had to reinstall and restore a complete Toucan Toco infrastructure from nothing (all client stacks, private and public services, CI/CD, monitoring…), it would take us less than 2 hours.
For a small team like ours, that’s not too bad: we know we’re ready and the Toucan Toco team is fully autonomous, able to face a disaster without me.
We now need to improve our DRP, but we know it’s a never-ending job.
Because we did the test in real conditions, we know which parts of our monitoring, logs, scripts and documentation we need to change.
Since that day, we learned where to focus:
- changing some monitoring checks and outputs to help understand the issues faster
- adding some logs and Slack notifications to simplify the tracking of the restoration process
- getting a better automated way to send the notification emails to our partners and customers
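For the Slack notifications item, a minimal sketch using a Slack incoming webhook; the message format is a hypothetical example, and the webhook URL would come from your Slack workspace configuration:

```python
import json
from urllib import request

def build_payload(step, status):
    """Build the Slack message body for one restoration step."""
    return {"text": f"[DRP] {step}: {status}"}

def notify(webhook_url, step, status):
    """POST the message to a Slack incoming webhook (not called here,
    since it needs a real webhook URL)."""
    req = request.Request(
        webhook_url,
        data=json.dumps(build_payload(step, status)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

payload = build_payload("restore demos stack", "done")
print(payload)
```

Posting one message per restoration step gives the whole team a live timeline of the recovery without having to tail logs on the servers.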
So, finally, what’s next?
I will enjoy my holidays because I’m French, and then plan another test to confirm we’ve gotten better :)