Roll the dice testing was an interesting new concept introduced to me in September at Microsoft Ignite 2016. Ever since, I’ve found myself explaining it to existing and new customers as a fantastic way to perform disaster recovery (DR) testing.
All too often when we think about DR planning, exercising and testing, we choose exactly what we are going to test, which doesn’t accurately reflect a real-world scenario we might need to recover from. As an example, you might have an Exchange DAG set up within your data centre and test that DAG, but what if we lost the data centre? Do we test for that? Do we even consider that a possibility?
If there’s anything that working in a data centre environment has taught me over the years, it is that we are constantly improving resiliency for what we expect to happen, and we find solutions to what has happened in the past. But what do we do when the unexpected happens?
Thinking back over the unexpected outages of the past couple of years, the incident that stands out more than any other is the case of an electrician who was working on a power feed he had isolated, which was then re-enabled without his knowledge. The electrician received a life-threatening shock and the power was immediately isolated, shutting down all systems within the data centre for a significant amount of time whilst emergency services were in attendance, and for even longer whilst an initial investigation into potential wrongdoing was conducted.
In another scenario, we know that data centres have fire suppression to protect crucial equipment, but this is often linked to a shutdown of power and cooling, which in some cases can lead to extensive outages.
So the big question is, how can we possibly provide resiliency in unexpected scenarios such as these? I believe roll the dice testing has a large role to play in determining what to do.
Roll the dice testing starts with drawing out the different elements of your infrastructure into a 2×3 grid, a 2×6 grid and so on, depending on how many components you think are relevant. You can even go further, having one grid for location/department and another grid for sections within that location/department. As an example, you may have one grid for your different global offices and data centres and a second grid for the different racks or suites you operate within a given data centre.
The next step is to grab enough dice to match the size of grid you went for (which is why we work in sixes, in case you hadn’t worked this out!), roll them, and treat the section that comes up as though that part of your company has been completely lost. Now start to work out what systems are down, the DR protocol for recovering them, the expected time for recovery and the impact to the business. Scary? Well, we’re not finished yet. Roll the dice again and you’ve now lost another section 30 minutes after the first. What’s the situation now? You wanted to look at the unexpected: would you expect to lose two parts of your infrastructure at the same time? My guess would be no, but it happens, and those are the events which have the biggest impact.
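For anyone who would rather script the exercise than hunt for dice, here is a minimal Python sketch of the two steps above. It is only an illustration: the function name roll_the_dice, the example grid contents and the gap_minutes parameter are made up for this post, and re-rolling when the same section comes up twice is my own assumption about how to handle a repeat.

```python
import random


def roll_the_dice(grid, rolls=2, gap_minutes=30, rng=None):
    """Pick `rolls` distinct grid sections to treat as completely lost.

    Returns a list of (minutes_after_first_loss, section) tuples.
    All names here are hypothetical, for illustration only.
    """
    rng = rng or random.Random()
    # Flatten the grid so that each cell maps to one face of the dice.
    cells = [cell for row in grid for cell in row]
    if rolls > len(cells):
        raise ValueError("cannot lose more sections than the grid contains")

    lost = []  # indexes of the sections lost so far, in order
    while len(lost) < rolls:
        face = rng.randint(1, len(cells))  # roll: 1..number of cells
        if face - 1 in lost:
            continue  # assumption: re-roll if that section is already lost
        lost.append(face - 1)

    # Each further loss happens `gap_minutes` after the previous one.
    return [(i * gap_minutes, cells[idx]) for i, idx in enumerate(lost)]


# Example 2x3 grid; the section names are placeholders.
grid = [
    ["London office", "New York office", "Data centre A"],
    ["Data centre B", "Rack suite 1", "Rack suite 2"],
]

for minutes, section in roll_the_dice(grid):
    print(f"T+{minutes} min: lost {section}")
```

Each line of output is a loss to walk through with your team: what is down, which DR protocol applies, how long recovery should take and what the impact on the business is.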
So this is, in essence, a simple way to completely randomise your DR testing and get you thinking about the unexpected. The more diverse your testing, the better, and if you find a weakness in your DR testing or protocol then come and speak to us and see how we deliver highly resilient and highly available infrastructure from multiple locations at the same time, up to 3,500 miles apart. Yes, you could theoretically still lose both data centres at the same time; however, the level of risk is substantially lower.
By Matt Parkinson on January 11th, 2017