Things We've Learned Practicing DR
One of the interesting things about our jobs is that we are paid to be professional paranoids, and disaster recovery is one of those places where it is actually a good thing to be as sick as we are.
Unfortunately for the OpStack partners, we have had to manage through three full facility disasters in our careers. What follows are some lessons we learned from the experience.
Anything you haven't tested won't work. You only learn when you do an actual DR test, and that means a "plugs out" test: you shut down production and see whether you fail over as you're supposed to.
The first test will probably fail. We can pretty much guarantee that the first time you test, you'll find out what you didn't think about.
The second time will go better, but you'll find that your assumptions about synchronization were wrong.
The third time, you've got a shot at getting it right.
And once you’ve failed over successfully, practice it quarterly, or at least semi-annually. Then you have a decent chance of actually keeping your business running by meeting your Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
If you don’t, there will be consequences.
A former CTO and colleague found himself out of work for three years. On 9/11, he had his code and data backed up to tape, with copies at a large commercial DR facility, but recovery from tape had never been tested at the DR site. When it was finally attempted, the tapes proved to be unreadable (presumably due to head misalignment), and restoration of service took three weeks instead of three days.
The losses had to be reported to the SEC as well as to the Fed and the blame fell on IT leadership. If you don’t fully test your DR plan, it is only a business continuity hope and not a plan.
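One way to turn a drill from a hope into a measurement is to record three timestamps and compare them with your RTO and RPO. What follows is only a minimal sketch in Python: the names `DrillResult` and `evaluate_drill`, and all of the times and targets in it, are made up for illustration and do not come from any real incident or tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps captured during one failover drill (hypothetical structure)."""
    outage_start: datetime      # when production was shut down
    service_restored: datetime  # when the DR site was serving users
    last_replicated: datetime   # newest data confirmed present at the DR site


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured downtime and data-loss window with the RTO and RPO."""
    downtime = result.service_restored - result.outage_start
    data_loss_window = result.outage_start - result.last_replicated
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Illustrative numbers only: a drill that restored service in five hours
# against a four-hour RTO, with ten minutes of unreplicated data.
start = datetime(2024, 1, 1, 12, 0)
drill = DrillResult(
    outage_start=start,
    service_restored=start + timedelta(hours=5),
    last_replicated=start - timedelta(minutes=10),
)
report = evaluate_drill(drill, rto=timedelta(hours=4), rpo=timedelta(minutes=15))
print(report)
```

In this made-up drill the RPO is met but the RTO is missed, which is exactly the kind of gap you want to find in a test rather than in a real outage.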
On the positive side, at Credit Suisse, where the systems failed over properly and a non-production environment was available, we found that 9/11 changed the behavior of the markets. Because we were able to take advantage of that and make new releases, we could handle the immediate market volatility of September, October and November of 2001. The bank made an extra billion dollars. That's a lot of money even for Credit Suisse, and they took business away from the rest of the Street: a success story in a very tragic time.
Brilliance isn't always doing something innovative; sometimes it is just doing the hard things correctly. Disaster recovery might not be the most fun or glamorous IT job, but it is one that you need to master. Learn from experts who have been in your shoes before and get a leg up on building an effective DR strategy for your business.
If you would like to discuss this in more detail, please reach out to evan@opstack.com.
We start where you are and get you to where you want to be.