Spring cleaning may be an annual ritual for your home, but it is also a best practice in business. In my area of business, IT operations, one of my favorite places to “clean up” is making sure that our systems meet the availability and recovery expectations of our business. Our clients, and any enterprise for that matter, should consider the same approach. Here are five steps to take:
1. Start by reviewing the service level agreements with your business partners that cover the availability of key systems, as well as the expected recovery point and recovery time objectives in the event of a disaster. If these are already well understood and reviewed periodically with your business partners, start instead by reviewing those expectations with your own teams, so you can be sure they are architecting and implementing with the business needs (and budget) in mind.
2. From an availability standpoint, use your next scheduled downtime window to test the failover and high-availability features supporting your most important business applications. Too often, systems are deployed in a highly available configuration, and then when components fail, they do not fail over as expected. Issues as small as cabling mistakes or hardware driver updates can keep a highly available system from failing over as designed, so test your most critical systems regularly.
3. No one likes to think about disaster recovery, both because it forces us to think about significant events and because we often know we cannot recover as well as we would like. But while a large disaster may be unlikely, small ones happen regularly, and if you do not test, you do not know where you stand. Simulate a few small disasters, and if you haven’t executed a full disaster recovery test in the past year, execute a full end-to-end test. Take special care to check newly deployed or upgraded systems, which may not yet have been through a full disaster recovery test. Compare your results against your stated recovery time and recovery point objectives. And don’t forget to test the methods you use to communicate with your IT team and the rest of your employees.
4. While you are thinking about recovery, it is also a good time to review your backup systems to ensure they are backing up all your critical data. Select a few backup sets and test recovery to confirm that restores work correctly. One of our clients recently experienced a CryptoLocker attack that required the recovery of 125,000 files making up almost 150 GB of data. Because the backups had recently been tested, we knew we could recover the data in less than two hours.
5. Spring cleaning is also a good time to evaluate whether you’re maintaining good IT hygiene. The key tenets of good hygiene are: