Last Update: Sep 04, 2024 | Published: Oct 14, 2013
This article will introduce you to the concepts behind a disaster recovery (DR) solution. Once seen as something that only Fortune 500 enterprises did, DR has been democratized by virtualization and third-party software vendors. Furthermore, Microsoft offered an amazing solution in the form of Hyper-V Replica, making DR replication possible for small businesses, solving problems for large enterprises, and introducing a new business opportunity for service providers. But you’ll need to learn to walk before you run. You’ll need to understand what DR is, along with its concepts and terminology, before you start learning about one of the most popular features in Hyper-V. Read on for an overview of disaster recovery and what it means for you and your business.
Disasters happen more often than one might think. Some, like Hurricanes Sandy or Katrina, make headlines, and the chaos that they create is obvious and widely felt. Tornadoes or floods might destroy part of a small town with barely a mention in the news, but the damage caused to personal lives and businesses is no less real and devastating.
Life does not stop with a disaster. Shareholders, employees, customers, partners, and the community will depend on those businesses once the initial effects are dealt with. A few enterprises, such as stock markets, will need to have zero downtime. Some businesses can survive a few minutes of an outage, and most can survive a few hours or even a few days. But one thing is certain: No modern business can survive the loss of the IT infrastructure that provides it with its data (customer transactions and financial records), its processes (applications and services), and its availability (client computers and/or remote access).
Virtualization, such as Hyper-V, is a superb enabler for disaster recovery. Without virtualization we have to replicate databases, files, configurations, application installations, and so on. The complexity is incredible – so much so that designing a reliable DR architecture was nearly impossible, and testing completely was impossible without affecting those same production systems we want to protect. Virtualization encapsulates our services (operating systems, applications, configurations, and data) into virtual machines. Virtual machines are just files, and files are easy to replicate.
There are a few terms that you should become familiar with when discussing DR.
Those who are new to the concepts of DR often mix up the roles of backup/restore and disaster recovery. The function of backup is to archive content in an offline store. The data or virtual machines can be restored, with some effort and some delay, to the original or an alternative location. Restoring from backup and using the restored data as the primary content always involves some data loss, because everything written since the last backup is gone. The time needed to restore the business could be significant, and the data loss could be huge, from half a day to a week, depending on when backup media were last sent off-site.
The role of disaster recovery is to replicate data or virtual machines to a secondary (or DR) site on a regular basis. The business can quickly bring its services online in the secondary site in the event of a disaster with minimal or even no data loss.
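To make the difference concrete, here is a minimal sketch of the data-loss window, which is the quantity an RPO target puts a bound on. The function name and all timestamps are hypothetical and chosen only for illustration. The sketch assumes that everything written after the last off-site copy is lost.

```python
from datetime import datetime, timedelta


def data_loss_window(last_copy: datetime, disaster: datetime) -> timedelta:
    """Return how much recent work is lost: the gap between the last
    off-site copy and the moment of the disaster."""
    if disaster < last_copy:
        raise ValueError("disaster time precedes the last copy")
    return disaster - last_copy


# Hypothetical timestamps, for illustration only.
disaster = datetime(2024, 9, 4, 14, 0)
weekly_tape = datetime(2024, 8, 30, 18, 0)   # tape sent off-site days earlier
replica_sync = datetime(2024, 9, 4, 13, 55)  # last replication cycle completed

tape_loss = data_loss_window(weekly_tape, disaster)
replica_loss = data_loss_window(replica_sync, disaster)

print(f"Restore from tape loses:  {tape_loss}")
print(f"Fail over to replica loses: {replica_loss}")
```

With these made-up numbers the tape restore loses several days of work, while the replica loses only the minutes since its last cycle. Real figures depend entirely on your own backup and replication schedules.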
When comparing the functions of IT, DR is seen as being either hot or warm, and backup is seen as a cold copy. Ask an administrator to restore a business critical application from backup and they’ll shiver with dread. Does the entire business really want to rely on a three-cent roller in an LTO tape or a backup solution that even the IT department fears? That’s why we should use:
Backup and DR Replication
Every business owner and CIO will say that they want an RTO and RPO of zero seconds. This is possible, in theory. However, they usually change their minds when presented with a proposal that estimates the cost of such an undertaking, and their real needs are soon revealed. Ideally, RTO and RPO are zero, but the cost rises exponentially the closer you get to zero. In reality, most enterprises would be delighted to have an RPO/RTO of less than an hour.
Synchronous replication can be a critical piece of achieving a zero-second RPO. There are a few problems with this:
On the other hand, asynchronous replication offers:
Think back: What were the worst days of your career in IT? There’s a good chance that one of them was a time when some essential database was missing or corrupted and you needed to restore it from backup. Your manager was standing behind you with a phone in his or her hand, and you could hear an executive screaming for an update on why the CRM (or similar) system was offline. Am I close?
Now imagine that a fire has destroyed the office, taking down your computer room, or your data center was flattened by an earthquake. What do you think that day will be like? It won’t be a walk in the park, that’s for certain!
Three things make for a BCP that you can depend upon: simplification, practice, and automation. I’ll go through each one.
This is the beauty of virtualization. Virtual machines are simple because they are files. Or at least, that’s the goal. Unfortunately, some people did not get the memo, and they have continued to deploy passthrough disks in their Hyper-V virtual machines. These raw partitions that are presented to virtual machines are inflexible and create complexity. What we want are files, like VHDX files, that are easy to replicate at the storage or host level. At this point in time, it is safe to say that anyone who is deploying passthrough disks is out of touch with Hyper-V and should seek education.
Simplicity breeds success in DR. Virtualize everything you possibly can. Windows Server 2012 Hyper-V made that easier thanks to the possibility of virtual machines with 64 virtual processors, 1 TB RAM, and lots of 64 TB VHDX files.
That’s the technology end, but you will find complexity in the human element of the BCP. This is more of a business and company politics issue, but every decision must be clearly binary, every process must be documented, and every communication must be clear. Any time you hear “We should probably let Bob in Accounts know X because he might get upset,” raise your hand, profess your willingness to do what the boss wants, and explain how the added complexity could affect the success of the BCP.
Why do (American) football teams train so much? It’s because they play a complex sport with many moving pieces and no time for communication. That sounds like a disaster to me! You cannot know if your BCP processes will work or if the DR replication system functions unless you test them. Ideally you will do this on a regular basis and involve all of the possible players who could be engaged when the real thing happens. This will mean rotating team members, not just the seniors, and getting executives involved, because they will play a significant role during that dreaded day if it happens.
A good DR solution will allow you to test failover of your services to the secondary site without affecting production systems. This will allow you to verify that data is being replicated, confirm that services will start up, and measure how long the BCP will really take to complete.
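As a rough illustration of measuring a drill, the sketch below times each step of a test failover and compares the total with an RTO target. Everything here is an assumption made for the example: the `run_drill` helper, the step names, and the one-hour target are hypothetical, and the placeholder steps do nothing. In a real drill each step would call whatever tooling your DR solution provides.

```python
from __future__ import annotations

import time
from typing import Callable


def run_drill(steps: list[tuple[str, Callable[[], None]]],
              rto_seconds: float) -> dict:
    """Run each drill step in order, record how long it took, and
    report whether the total stayed within the RTO target."""
    timings: dict[str, float] = {}
    for name, action in steps:
        started = time.perf_counter()
        action()
        timings[name] = time.perf_counter() - started
    total = sum(timings.values())
    return {"timings": timings, "total": total,
            "within_rto": total <= rto_seconds}


# Placeholder steps (hypothetical): replace each lambda with a real check.
drill_steps = [
    ("verify replication is current", lambda: None),
    ("start test failover", lambda: None),
    ("confirm services respond", lambda: None),
]

result = run_drill(drill_steps, rto_seconds=3600)
for step, seconds in result["timings"].items():
    print(f"{step}: {seconds:.3f}s")
print("Within RTO target:", result["within_rto"])
```

Keeping the timings from each dry run gives you a record of whether the plan is getting faster or slower as processes are tuned.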
The BCP is a living document. Processes will be tuned, replaced, and rewritten based on your practice experience. People will be added or removed. And hopefully, with enough dry runs, you will have BCP veterans on board should a disaster really occur.
Your worst day of practice will be better than your best day in a disaster. As I just said, practice is essential to success, but people will be stressed when the real thing happens. Parents will be worried about children, people will be worried about spouses, transport will be chaos, and executives will be stressing out the IT staff that they suddenly realized the business depends upon.
Any BCP that relies on huge amounts of manual effort is doomed to fail. What we need is orchestration. For example:
There is no point in starting up web servers before anything else because those tiers in the services are not ready yet. Orchestration allows us to:
The role of the humans is:
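To make the earlier point about startup order concrete (web servers should not start before the tiers they rely on), here is a minimal sketch of dependency-ordered startup. The tier names and their dependencies are hypothetical, and the ordering uses Python's standard `graphlib` module rather than any Hyper-V feature. It illustrates the idea of orchestration and is not a description of how any particular product implements it.

```python
from graphlib import TopologicalSorter

# Hypothetical service tiers: each key lists the tiers it depends on.
dependencies = {
    "domain_controller": set(),
    "database": {"domain_controller"},
    "app_server": {"database"},
    "web_server": {"app_server"},
}


def startup_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every tier starts only after the
    tiers it depends on. Raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = startup_order(dependencies)
for position, tier in enumerate(order, start=1):
    print(f"{position}. start {tier}")
```

Declaring the dependencies as data, rather than as a checklist someone follows under stress, is what lets the ordering be repeated the same way in every drill.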
That’s disaster recovery in a 2,000-word nutshell! This is a huge topic, and it’s why Hyper-V Replica created so much interest. Over the next few weeks I will spend some time introducing you to Hyper-V Replica and making you an expert in it.