Published: May 22, 2023
Most organizations understand the need for a robust disaster recovery (DR) plan, but IT pros should also know that testing these plans is essential. If you don’t implement disaster recovery testing, you won’t know if your plans will work when you need them. And to properly test your DR plans, you may want to break them.
Some notable companies, like Google, are famed for regularly exercising their entire DR procedures – switching their production workloads to a backup infrastructure and then switching them all back again. This ensures that the DR plans work and that they are continually updated.
While that level of testing is beyond most organizations, it does underline the importance some businesses place on testing.
There’s no doubt that regular disaster recovery testing is important. However, one thing that you can always count on is Murphy’s Law: If something can go wrong, it will go wrong.
As an IT pro, you can use DR testing to avoid the biggest and most obvious problems and errors that you might encounter. However, what about those other unexpected things that can potentially cause disaster recovery failures?
While you can’t control everything, the last thing you want to encounter in the midst of a recovery is an unexpected error. Given the situation that prompted the disaster recovery in the first place, it is not unreasonable to expect to run into other problems.
To uncover how your plan will respond to different error conditions, add stress components. Stress testing means testing critical parts of your plan under less-than-ideal conditions, or even in a compromised environment.
The idea is to place unusual stress on critical elements of your plan to discover what unexpected problems or issues might be encountered, thereby enabling you to better prepare for and deal with a wider range of recovery scenarios.
In other disciplines, like manufacturing, this method has been called destructive testing or sometimes destructive physical analysis (DPA).
Destructive testing is used to test and better understand the performance or behavior of a component, material, or machine by determining its specific point of failure.
In this process, the target material is subjected to continuous, specific stresses until it fails. For physical targets, tests and failures are often recorded using high-speed cameras. This type of testing is used for aircraft and automobiles as well as various metals, steels, and construction materials.
The benefit of destructive testing is that it reveals the acceptable operating parameters of your product. It also helps you to understand the conditions that will cause it to fail and those conditions it can withstand.
Applied to software, destructive testing means attempting to cause failure in a controlled manner, both to establish the software’s robustness and to understand the limits within which it will operate in a stable and reliable manner.
For example, while you are testing parts of your disaster recovery plan, consider simulating a network outage, a denial-of-service (DoS) attack, or an intermittent connectivity issue. Alternatively, you could stress your backend storage system or introduce a user error, such as not following the prescribed recovery path.
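As a rough illustration of simulating an intermittent connectivity issue, here is a minimal Python sketch. Everything in it is an assumption made for the example: FlakyNetwork is a hypothetical test double, and restore_chunks stands in for whatever recovery step you actually want to exercise. It is not tied to any real backup or DR product.

```python
import random


class FlakyNetwork:
    """Hypothetical test double that drops a configurable fraction of calls."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so a test run is repeatable

    def send(self, payload):
        # Simulate an intermittent connectivity drop.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("simulated network drop")
        return len(payload)


def restore_chunks(network, chunks, max_retries=3):
    """Hypothetical recovery step: send each chunk, retrying on failure.

    Returns (succeeded, retries_used). The run fails as soon as any
    single chunk exhausts its retries.
    """
    retries_used = 0
    for chunk in chunks:
        for attempt in range(max_retries + 1):
            try:
                network.send(chunk)
                break  # chunk delivered, move on to the next one
            except ConnectionError:
                if attempt == max_retries:
                    return False, retries_used
                retries_used += 1
    return True, retries_used


if __name__ == "__main__":
    chunks = [b"block-%d" % i for i in range(10)]
    for rate in (0.0, 0.3, 0.9):
        ok, retries = restore_chunks(FlakyNetwork(rate, seed=42), chunks)
        print(f"failure_rate={rate}: succeeded={ok}, retries={retries}")
```

Seeding the random generator is a deliberate choice here: when an injected fault does break the recovery step, you can replay exactly the same sequence of drops while you work on a fix.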
The purpose of this type of testing is to establish what levels of interference your DR plans will tolerate and what will cause them to break. While it may seem counterintuitive, breaking your plans can be a good way to make them more predictable and reliable.
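One way to picture "what levels of interference your DR plans will tolerate" is to raise the injected failure rate step by step until the recovery stops succeeding reliably. The sketch below is purely illustrative: run_recovery is a hypothetical simulation of a multi-step recovery, and the 95% success threshold, step count, and retry count are arbitrary assumptions, not recommendations.

```python
import random


def run_recovery(failure_rate, steps=20, max_retries=2, seed=0):
    """Hypothetical DR run: every step must succeed within its retry budget."""
    rng = random.Random(seed)
    for _ in range(steps):
        # A step fails only if the first attempt and every retry hit the fault.
        if all(rng.random() < failure_rate for _ in range(max_retries + 1)):
            return False
    return True


def find_breaking_point(rates, trials=50, required_success=0.95):
    """Return the first failure rate whose success ratio falls below the
    required threshold, or None if the plan tolerates every rate tried."""
    for rate in rates:
        successes = sum(run_recovery(rate, seed=t) for t in range(trials))
        if successes / trials < required_success:
            return rate
    return None


if __name__ == "__main__":
    rates = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    print("Plan starts to break at failure rate:", find_breaking_point(rates))
```

In a real test you would replace run_recovery with an actual run of the procedure in an isolated environment; the sweep itself stays the same.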
By selectively performing destructive testing at various critical junctures of your DR processes, you can better understand the tolerances of your plans and find opportunities to make them more robust.