Testing the limits
Written by Peter Davy
When Hurricane Sandy hit, a lack of confidence in the NYSE backup plan left the exchange unable to invoke it, leading to the longest shutdown after bad weather since 1888. But every cloud has a silver lining, writes Peter Davy
Every cloud has a silver lining. So, while last October’s Hurricane Sandy shut the New York Stock Exchange for the longest period since 9/11, it did provide an opportunity to review disaster recovery on Wall Street, according to the exchange’s operator NYSE Euronext.
According to reports in March, the company is preparing to submit details to regulators of a new disaster recovery plan focused on operating without human traders. It has also suggested exchanges should consider whether to make regular testing of connectivity and backup facilities mandatory.
The changes come as a result of the exchange being unexpectedly badly hit by the hurricane. Right up to a few days before it struck, the company maintained it planned to continue operating as usual. In fact, a lack of confidence in its backup plan ultimately meant it called off invoking it, leading to the longest shutdown after bad weather since 1888.
As a company spokesman told reporters, there were lessons to be learned: “At a minimum, businesses learned the importance of a well-prepared and tested business continuity plan.”
For industry experts, though, it’s a lesson that’s proved stubbornly hard to get across. “There are many, many organisations that either haven’t done any form of testing or, if they have, have not tested adequately,” says Mark Parnell-Hopkinson, managing director of Sentronex, which provides disaster recovery, hosting and cloud services. “They really don’t test their plans enough – either the technology failover or the processes.”
That makes it impossible for companies to trust in their plans even if they are efficient. As Chris Needham-Bennett, owner of business continuity consultants Needhams 1834, puts it: “If you don’t rehearse it you won’t have any confidence in it.”
The persistent failure to test regularly is all the more puzzling given advances in technology, particularly cloud computing and virtualisation. Uptake of both has surged lately. A survey of 1,300 IT decision makers by TechTarget in March, for example, found usage of the cloud for disaster recovery and business continuity was expected to rise from 17.9 per cent now to 28.5 per cent in just six months.
Both technologies not only make backing up data easier, but also make testing much less fraught. Virtualisation, for example, which effectively separates the hardware from the system running on it, enables businesses to test a cutover to a backup system without shutting down the original. That’s a big help, since fear of potential disruption has been a major reason why many businesses have been reluctant to test disaster recovery plans in the past.
“Now we can bring up replicas of the customer’s system in a sandbox, in a region where it can’t do any damage, and test it there,” explains Tim Dunger, managing director of Plan B Disaster Recovery. The cloud, meanwhile, makes recovery of data much simpler and quicker.
A number of barriers remain, however. One is that, despite the growth, uptake of the new technology is still a minority sport. TechTarget’s survey found 43 per cent still shipped data to another physical recovery site and 34 per cent transferred it to tape.
“It’s still a journey for many,” says Deloitte’s head of business continuity Rick Cudworth. “Many organisations still have old and legacy systems.”
Parnell-Hopkinson agrees: “A lot of people in non-technical roles making decisions are still five to 10 years behind the technology in terms of what is actually possible.”
Even among those who had taken up the new technology, however, only 25 per cent in TechTarget’s survey reported they had tested the cloud disaster recovery service. Likewise, another recent survey by data recovery specialist Kroll Ontrack and cloud and virtualisation software group VMware found only a third of organisations using the cloud or virtualisation tested data recovery.
One big problem, says Ron Miller, principal consultant at SunGard Availability Services, is a cultural one. “The issue is getting buy-in from senior managers,” he argues. “Too often, they see exercising and testing as an overhead rather than a benefit to the organisation.”
Exposing the gaps
In fact, one of the major benefits of testing is that it forces businesses to focus more broadly than just the technology. Georgine Thorburn, managing director of Document SOS, which specialises in disaster recovery for hard copy, says companies often overlook the importance of their paper-based files – which in some respects have grown in importance as technology has advanced.
More generally, it enables businesses to test their processes, revealing weaknesses that aren’t immediately apparent in the plan. Some become obvious very quickly.
Dunger recalls one test that began with ringing the fire alarm, and staff assembling outside. When told they were to carry on work for the rest of the day from home and the company’s alternative locations, the first thing workers did was to head back to the building to pick up their laptops and bags.
More often it’s the little things. A common problem is that workers don’t have the numbers they need at their remote location, because they usually rely on the speed dial function of their office phone, says Needham-Bennett: “It is that sort of painful detail.”
“There are always lessons that come out of a test,” says Thorburn.
As for how often the plan should be tested, there’s no hard and fast rule, and it will vary by business. A test also has practical implications. Cudworth points out that although a failure to test and a lack of confidence is one reason businesses are reluctant to invoke their disaster recovery plans, another is the inevitable impact on operations.
“Most companies will not have invested in a like-for-like disaster recovery environment. So their production environment, live environment and disaster recovery environment will not be exactly the same,” he says. “Any invocation of disaster recovery therefore has an impact on operations.”
Nevertheless, Thorburn reckons testing should be done twice a year. Mike Osborne, managing director of business continuity at Phoenix, says once a year at the very least. Any less regularly than that and staff turnover, changing roles and changing circumstances mean it becomes increasingly likely a vital link in the plan will have been lost without anyone noticing.
“Any plan that hasn’t been tested for more than 12 months is likely to be decaying very, very quickly,” warns Osborne.
Similarly, it is hard to prescribe exactly what form tests should take. Again, there’s a balance to be struck between rigour and practicality. Needham-Bennett says regular “ten per cent tests”, in which a tenth of the workforce operates from the recovery site, are often enough.
“If it works with ten per cent using the remote machines, it is not unreasonable to think the rest of them will work as well.”
Sink or swim
For all that, there are limits to what can be achieved. SunGard’s whitepaper on the lessons from Sandy noted that even those that had just had a good test risked seeing their recoveries fail in the event of a disaster.
Planning for tests, it observed, was on average a 12-week process: “No hurricane provides that kind of advance notice.”
The result is that tests can lead to overconfidence, according to Dunger.
“People tend to be over-optimistic about what they can achieve in a disaster,” he says. “They think about what they would need to do in an ideal situation where there is no chaos and everybody thinks clearly. In reality, when a disaster happens you are in a very unpredictable environment.”
Thorburn agrees: “Most plans focus on removal to a location with empty desks, and you have to make a booking, so everybody is on their best behaviour.”
There are also other challenges that won’t be revealed by a rehearsal. Awareness of the strain on mobile phone networks during crises has grown in recent years, but largely as a result of actual incidents, such as the London bombings, according to Nigel Gray, a director at PageOne Communications, which offers a range of business communications solutions.
“You are never going to quite replicate that, but I think people are more conscious of the problems now,” he says. Nevertheless, there are still many companies heavily reliant on one or two communication technologies that would be vulnerable in a major incident.
Testing, therefore, has its limits. The alternative, however, is much worse.
It’s not just that failing to test weakens the disaster recovery plan, says Osborne; in some respects an untested plan is worse than not having a plan at all. He likens it to being on a sinking ship and unable to swim. “If you know you can’t swim, then when you jump overboard you make sure you have grabbed hold of a lifebelt,” he says. “If you have a plan that hasn’t been tested, though, then you think you can swim. It’s only when you hit the water you realise you can’t.”