Act One: The Gap

Every DR plan models known failures. Every serious outage is something else.

Disaster recovery documentation is built around scenarios you can name. Circuit A goes down, traffic fails over to circuit B. Carrier X has an exchange event, the SD-WAN reroutes. Power is lost at a site, the generator picks up. These are useful. They are also the easy half of the problem.

In 14 years across IT and connectivity, the pattern I have seen is consistent: when an outage causes real business impact, the trigger is almost never in the runbook. A contractor cuts through the wrong duct. A power event takes out three exchanges in the same region simultaneously. A regional carrier has an incident no one had war-gamed because no one assumed both providers could be affected at once. Something happens that nobody specifically modelled - and the response team is working it out in real time, under pressure, with the business waiting.

A runbook tells your team what to do when X fails. A crash pack tells them what to do when they do not yet know what failed.

- Kyle Andrew, 14 years in IT and connectivity

For an IT director or CIO, this is the uncomfortable question worth asking: if the next outage does not match any scenario in our DR documentation, what do the first 60 minutes actually look like? If the answer is "we work it out," there is a gap. The crash pack is what closes it.

Act Two: The Crash Pack

Aviate. Navigate. Communicate. In that order.

The framing is borrowed from aviation. When something goes wrong in a cockpit, the discipline is fixed: fly the aircraft first, work out where you are going second, talk to anyone else third. The order matters because reversing it - communicating before stabilising - is how problems compound.

The same logic applies the moment a connectivity event starts. The instinct is to fix what is broken. The discipline is to protect what is still working, decide what comes back first, and make ownership unambiguous - before anyone touches the failed component. The crash pack is the document that holds that discipline when the room is loud.

The Crash Pack
Three questions every IT leader should be able to answer in under 60 seconds
  1. Aviate
     What have we got to maintain? Identify what is still functioning and protect it before doing anything else. The most common second failure during an incident is the response to the first one knocking something else over. Stabilise the surviving services. Do not let the recovery effort become the next outage.
  2. Navigate
     What needs to be restored, and in what order? Ruthless prioritisation. The crash pack defines the service hierarchy before the pressure hits, not during it. Trading systems before file shares. Customer-facing voice before internal chat. Whatever the business actually depends on for the next two hours - in writing, agreed in advance, by someone with the authority to make that call.
  3. Communicate
     Who is heading the situation? Not the role. The person. Named, contactable, and authorised to make calls without escalation. Unclear ownership is where mean time to recovery doubles - because three people are partially leading and none of them is fully accountable. The crash pack closes that gap in minute one.

If your team cannot answer these three questions in under 60 seconds the next time something happens, what you have is not a crash pack. It is a hope. And hope is what gets written into incident reports the morning after, alongside the recovery time the board is going to ask about.
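One way to make the three questions testable is to hold the crash pack as structured data and check it for blanks before an incident, not during one. The sketch below is illustrative only: the class names, field names, and example values are assumptions made for this example, not a GTN template.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Owner:
    """The named person heading the situation (a person, not a role)."""
    name: str
    phone: str
    can_decide_without_escalation: bool


@dataclass
class CrashPack:
    # Aviate: services that are still working and must be protected first.
    maintain: List[str]
    # Navigate: agreed restoration sequence, highest priority first.
    restore_order: List[str]
    # Communicate: the single accountable owner.
    owner: Optional[Owner] = None

    def gaps(self) -> List[str]:
        """Return the questions this crash pack cannot yet answer."""
        missing = []
        if not self.maintain:
            missing.append("Aviate: nothing listed to protect")
        if not self.restore_order:
            missing.append("Navigate: no restoration order agreed")
        if self.owner is None or not self.owner.name or not self.owner.phone:
            missing.append("Communicate: no named, contactable owner")
        elif not self.owner.can_decide_without_escalation:
            missing.append("Communicate: owner cannot decide without escalation")
        return missing


# Hypothetical example values, for illustration only.
pack = CrashPack(
    maintain=["customer-facing voice"],
    restore_order=["trading systems", "file shares"],
    owner=Owner("A. Example", "+44 0000 000000", True),
)
print(pack.gaps())  # prints [] because all three questions have an answer
```

An empty result does not prove the plan is good. It only shows that the three answers exist in writing before anyone needs them.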

Act Three: What Sits Underneath

A plan is only as strong as the options it has to work with

A crash pack without underlying diversity is a document with nowhere to go. The plan can name the priority service, the owner, and the sequence - but if the network has no alternative path, the only available action is to wait. Real resilience needs deliberate technology choices at each site, chosen for the role they play in the plan, not chosen because they are the default product.

A tiered estate gives the crash pack room to act. Each site is assessed for what it actually does, what it must hold under failure, and which combination of technologies serves that role. The four below are the building blocks - selected and combined, not deployed uniformly.

Dedicated leased line
Hub primary

Uncontended and geo-redundant, with a contractual SLA, at speeds up to 10Gbps. The right choice for hub sites and critical links where performance cannot vary.

FTTP
High-bandwidth secondary

Full fibre to the premises up to 1Gbps. Fast to provision, widely available. Strong as a warm secondary on hub sites and as a primary on mid-tier locations.

4G and 5G
Carrier-diverse failover

Multi-IMSI networks at 100Mbps+. Carrier diversity by design. The go-to for rapid failover, out-of-band management, and remote or temporary sites.

SoGEA
Cost-tier branch

Copper-based broadband up to 80Mbps without a phone line. Cost-effective for smaller branches where full fibre is not yet available.

A hub site running a dedicated leased line as primary, FTTP as warm secondary, and 4G out-of-band for management access has no single circuit as a point of failure - and gives the crash pack three distinct options to reach for. SD-WAN sits across the top, providing the management and control layer that orchestrates the switchover between circuits. But SD-WAN is not resilience. It is the conductor. The circuits underneath are the orchestra. Without diversity in what is being managed, the management layer has nothing meaningful to do.
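The diversity argument can be sketched as a simple per-site check. This is a minimal illustration under stated assumptions: the `Circuit` fields, the carrier labels, and the `diversity_findings` helper are hypothetical names invented for this example. It only compares technology and carrier labels. It cannot see shared ducts or exchanges, which still need a physical route survey.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Circuit:
    technology: str  # e.g. "leased line", "FTTP", "4G/5G", "SoGEA"
    carrier: str     # hypothetical carrier label
    role: str        # "primary", "secondary", or "out-of-band"


def diversity_findings(circuits: List[Circuit]) -> List[str]:
    """Flag the obvious ways a site can lack an alternative path."""
    if len(circuits) < 2:
        return ["only one circuit: no alternative path"]
    findings = []
    if len({c.technology for c in circuits}) < 2:
        findings.append("all circuits use the same technology")
    if len({c.carrier for c in circuits}) < 2:
        findings.append("all circuits share one carrier")
    return findings


# The hub described above: leased line, FTTP, and 4G for out-of-band access.
hub = [
    Circuit("leased line", "carrier-a", "primary"),
    Circuit("FTTP", "carrier-b", "secondary"),
    Circuit("4G/5G", "carrier-c", "out-of-band"),
]
branch = [Circuit("SoGEA", "carrier-a", "primary")]

print(diversity_findings(hub))     # prints []
print(diversity_findings(branch))  # prints ['only one circuit: no alternative path']
```

A single-circuit branch may be an acceptable cost-tier decision. The point of the check is that the decision is made deliberately, site by site, before the crash pack has to rely on it.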

When you fail over, your attack surface changes

This is the part of the conversation that does not happen often enough at CIO level. The moment a primary circuit fails and the secondary takes over, the operational focus is entirely on restoring service. That is understandable. It is also the precise moment that security controls are most likely to slip - because a backup path that has not been held to the same standard as the primary is not a fallback. It is a gap in the perimeter that opened the second the failover ran.

The same firewall policy. The same DNS filtering. The same private APN on the 4G path. The same inspection on the FTTP secondary. If the security posture on the backup route is not equal to the primary, the business has restored connectivity at the cost of containment - and under live-incident pressure, that gap can stay open for hours longer than anyone intended.

Connectivity restored without security is not a recovery. It is an exposure. Every failover path in the design must hold the same security baseline as the primary, or the crash pack has just made the problem worse.

- GTN Security Practice

The crash pack must include the security check. Not as a step taken later, after the connectivity is back - as a confirmation made before the failover path is left running. Equal posture across every route the network can reach for. Anything less is a recovery the business will regret in the post-incident review.
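As a rough illustration, the security check can be expressed as a comparison between the controls on the primary route and those on each failover route. The control names, policy identifiers, and the `posture_gaps` helper below are assumptions made for this sketch. A real check would read the values from the firewalls and filtering platforms actually in use.

```python
from typing import Dict, List


def posture_gaps(primary: Dict[str, object], backup: Dict[str, object]) -> List[str]:
    """Return the controls on the primary that the backup path does not match."""
    return [
        control
        for control in sorted(primary)
        if backup.get(control) != primary[control]
    ]


# Hypothetical control inventory, for illustration only.
primary_path = {
    "firewall_policy": "policy-v12",
    "dns_filtering": True,
    "traffic_inspection": True,
}
fttp_secondary = {
    "firewall_policy": "policy-v12",
    "dns_filtering": True,
    "traffic_inspection": True,
}
failover_4g = {
    "firewall_policy": "policy-v12",
    "dns_filtering": False,      # the gap this check is meant to catch
    "traffic_inspection": True,
    "private_apn": True,         # path-specific control, not compared here
}

print(posture_gaps(primary_path, fttp_secondary))  # prints []
print(posture_gaps(primary_path, failover_4g))     # prints ['dns_filtering']
```

Run before a failover path is left in service, a non-empty result is the signal to fix the gap or to record that someone with authority accepted the risk.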

The board-level question

If the building does not exist tomorrow - what is the plan?

It is the question that cuts through every theoretical resilience discussion. Not "what happens if a circuit fails," but "what happens if the physical site is gone?" The crash pack is what turns that question from rhetorical to operational - because the answer cannot be assembled in the room while it is being asked.

GTN works with IT leaders to pressure-test the plan before it has to work for real. That means reviewing the design, the failover paths, the security posture across each route, and the sequence the team will actually follow when the first 60 minutes are the ones that matter. The conversation starts with what is critical to the business - and what the infrastructure in place is genuinely capable of when the scenario goes off script.

Pressure-test your crash pack with GTN