Having worked in IT for 20 years now, I’ve learned the hard way that IT can fail at the most inappropriate times.
It doesn’t wait until large projects are complete or on hold. It doesn’t wait until you’ve got the first couple of hours of quiet time in a month when everything is under some sort of control. IT failures typically strike at the exact point in time that you would really rather they didn’t.
Because of this, I’ve long worked to the principle of planning for failure, rather than planning in case of failure. In my various roles in IT (including in my work with IT solutions and services provider Clovertec), I’ve architected, built or consulted on infrastructure designs, and doing so with the mindset that things will fail has produced far more robust systems.
In cases where something has still failed, the general result has been much faster recovery times, because the design was done with failure in mind. There is also a recovery plan already in place, with some scenario-based concepts of what to do in the event of failure. I recall a recent software incident on a virtualised server that resulted in services being down. While the hardware layer was still operational thanks to multiple levels of redundancy, a software corruption caused an interruption. In this example, the recovery plan detailed various options via Veeam Backup & Replication, from guest file recovery to full failover to a replica server. Deciding which option to action took longer than the actual recovery, but overall, systems were restored in less than 10 minutes from a critical IT failure.
This approach undoubtedly pushes up the investment required. However, it’s a case of understanding what you are working with, what the increased costs are and what the operational impact is when things go wrong.
Armed with all this information, you can make educated decisions on what is appropriate.
I’d now like to talk a little about what to do when your IT fails. For me, the most important thing is to remain calm and not act in haste. Far too many times, I’ve seen a rushed decision turn out to be the wrong one and result in far more downtime than there should have been.
5 steps to managing an IT failure:
Step 1: Information gathering
The crucial pieces of information to gather after an IT failure are what has happened, who and how many people are affected, the operational and financial impact per hour of downtime, and whether there’s going to be data loss and, if so, how much.
What’s more, you’ll need to find out if there are any stakeholders that need to be brought into the recovery plan from an operational and financial decision-making point of view.
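To make the information gathering concrete, here is a minimal sketch in Python of how those facts could be captured in one place and turned into a rough downtime cost. The class name, field names and every figure are hypothetical and purely for illustration; your own numbers will come from the business.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentSnapshot:
    """Hypothetical record of the facts gathered in Step 1."""
    summary: str               # what has happened
    users_affected: int        # how many people are affected
    cost_per_hour: float       # operational and financial impact per hour of downtime
    data_loss_hours: float     # estimated window of lost data (0 if none)
    stakeholders: List[str]    # people needed for operational/financial decisions

    def downtime_cost(self, hours_down: float) -> float:
        # Simple linear estimate: impact per hour multiplied by hours down.
        return self.cost_per_hour * hours_down


# Illustrative values only (assumptions, not real data).
snapshot = IncidentSnapshot(
    summary="Software corruption on a virtualised server",
    users_affected=40,
    cost_per_hour=1500.0,
    data_loss_hours=0.0,
    stakeholders=["Operations manager", "Finance director"],
)

print(snapshot.downtime_cost(0.5))
```

Even a rough figure like this helps when you need to justify a recovery option to the stakeholders you have just identified.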
Step 2: Planning
Now equipped with all the information you can gather, it’s time to start planning the steps required at a fairly high level.
At this point, it’s also worth considering anything that you might want to mitigate against or have a roll back plan for, and any actions that might be required around change control.
There could be a process within the organisation around emergency change control and how this can bypass the normal approval process.
Step 3: Additional details
If any of the high-level steps require further detail or consideration, now is the time to do it. Once there is a plan of action in place, discuss this with anyone that is appropriate. I wouldn’t advise going at this alone – an IT solutions and services provider will be able to support you with this.
Step 4: Implementation
At this point, the pressure and stress will be mounting as all these people will be looking to you to save the day.
Remember to remain calm. Hand out tasks within the recovery plan where possible and (this bit is crucial) try to document in detail what you are doing.
This will help in two ways: firstly, if you need to roll anything back, you’ll know what you have changed, and secondly, once the dust has settled, you’ll want to update your IT documentation.
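As a rough illustration of documenting as you go, the sketch below keeps a timestamped log of each action alongside how to undo it. It is only one possible approach, and the class names, people and actions are all hypothetical; a shared document or ticket would do the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Action:
    timestamp: datetime   # when the change was made (UTC)
    who: str              # who made it
    what: str             # what was changed
    rollback: str         # how to undo it


@dataclass
class ActionLog:
    entries: List[Action] = field(default_factory=list)

    def record(self, who: str, what: str, rollback: str) -> None:
        # Append the action with the current UTC time.
        self.entries.append(Action(datetime.now(timezone.utc), who, what, rollback))

    def rollback_plan(self) -> List[str]:
        # Undo steps in reverse order: the most recent change is undone first.
        return [action.rollback for action in reversed(self.entries)]


# Hypothetical entries for illustration only.
log = ActionLog()
log.record("Alex", "Stopped the application service", "Start the application service")
log.record("Sam", "Restored config file from backup", "Put back the pre-restore config copy")

for step in log.rollback_plan():
    print(step)
```

Listing the undo steps in reverse order means the most recent change is rolled back first, and the same log gives you the raw material for updating your IT documentation afterwards.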
Step 5: Reflection
You need to take time relatively soon after the event to look back at what happened, why it happened, what action you took, what you have learnt, and what steps you need to take to prevent a similar scenario happening again in the future.
All the above shouldn’t be considered for too long, as these steps are being taken while systems are down. It should all be done within the first hour and possibly less for the more minor disruptions.
Information is really the key bit to take away from this blog. After all, information is power and creates a foundation for the best decisions to be made. Without it, you’ll be acting blind and could end up making things worse.
This is where an IT solutions and services provider like Clovertec can help. Our dedicated team can provide rapid resolutions to any IT issue and have done so proudly over the last twelve years!
If you’d like any further information about this or would like to discuss how we can support you, please don’t hesitate to get in touch.