Managing High Risk IT Incidents
Back to school is a harrowing time for almost anyone involved in delivering technology-based services in education. At this time of the year, Unicon often receives calls from business or IT leaders desperate for additional help in resolving a technology-related crisis. Unicon staff has extensive experience resolving application, performance, and infrastructure-related incidents. We have distilled a set of best practices for managing through high risk/high visibility IT incidents. While there are extensive best practices to avoid technology failures, it is inevitable that problems will arise and eventually you will be called on to work through a crisis. This post will highlight selected crisis management best practices to minimize impact and restore service levels more quickly.
- Treat everyone with respect - Even when tensions run high, be kind, courteous, and thankful to everyone involved. Yelling at the poor vendor support tech or your own staff will not solve the problem any faster. Worse, disrespectful actions and words break trust bonds, reduce commitment levels, and can accelerate staff departures when the crisis is over. Send thank-you notes to vendor staff, internal staff, and even spouses. Long-running incidents can take a toll on other relationships; if you acknowledge that and demonstrate appreciation, team strength, commitment, and cohesion will grow instead of eroding. Also, don't play the blame game during the incident - there will be a time for root cause analysis and lessons learned. Involve all the vendors or team members necessary and do so with a partnering approach.
- Maintain calm - Panic and chaos will not speed resolution. Leadership, both technical and business/academic, needs to remain outwardly calm. It is advisable to recognize the severity and impact of the incident, but avoid the "we're all going to get fired" hyperbole.
- Get expert help fast - "Expert help" can mean internal or external resources. Know who your battlefield commanders are - the people who have extensive experience in your environment and in resolving prior crises or incidents. They have excellent triage and diagnostic skills, are often methodical in their thinking, and always seem to demonstrate an intuitive understanding of where to focus. Do not hesitate to call in outside resources - either vendors or other domain experts. Often, it is a useful sanity check just to confirm that the internal staff is focused on the right areas, and in many cases outside perspectives can identify other creative solutions.
- Move fast, maintain rigor - Do not get stuck in analysis paralysis. Move quickly from triage (characterizing the problem) to action. Often, internal debates arise regarding the cause, conflicting symptoms or data further mask the issue, or multiple failures/issues are at play. Resolution typically follows three steps - triage/data gathering, decision-making (what do we do), and taking action. In complex incidents, that cycle may repeat for quite some time. When the decision-making debate begins to stall progress, shift action goals: Is there a proposed action that might resolve the issue and, if it does not, will at least give new insight? If there is debate over a proposed action, turn it into an opportunity to learn more about the issue. The best "battlefield commanders"/crisis managers do this instinctively, but you can also train your teams to act this way. Also, do not let the need for speed short-circuit your processes - accelerate them. The incident team becomes a special case of the Change Advisory Board, with changes approved and recorded in minutes, not hours or days. Run code changes through functional testing so that you do not compound the problem by introducing additional issues. If you need technology execs on the team to accept accountability and accelerate decision-making, make sure they're involved or have clearly delegated decision-making authority.
- "Project Manage" the incident - Taking a project management (PM) approach is critical when the crisis starts to span multiple days. Maintain (revise and evolve) an action plan that is prioritized so that the tasks most likely to resolve or improve the situation come first. Pursue multiple solutions in parallel and manage resources to do so. For complex issues, many teams may be involved: vendors, development, QA, performance testing, and infrastructure/ops. Coordinating the activities to get potential fixes into production requires PM skills. Set clear goals for the teams working the issue, e.g. deploy a given code fix by 8 AM tomorrow. Also, schedule regular checkpoint meetings (daily for long incidents) to review the actions taken and the resulting data, review hypotheses and proposals, and communicate the re-prioritized goals and actions. Ensure there is adequate and clear decision-making to avoid deadlock on priorities. Lastly, manage people to avoid burn-out: recognize who has not gotten enough rest or is showing signs of burn-out, plan "shift changes" and knowledge transfer, and recognize in advance when other skills and staff will be needed (e.g. if a patch is coming from internal dev or a vendor, prep the deployment team).
- Communicate - Openness and transparency with stakeholders and customers is always the best policy. Recognize the impact on your stakeholders, confirm that the issue has the full attention of the team, and provide some visibility into the actions being taken to restore service. Also, be candid about when the issue will be resolved - highlight the actions underway and the expected outcomes, but if resolution is still unclear be honest about that. Ensure that technology leadership is visible, present, and available for questions.
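The prioritized action plan and the fast approve-and-record loop described in the practices above can be kept in a spreadsheet or ticketing tool. As one illustration only, here is a minimal Python sketch. The class names, fields, scores, and sample actions are all assumptions made for this example, not a Unicon tool or a prescribed method.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Action:
    """One proposed action on the incident action plan (hypothetical structure)."""
    description: str
    owner: str
    resolve_likelihood: float  # team's rough estimate (0.0-1.0) that this fixes the issue
    insight_value: float       # rough estimate (0.0-1.0) of what is learned if it does not
    done: bool = False


@dataclass
class IncidentLog:
    """Prioritized action plan plus a timestamped record of approved changes."""
    actions: list[Action] = field(default_factory=list)
    changes: list[tuple[datetime, str, str]] = field(default_factory=list)

    def prioritized(self) -> list[Action]:
        # Open actions, most likely to resolve the issue first;
        # diagnostic value breaks ties.
        open_actions = [a for a in self.actions if not a.done]
        return sorted(
            open_actions,
            key=lambda a: (a.resolve_likelihood, a.insight_value),
            reverse=True,
        )

    def approve_and_record(self, action: Action, approver: str) -> None:
        # Record who approved what and when, so the accelerated
        # process still leaves a change record.
        self.changes.append((datetime.now(timezone.utc), approver, action.description))
        action.done = True


# Example usage with made-up actions and scores.
log = IncidentLog()
log.actions.append(Action("Restart app server pool", "ops", 0.3, 0.6))
log.actions.append(Action("Deploy vendor patch", "deployment team", 0.7, 0.4))

next_action = log.prioritized()[0]
log.approve_and_record(next_action, approver="delegated decision-maker")
```

Scoring each action on both its chance of resolving the issue and what it would teach the team if it fails mirrors the advice to favor actions that yield new insight either way. Re-sorting at each checkpoint meeting keeps the plan current as data comes in.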
As noted above, IT crises are inevitable, even with the best teams and infrastructure. How a technology team responds in a crisis can materially affect the trust and commitment level of customers/clients, user communities, and even the technology team itself. Great crisis management can actually improve customer trust and loyalty as well as IT team cohesion. The best practices summarized above have been developed by Unicon's staff, which collectively has extensive experience managing through high risk/high visibility incidents. We hope that you can apply these in your own environment. We would also welcome the opportunity to have more focused discussions regarding these topics. Feel free to contact Unicon at www.unicon.net/contact.