IT Service Continuity Management (ITSCM): Explained
It's true for all modern operations: every business is at risk. The level of risk varies from a temporary system disruption to a full cyber-security disaster. Proactively assessing and planning for the unforeseen is critical to a company’s ability to bounce back. Ensuring business resilience starts with IT Service Continuity Management (ITSCM). Let’s take a deep dive into ITSCM.
What is IT Service Continuity Management?
ITSCM is a risk assessment and planning measure, and a key IT Infrastructure Library (ITIL) service delivery component. The focus of this service is planning for incident prediction, prevention, and management should a disaster-level incident occur.
On an operational level, the aim is to ensure services keep performing as well as possible when the inevitable incident occurs. This is achieved by having processes and procedures in place to restore services to their previous condition and status as quickly as possible. The goal of a successful ITSCM service is to maintain the Confidentiality, Integrity, and Availability (CIA) of technology infrastructure and services, and to keep performance at the highest possible level before, during, and after a disaster-level incident. The ITSCM plan plays a major role in the overall Business Continuity Plan (BCP).
Having a clear, well-documented plan can help a company easily navigate the multitude of factors that can impact the resumption of “normal services” after an incident.
Disaster Prepping
This idea can suggest canned food storage for a zombie apocalypse, but that’s not what we’re referring to. First, let’s define what we mean by a disaster. The Business Continuity Institute defines a disaster as: “A sudden unplanned event that causes great damage or serious loss to an organisation. It results in an organisation failing to provide critical business functions for some predetermined minimum period.”
Three key terms in this definition are subjective to each business: great damage or serious loss, critical business function, and minimum period of time. The first step in the ITSCM process is to define each of those variables for the business in question.
Great Damage or Serious Loss
Great damage or serious loss is usually measured in monetary value. The loss might take several forms, such as an app going down, services being unavailable, reputational damage, or a loss of data, but the monetary value of the loss is the most common measure for ITSCM. For example, a commerce site may know exactly how much an hour of disruption costs, while other organisations may not be able to put a precise monetary value on it. Where exact values cannot be determined, which is often the case, organisations use their experience and judgment to estimate the cost of a loss of service over a given period.
Critical Business Function
Critical business functions are business activities or processes that must be restored after a disruption to protect the organisation’s assets, meet organisational needs, and satisfy regulations. Be careful when assessing critical business functions, as they may not always be the ones you expect. For example, suppose a company decides that “Payroll” is not a critical business function because its revenue does not depend on Payroll. If the service goes down at month-end and staff cannot be paid, how will this affect the business’s reputation and workforce morale? Be sure to assess all possible scenarios.
Minimum Period of Time
A business may decide it can stand a loss of four hours of service and put in place processes and procedures to work on any incidents to restore its service within this period. But what happens when it approaches four hours of no service and still no resolution is in sight? What’s the plan now? Should they halt rectification work and invoke the “disaster recovery” plan? Having a plan and a procedure is critical: who has the authority to invoke a recovery plan? How are they reachable? Are there any contracts guaranteeing a service won’t be down longer than a certain amount of time? Determining the answers to these questions is an integral step in the ITSCM plan.
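The escalation question above can be sketched as a simple check. This is only an illustration, not a prescribed ITSCM procedure: the four-hour tolerance comes from the example, while the 75% early-warning threshold and the names `OutageStatus` and `check_outage` are assumptions made here.

```python
from datetime import datetime, timedelta
from enum import Enum


class OutageStatus(Enum):
    """Hypothetical escalation stages; the wording is illustrative."""
    WITHIN_TOLERANCE = "continue incident resolution"
    APPROACHING_LIMIT = "alert the person authorised to invoke recovery"
    LIMIT_EXCEEDED = "decide whether to invoke the disaster recovery plan"


def check_outage(started_at, now, tolerance=timedelta(hours=4), warn_fraction=0.75):
    """Classify an ongoing outage against the agreed minimum period.

    tolerance is the four hours from the example above; warn_fraction is
    an assumed early-warning threshold, not part of any standard.
    """
    elapsed = now - started_at
    if elapsed >= tolerance:
        return OutageStatus.LIMIT_EXCEEDED
    if elapsed >= tolerance * warn_fraction:
        return OutageStatus.APPROACHING_LIMIT
    return OutageStatus.WITHIN_TOLERANCE


# Example: an outage that began at 09:00 and is still open at 12:30.
start = datetime(2024, 1, 1, 9, 0)
status = check_outage(start, start + timedelta(hours=3, minutes=30))
print(status.name, "->", status.value)
```

The check only flags the decision point. Who is authorised to invoke the plan and how they are reached still has to be written down in the plan itself.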
ITSCM and Business Continuity Management (BCM)
Business Continuity Management is a process managed outside IT that identifies risks to the business and works to mitigate those risks. Some risks may be IT-related, including disaster-level incidents such as cyber-security threats, and some risks may be outside IT control, such as natural disasters or facility fires.
Since BCM encompasses ITSCM as well as other risk-mitigation processes, it makes sense for IT teams to work closely with the BCM team to create the following:
- Business continuity plan (BCP) - includes plans for prevention and recovery from disaster-level IT incidents
- Business impact analysis (BIA) - identifies the potential business impact of an IT disaster
Why does ITSCM matter?
Organisations with clear plans for service recovery will recover quicker and more fully in the case of prolonged service disruptions. ITSCM isn’t about planning for everyday outages. It’s about addressing worst-case scenarios and ensuring that if they happen, they will cause minimal disruption to the lives of both customers and employees.
The obvious benefits of a good ITSCM practice are:
- If disaster strikes, a good ITSCM plan means essential services will be back up and running quickly.
- Because the organisation is always prepared for a major incident, the impact on revenue, reputation, and other areas is minimised.
- Everyone across the business understands what will happen in case of a major service disruption and how long they can expect systems to be down.
Incidents can occur at any time, day or night. They range from the interruption of a mission-critical service, such as an internet outage or an application crash, to something seemingly simple, such as a hard drive running out of capacity and setting off a cascade of events that eventually becomes equally “disastrous”. Having a well-documented recovery plan and procedures for service continuity will help reduce delays in restoring your services.
How does ITSCM work?
Step 1: Identify Risks
Not every risk can be covered, so it’s best to focus on those that are probable, given the business’s own past experience and that of similar businesses in the same industry sector or physical location (location helps determine which environmental risks are most likely).
Identify the significant risks that are probable and create a list. To get started, let’s look at a list of common risks. Over a one-year to three-year planning horizon, the following scenarios often pose a threat to businesses:
- Environmental Risks - hardware is at risk of floods, fires, powerlines being downed by high winds, etc.
- Malicious Risks - outside threats to security
- Planning Risks - sabbaticals, long-term leave, etc.
- Succession Planning - people planning to leave at different stages of their careers, retiring, etc.
- Single Points of Failure (SPOF) - a flaw in the design, implementation, or configuration of a circuit or system. A SPOF will bring the entire system down in the event of a malfunction.
- Obsolescence - technology obsolescence means large investments to stay current
Once a list of risks has been determined, the next step is to tabulate those risks on the Risk Matrix.
On the risk matrix, risks are measured by two indices: probability and impact.
Probability is the measure of the likelihood of an event happening. Since the probability will never be 0% or 100%, how should the likelihood of a future event be measured? A measure often used is the frequency of the event happening in the past. For example, the event in question may have never happened, happened rarely, happened a few times, or be happening all the time. Even if an event hasn’t happened to you, it may be a common occurrence in your industry sector or physical location (e.g. banks are targets for hacking, and sites on a flood plain are prone to flooding). Each business should take its own judgment and experience into account when determining probability, and record the rating it assigns to each risk.
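As a rough sketch of this frequency-based approach, past occurrences can be mapped onto the four levels mentioned above (never, rarely, a few times, always). The cut-off values, the 1-to-4 scale, and the `sector_rating` parameter are illustrative assumptions; each business would choose its own.

```python
def probability_rating(events_observed, years_observed, sector_rating=1):
    """Map past frequency to a rating from 1 (never happened) to 4 (always happening).

    sector_rating is an assumed floor reflecting how common the event is
    in the industry sector or physical location, even if it has not
    happened to this business. All thresholds are illustrative.
    """
    if years_observed <= 0:
        raise ValueError("years_observed must be positive")
    per_year = events_observed / years_observed
    if per_year == 0:
        own_rating = 1  # has never happened
    elif per_year < 0.2:
        own_rating = 2  # rarely: less than once in five years
    elif per_year < 1:
        own_rating = 3  # a few times
    else:
        own_rating = 4  # once a year or more
    # Never rate a risk lower than the sector or location suggests.
    return max(own_rating, sector_rating)


# A site on a flood plain that has not flooded in ten years still
# inherits the assumed location rating of 3.
print(probability_rating(events_observed=0, years_observed=10, sector_rating=3))
```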
How is the impact measured? The metric varies from business to business, but it is often measured as “loss of revenue”. A rough estimate of the Potential Loss Per Hour can be obtained by dividing the annual revenue the company generates by the number of operational hours in the year.
Annual Revenue / Operational Hours in a Year = Potential Loss Per Hour
For example, in an office environment, operating 9 to 5, Monday to Friday, there are about 2,000 hours a year to earn revenue. If the annual revenue is around £5 million, the potential loss is roughly £2,500 per hour of service disruption.
This is a very basic calculation. Other factors such as reputational damage, contractual penalties, legal or regulatory fines, and loss of opportunity will add to the hourly loss rate and should be taken into account.
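The calculation can be written out as a short sketch. The function names are hypothetical, and `additional_costs` is an assumed catch-all for penalties, fines, and similar factors, which would need to be estimated separately.

```python
def potential_loss_per_hour(annual_revenue, operational_hours_per_year):
    """Annual Revenue / Operational Hours in a Year = Potential Loss Per Hour."""
    if operational_hours_per_year <= 0:
        raise ValueError("operational_hours_per_year must be positive")
    return annual_revenue / operational_hours_per_year


def estimated_outage_cost(annual_revenue, operational_hours_per_year,
                          outage_hours, additional_costs=0.0):
    """Revenue lost during the outage plus any separately estimated costs."""
    hourly = potential_loss_per_hour(annual_revenue, operational_hours_per_year)
    return hourly * outage_hours + additional_costs


# Figures from the example: about 5 million in revenue over about 2,000 hours.
hourly_loss = potential_loss_per_hour(5_000_000, 2_000)
print(hourly_loss)  # 2500.0

# A four-hour outage with a made-up 10,000 contractual penalty.
print(estimated_outage_cost(5_000_000, 2_000, outage_hours=4, additional_costs=10_000))
```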
Step 2: Addressing the Risks
Once the key risks have been identified, there are several options open to address them:
- Avoid: Located in an area prone to coastal flooding? Make the risk probability as close to 0% as possible by moving the business location away from the coast.
- Reduce: Take action now to reduce the probability and lower the impact, for example by installing flood defences.
- Transfer: Give the problem to someone else! Usually, this is in the form of insurance coverage.
- Share: Mitigate the impact by spreading the burden, usually within supply chain agreements, sector partnerships, neighbourhood agreements, etc.
- Accept: Do nothing and accept the full impact of the event should it happen.
A business’s risk appetite shapes many of these decisions over time. It is a way of weighing risk against reward. A business may judge that a construction company digging through its cable connection to the outside world is very unlikely. It has assessed the cost of installing a second cable to mitigate this event and, because it has a high appetite for risk, decides not to install a second connection. Another business may also consider the risk very unlikely, but its appetite for risk is low, so it invests in mitigating that risk by paying for a second connection.
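The cable example can be sketched with a simple risk matrix score. Multiplying a probability rating by an impact rating is a common convention rather than something this article prescribes, and the 1-to-4 scales, the threshold values, and the function names are all assumptions for illustration.

```python
def risk_score(probability, impact):
    """Risk matrix score: probability rating multiplied by impact rating."""
    return probability * impact


def decide_treatment(probability, impact, appetite_threshold):
    """Accept the risk if its score is within the business's appetite.

    appetite_threshold is a hypothetical number: the highest score the
    business is willing to live with without further action.
    """
    if risk_score(probability, impact) <= appetite_threshold:
        return "accept"
    return "mitigate"


# Severed cable: very unlikely (1) but severe impact (4), so the score is 4.
high_appetite = decide_treatment(1, 4, appetite_threshold=6)
low_appetite = decide_treatment(1, 4, appetite_threshold=3)
print(high_appetite)  # accept: no second connection
print(low_appetite)   # mitigate: pay for a second connection
```

Both businesses rate the risk identically. Only the threshold differs, which is what the two outcomes in the example come down to.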
Step 3: Plan for the Worst
This is the heart of ITSCM. When the key risks have been identified, determine which of those risks are IT-related. From a business perspective, the goal of ITSCM is to reduce the downtime, costs, and business impact of disaster-level incidents. On a more tactical level, objectives include:
- Working closely with BCM to protect overall business continuity
- Creating and managing plans for IT service continuity and recovery in case of disaster
- Working with vendors to minimise the impact of any downtime in their products and services, as it relates to the business
- Analysing risk and impact and revising plans accordingly over time
How Wicresoft Can Help
Our experts will work in partnership with the key stakeholders in an organisation to help them identify the main areas of risk and recommend how to plan for and address those areas. Wicresoft works with businesses to put together a well-documented recovery plan for the continuity of technology infrastructure and services with the objective of reducing downtime, unnecessary costs, and the business impact of disaster-level incidents. Most importantly, Wicresoft makes building your Business Resilience easy.