Defining Metrics for Problem Management
Many people define KPIs for their IT service management processes by looking in books (such as ITIL Service Operation) or by copying metrics that other organizations use. This rarely gives good results, because KPIs need to INDICATE the PERFORMANCE of the KEY things you care about (that’s why they’re called Key Performance Indicators). In the worst cases, I have seen ITSM processes with huge numbers of so-called KPIs that are measured and reported even though nobody uses the values to drive any changes in behaviour or improvements in business outcomes.
I recently wrote a blog titled Defining Metrics for Change Management, in which I explained how you can create KPIs that support what you are trying to achieve. A number of people contacted me after reading that blog to ask for examples of how to derive KPIs for other ITSM processes. I decided to write this blog about problem management KPIs because this is one process where many organizations I have worked with had very poor KPIs. Remember that you shouldn’t just copy the outcomes, critical success factors (CSFs) and KPIs that I describe here. Instead, use them to understand the approach and methodology I have used, then think about what is important to you and derive metrics that measure the things you care about.
The first step in defining good KPIs is to identify the objectives of problem management: what outcomes does problem management help us to achieve? For me there are two key outcomes of a good problem management process:
- Reducing the number of incidents that occur
- Reducing the business impact of incidents that can’t be avoided
We could just measure the number of incidents and the overall business impact of incidents. These would certainly be valuable things to know, but I’m not sure they’d show how well problem management has been working, because so many other factors could have contributed. So I will break these down a bit and identify some problem management CSFs that could contribute to these outcomes:
- Identify problems that have caused multiple incidents
- Implement workarounds that reduce the impact of incidents
- Initiate changes that reduce the number of incidents
It’s worth noting that I didn’t mention root cause analysis (RCA). I see many problem management people who only think about RCA, but RCA doesn’t deliver any benefit by itself; it’s just a technique that we use in problem management. The worst problem management KPIs that I see are “Average time to root cause”, “Percentage of problems with RCA complete in 3 days”, or similar. These KPIs drive behaviours that we really don’t want: they encourage problem management people to declare that they have found “the” root cause of a complex situation, rather than continuing to analyse and understand it after they have identified one significant contributory factor.
One of my customers has a process for prioritising problems that takes account of the frequency and business impact of the problem, including the mitigation provided by any workarounds that are in place. They then have a KPI of “Average time to reduce problems to P3 priority.” This reduction can be achieved by resolving the problem, or by implementing a good workaround. The point is that they are measuring problem management based on how well they are reducing pain to the business. I’m not going to suggest that KPI here because it requires a fairly sophisticated approach to problem prioritisation, which not many IT organizations can achieve, but if you can measure this then it’s certainly something you could think about.
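To make this kind of measure concrete, here is a minimal Python sketch of how an average time to reach P3 could be calculated. It is not the customer's actual method: the record layout, the problem IDs, the dates and the convention that 1 is the most severe priority are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical priority history: each problem maps to (timestamp, priority) events.
# Assumption: 1 is the most severe priority and higher numbers are less severe.
history = {
    "PRB001": [(datetime(2024, 1, 1), 1), (datetime(2024, 1, 5), 3)],
    "PRB002": [(datetime(2024, 1, 2), 2), (datetime(2024, 1, 4), 2),
               (datetime(2024, 1, 12), 4)],
    "PRB003": [(datetime(2024, 1, 3), 1)],  # has not reached P3 yet
}


def days_to_p3(events):
    """Days from the first recorded event to the first event at P3 or less severe."""
    events = sorted(events)
    opened = events[0][0]
    for when, priority in events:
        if priority >= 3:
            return (when - opened).days
    return None  # the problem has never been reduced to P3


def average_days_to_p3(history):
    """Average over problems that reached P3; problems still above P3 are skipped."""
    durations = [days_to_p3(events) for events in history.values()]
    reached = [d for d in durations if d is not None]
    return mean(reached) if reached else None


print(average_days_to_p3(history))
```

With this sample data, PRB001 takes 4 days and PRB002 takes 10 days to reach P3, so the average is 7 days. Problems that have not yet reached P3 are left out of the average, so it would be sensible to report how many of those there are alongside it.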
Here are some suggested KPIs that might help to demonstrate the CSFs I have listed above. Remember you shouldn’t just copy these – use a similar process to identify KPIs that will measure what you care about.
CSF1 - Identify problems that have caused multiple incidents
- Increased percentage of incidents associated with a problem record or known error
- Top 5 problem report created every month
CSF2 - Implement workarounds that reduce the impact of incidents
- Increased percentage of incidents for which a knowledge base article provided the solution
- Increased percentage of incidents closed by users using self-service incident management
- Reduced impact of incidents associated with previous months’ top 5 problems
CSF3 - Initiate changes that reduce the number of incidents
- Reduced number of incidents associated with previous months’ top 5 problems
- Reduced backlog of outstanding problems
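Several of the KPIs listed above can be calculated from a simple export of incident records. The following Python sketch shows one possible way to do it. The field names (`problem_id`, `resolved_by_kb`) and the sample records are hypothetical and do not come from any particular ITSM tool, so you would need to map them to whatever your own tool records.

```python
from collections import Counter

# Hypothetical incident export; the field names are assumptions for illustration.
incidents = [
    {"id": "INC1", "problem_id": "PRB001", "resolved_by_kb": True},
    {"id": "INC2", "problem_id": "PRB001", "resolved_by_kb": False},
    {"id": "INC3", "problem_id": None, "resolved_by_kb": False},
    {"id": "INC4", "problem_id": "PRB002", "resolved_by_kb": True},
]


def percent(part, whole):
    """Percentage of part in whole, returning 0.0 when there are no records."""
    return 100.0 * part / whole if whole else 0.0


def pct_linked_to_problem(incidents):
    """CSF1: share of incidents associated with a problem record or known error."""
    linked = sum(1 for i in incidents if i["problem_id"])
    return percent(linked, len(incidents))


def pct_resolved_by_kb(incidents):
    """CSF2: share of incidents where a knowledge base article gave the solution."""
    solved = sum(1 for i in incidents if i["resolved_by_kb"])
    return percent(solved, len(incidents))


def top_problems(incidents, n=5):
    """CSF1: the n problems linked to the most incidents, as (problem_id, count)."""
    counts = Counter(i["problem_id"] for i in incidents if i["problem_id"])
    return counts.most_common(n)


print(pct_linked_to_problem(incidents))
print(pct_resolved_by_kb(incidents))
print(top_problems(incidents))
```

Running the same functions on each month's incidents gives the trend that the “Increased…” and “Reduced…” wording depends on. Filtering the incidents to those linked to the previous top 5 problems before counting would give the CSF3 measure in the same way.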
I have worded these KPIs as “Increased…” or “Reduced…” because I don’t have the data needed to set explicit targets. As you make use of metrics like these you can put in place numerical targets, based on the baseline that you create when you first start measuring and reporting.
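As a rough illustration of moving from a baseline to a numerical target, the Python sketch below checks whether a KPI has moved in the intended direction and derives a target from the baseline. The function names and the improvement percentages used in the example are assumptions; the right figures depend entirely on your own baseline and objectives.

```python
def meets_direction(baseline, current, direction):
    """True if the KPI moved the intended way ("increase" or "reduce")."""
    if direction == "increase":
        return current > baseline
    if direction == "reduce":
        return current < baseline
    raise ValueError("direction must be 'increase' or 'reduce'")


def target_from_baseline(baseline, direction, improvement_pct):
    """Numerical target that improves on the baseline by improvement_pct percent."""
    delta = baseline * improvement_pct / 100
    if direction == "increase":
        return baseline + delta
    if direction == "reduce":
        return baseline - delta
    raise ValueError("direction must be 'increase' or 'reduce'")


# Hypothetical example: 40% of incidents were linked to a problem record in the
# baseline month, and the team chooses to aim for a 10% relative improvement.
print(target_from_baseline(40.0, "increase", 10))
print(meets_direction(40.0, 46.0, "increase"))
```

Here the improvement is relative to the baseline rather than a number of percentage points. Either convention can work, as long as the report states which one is being used.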
How well do your problem management metrics measure what your customers care about? Is it time to review your problem management KPIs and align them with your CSFs and objectives?