Problem-solving, and particularly the part that focuses on root cause analysis (RCA), has always been a topic of special interest to me. Two questions in particular have long lingered in my mind: (1) whether we can speak of one root cause or should speak of multiple (root) causes; and (2) whether we should speak of the root cause or rather the root condition. This is the final post in a series of six, in which I have tried to explain how rigorous problem-solving logic (using an example) can help us answer these questions. At the same time, I hope the example and the logic will be useful in your own problem-solving efforts or your coaching thereof.
In the first post, I set the stage by defining the starting points and some basic initial concepts (like problem, cause, agent, target, event and Tripod Beta’s causal diagramming technique). I also introduced an example that I will use throughout the series. In the second post, I added some more concepts (necessary condition, defensive and control barriers). The third post traced back the causal event chain while introducing and applying concepts like causing events, the initial causing event, and the initial active cause. In the fourth post, we descended to the systemic level of our problem. We further explored the concept of barriers and made the link to standards in Lean thinking. And I introduced the problems of occurrence and non-detection and how this shows us the direction in which to find the root of the problem. In the fifth post, we again went into the causal event chain, but now at the systemic level. And we discussed the problem of people not adhering to the standard.
In this sixth and final post, I will summarize the analysis, say something about prioritizing counter-measures and conclude on the two questions. I will end the series with some words on how this can help you in your problem-solving efforts and your coaching of problem-solving teams.
Working on Systemic Problems
At this point in the problem investigation, we can summarize the problem analysis in the following ten key points:
Specific level:
Problem: person with a head injury
Specific level causing event chain:
1. Prime event: person’s head hits floor
2. Initial active cause: rolling pipes
3. Initial causing event: pipe loaded onto pipe stack
4. Necessary condition: wrong vehicle (without stakes) in use during loading
5. Problem of occurrence: non-adherence to the standard of using a flatbed with stakes
6. Problem of non-detection: no pre-loading check of vehicle to be used
Systemic level:
Problem (systemic): non-adherence to the standard of using a flatbed with stakes
Systemic level causing event chain:
7. Necessary condition cause: schedule contained the wrong vehicle
8. Initial causing event: scheduler selected the wrong vehicle
9. Problem of occurrence: non-adherence to the loading standard prescribing a flatbed with stakes
10. Problem of non-detection: no self-inspection of schedule by scheduler nor any input poka-yoke/error-proofing in scheduling system
Figure: overall causal diagram for the example.
Coming back to Tripod Beta, I do want to emphasize that in Lean we are trying to improve our system of work through the establishment and improvement of our standards. This means we should focus on elements that are part of the organization’s system of work (and not outside it) and that can be created or improved by the organization. Furthermore, as we have said that we do not seek to blame the individual person, we should be careful not to wander too far off into the pre-conditions, and we should always verify whether they are still related to our system of work (our standards).
Non-adherence, in my opinion, can only be prevented when there is an effective standard for detection that will keep a person from making an error, i.e., a control poka-yoke. If we cannot conceive of a control poka-yoke, we can only minimize the risk. Error-proofing standards, for me, also represent an aspect of respect for humanity in Lean thinking.
Please note that causes at the specific, actual level are active agents or targets. At the systemic level, however, causes are conditions. From that point of view, it would be even better to speak of a root condition and of root condition analysis (RCA) instead of root cause and root cause analysis.
The Root Condition
Based upon this, the systemic root condition can now be defined as:
- a (partially) unavailable or inadequate (in the sense of incorrect) standard related to the occurrence of the necessary condition cause, or
- a (partially) unavailable or inadequate (in the sense of ineffective) standard related to the non-detection of the necessary condition cause when there is a standard for non-occurrence that is both available and adequate (i.e., correct).
To complete the problem or root condition analysis (RCA) of our example, therefore, we can now conclude that (a) as there was a correct loading standard prescribing the correct vehicle (a flatbed with stakes) and (b) we want to focus on our system of work and not the person, the systemic root condition of the head injury was a missing error-proofing check on the schedule that should have detected the wrong vehicle that was in the schedule.
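The two-part definition above can be read as a simple decision rule: first check the standard for occurrence, and only if that one is available and correct, check the standard for detection. The following is a minimal sketch of that rule, not a definitive implementation. The names `Standard` and `root_condition` and the two boolean fields are illustrative assumptions that do not appear elsewhere in this series.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Standard:
    """A standard in the system of work (hypothetical model for illustration)."""
    name: str
    available: bool  # does the standard (fully) exist?
    adequate: bool   # correct (for occurrence) or effective (for detection)?


def root_condition(occurrence: Standard, detection: Standard) -> Optional[str]:
    """Return a description of the systemic root condition, or None if none is found."""
    # Case 1: the standard for occurrence is unavailable or incorrect.
    if not (occurrence.available and occurrence.adequate):
        return f"unavailable or incorrect standard for occurrence: {occurrence.name}"
    # Case 2: the occurrence standard is available and correct, so we look
    # at the standard that should have detected the deviation.
    if not (detection.available and detection.adequate):
        return f"unavailable or ineffective standard for detection: {detection.name}"
    return None


# The example from this series: the loading standard was correct,
# but there was no error-proofing check on the schedule.
loading_standard = Standard("loading standard (flatbed with stakes)", True, True)
schedule_check = Standard("error-proofing check on the schedule", False, False)

result = root_condition(loading_standard, schedule_check)
print(result)
```

Because the rule returns at the first failing standard, it always yields at most one root condition, which is the point taken up again in the conclusion of this post.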
Establishing Other Systemic Controls and Defenses
Now it is time to come back to the necessary causes and the barriers that are downstream. We already concluded that an effective barrier downstream (for instance, the hard hat) could have prevented the problem. The hard hat is an example of a defensive barrier. Other controls and defenses (systematic or individual) can also effectively stop the causal chain before it arrives at the prime event that was the trigger for the current problem investigation.
Look at the list of systemic problems that were found and that could help us improve our system of work:
a) No protective mat for the hard floor
b) No check on safety (hard hat) compliance during loading
c) No railing (or other means) on vehicle to prevent person from falling
d) No clear safety policy on personnel standing on a vehicle during or after loading
e) No pipe stops on vehicle
f) No pre-check on vehicle type before loading activity starts
g) No visuals on vehicle to aid in detecting deviation from required standard
h) No cross-check on vehicle in schedule by the receiving Team Lead
i) No self-inspection of the schedule by the scheduler
j) No error-proofing check (poka-yoke) by scheduling system on vehicle selection
If the team leader had had and respected a pre-loading check, if the vehicle had had visuals to indicate its scope of application, if there had been a policy, and it had been respected, for people not to stand on a vehicle while loading pipes (or even after), if there had been a railing on the vehicle, if… A lot of “ifs”. Any of these ifs could have prevented the prime event. These barriers can be seen as the protection layers in a Layers of Protection Analysis (LOPA) that mitigate the risks. This doesn’t mean, however, that the problem at the source did not exist. It did. But we would not have uncovered the problem at the source.
Hence the name Root Condition Analysis (RCA): you track back the chain until you get to the source. And don’t let yourself get distracted along the journey. This is an important role for the coach of the problem-solving team. The coach has to keep the team focused and on track.
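To illustrate the point about the many "ifs", here is a minimal sketch in which the causal chain reaches the prime event only when no barrier along the way is effective. The `Barrier` class, the two helper functions, and the selection of four barriers are assumptions made for illustration. The sketch uses simple yes/no logic and is not a quantitative LOPA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Barrier:
    name: str
    effective: bool  # True only if available, adequate and adhered to


def chain_reaches_prime_event(barriers: List[Barrier]) -> bool:
    """The causal chain runs through to the prime event only if no barrier stops it."""
    return not any(b.effective for b in barriers)


def passed_before_stop(barriers: List[Barrier]) -> List[str]:
    """Names of the failed barriers the chain passed before it was stopped."""
    passed = []
    for b in barriers:
        if b.effective:
            break
        passed.append(b.name)
    return passed


# Barriers ordered from the source (upstream) to the prime event (downstream).
as_found = [
    Barrier("error-proofing check in scheduling system", False),
    Barrier("pre-check on vehicle type before loading", False),
    Barrier("railing on vehicle", False),
    Barrier("hard hat", False),
]
# Hypothetical variant: only the downstream hard hat is effective.
with_hard_hat = [Barrier(b.name, b.name == "hard hat") for b in as_found]

print(chain_reaches_prime_event(as_found))       # True
print(chain_reaches_prime_event(with_hard_hat))  # False
print(passed_before_stop(with_hard_hat))
```

In the variant with the hard hat, the injury is prevented, yet the chain still passes three failed barriers upstream. Those are exactly the problems that would have stayed hidden.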
The explained logic also supports Heinrich’s well-known “safety pyramid” (Heinrich’s Law: in a workplace, for every accident that causes a major injury, there are 29 accidents that cause minor injuries and 300 accidents that cause no injuries).
Figure: Heinrich’s well-known safety pyramid.
And this just as well applies to all other problems. Among other things, Lean builds a visual control system to detect these deviations from standard as early as possible, to immediately stop problem propagation and thus to give ourselves more time to eradicate the problem before it impacts personnel or customers.
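To make the 1:29:300 ratio quoted above concrete, the short calculation below converts it into shares. It takes Heinrich's numbers at face value, as an illustration only. The variable names are my own.

```python
# Heinrich's ratio as quoted in the text: 1 major : 29 minor : 300 no-injury.
HEINRICH = {"major injury": 1, "minor injury": 29, "no injury": 300}

total = sum(HEINRICH.values())  # accidents per one major injury
shares = {kind: count / total for kind, count in HEINRICH.items()}

for kind, share in shares.items():
    print(f"{kind}: {share:.1%}")
```

Under this ratio, roughly nine out of ten accidents cause no injury at all. Each of those is a deviation that a visual control system could surface before a similar chain of events ends in harm.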
Source Control
The reason why it is important to focus on the causal chain of causing events is that it is always more effective, and definitely more efficient, to control a problem upstream in its causal chain, preferably at its source. As Ritsuo Shingo recently said during a session that I had the pleasure of attending: “it is easiest to cross a river at its source”.
In the case we discussed, it would have been enough to stop the further chain of events if the system had not allowed the scheduler to select the wrong vehicle. All counter-measures downstream would theoretically not even be required if we could prevent this initial causing event from happening using a proper standard of detection (the lack of which we identified as the root condition).
And again, don’t get me wrong: eliminating necessary conditions downstream will effectively prevent the problem. But it is too easy. By working only on controls and defenses downstream in the causal chain, we will not uncover the underlying problems in our basic system of work. And we will not prevent intermediate problems from developing. Intermediate problems are also problems that we don’t want in our system of work, even if their consequences are not always as severe as a head injury. A standard is the currently known best way to perform our work. A deviation from that standard (a problem) always has negative consequences; otherwise, it wouldn’t be the currently known best way. Working on system problems at the source will eliminate many actual problems at the same time, possibly through the eradication of only one actual problem. I hope you can see the multiplying effect of proper root condition analysis.
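The source control described above, a scheduling system that does not allow the wrong vehicle to be selected, might look roughly like the sketch below. It assumes a hypothetical scheduling function. `Vehicle`, `StandardViolation`, `schedule_pipe_load` and the vehicle IDs are invented for illustration and do not describe any real scheduling system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vehicle:
    vehicle_id: str
    has_stakes: bool


class StandardViolation(ValueError):
    """Raised when a schedule entry would deviate from the loading standard."""


def schedule_pipe_load(vehicle: Vehicle) -> dict:
    """Create a schedule entry, refusing vehicles the loading standard excludes."""
    # Control poka-yoke: the check sits at the point of selection, so the
    # wrong vehicle cannot enter the schedule in the first place.
    if not vehicle.has_stakes:
        raise StandardViolation(
            f"{vehicle.vehicle_id}: pipe loads require a flatbed with stakes"
        )
    return {"load": "pipes", "vehicle": vehicle.vehicle_id}


# A flatbed with stakes is accepted.
entry = schedule_pipe_load(Vehicle("FB-01", has_stakes=True))

# A flatbed without stakes is rejected at the moment of scheduling.
try:
    schedule_pipe_load(Vehicle("FB-02", has_stakes=False))
    rejected = False
except StandardViolation:
    rejected = True

print(entry, rejected)
```

The check rejects the selection rather than merely warning about it. A warning can be overlooked or ignored, whereas a rejection stops the initial causing event itself.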
Logically Coaching Problem-Solving Teams
From Logic to Method and Rules
If you try to summarize the logic presented in this series in a “method”, i.e., a series of steps or actions accompanied by rules, the key ingredients of the method would be the following:
1. Start by describing the actual, specific problem in the form of an actual state versus the desired state of a target. Use the target and the actual state in your wording, e.g., the person’s injured head.
2. Define the prime event in which the agent changed the target into its problem state. Use both agent and target connected by a verb in the description of the event, e.g., the person’s head hits the hard floor.
3. Identify which path to take further upstream: agent or target? At the specific problem level, this is the agent or target that was active in the causing event from which you depart. We defined that as the cause. So, you follow the path of the cause. In our case, the floor was passive (a condition); it was the target (the person) that was active, so you follow that route.
4. The result of the preceding causing event is again a problem. So, in fact, you are back at step 1. Describe the state of the active agent or active target (the cause) as a problem, i.e., as a deviation from a desirable state, as explained in step 1. In our case, the person is falling.
5. You continue to move upstream like this until there is no active agent or active target that is in a problematic state (so consequently there is no cause upstream). In our case, the pipe that was being loaded was active but did not in itself represent a problem.
6. The problematic state following the initial causing event can then be defined as the initial active cause of the problem that came into existence in the prime event. In our case, the rolling pipes were the initial active cause of the head injury.
7. To avoid getting distracted while tracing back the path to the initial active cause, I recommend adding (unavailable, not-adhered-to and inadequate) barriers only at this stage of the investigation. You could, however, also add these event by event as you progress through the event chain. But you do have to make sure that the team doesn’t get lost in too much discussion of downstream barriers at this stage of the investigation. We first need to get to the initial active cause.
8. Now it is time to move on to the systemic level. You take another look at the initial active cause and ask yourself what passive, problematic situation (or what necessary condition) was already in place before the initial active cause was produced. Put differently, you look at which unavailable, inadequate or breached barrier allowed the necessary condition to exist. You thereby take the path of the problem of occurrence, i.e., the relation (or arrow if you will) between the necessary condition and the initial causing event. In the example, this was the wrong vehicle (without stakes as a barrier) that was in use during loading.
9. You then start another causal event chain (as in step 1), taking the necessary condition as the starting point. You identify the causing event that led to this necessary condition, including the (active) agent and the (active) target. In our case, it was the (active) team leader taking the wrong, (passive) vehicle before loading commenced.
10. Just as at the level of the actual, specific problem, you continue to trace back the causal event chain of the necessary condition to its initial causing event, which we called the necessary condition cause. In the example, it was the schedule with the wrong flatbed selected.
11. Just as in step 8, you look at which unavailable, inadequate or breached barrier allowed the necessary condition cause to come into existence, again focusing on the problem of occurrence. In our example, we found that the loading standard was not respected during scheduling, which allowed the wrong schedule to occur.
12. You finish by identifying the root condition. The root condition is either an unavailable or inadequate standard that gave rise to the necessary condition cause (a problem of occurrence), or, in case of non-adherence, an unavailable or inadequate standard that should have protected the target from being impacted by the agent.
Figure: the overview of the method to get to the root condition of a problem.
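The upstream tracing at the specific level (moving from the prime event to the initial causing event and the initial active cause) can be sketched as a small loop. This is a simplified illustration under my own assumptions. The `CausingEvent` structure and the `trace_upstream` function are hypothetical, and the intermediate event "rolling pipes hit person" is assumed here only to connect the events named in this post.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CausingEvent:
    description: str        # agent and target connected by a verb
    active_cause: str       # state of the active agent or target entering the event
    cause_is_problem: bool  # is that state a deviation from the desired state?


def trace_upstream(chain: List[CausingEvent]) -> Tuple[CausingEvent, Optional[str]]:
    """Walk from the prime event upstream (chain is ordered prime event first).

    Returns the initial causing event and the initial active cause, i.e. the
    problematic state that the initial causing event produced.
    """
    initial_active_cause = None
    for event in chain:
        if not event.cause_is_problem:
            # No problematic active agent or target further upstream: stop here.
            return event, initial_active_cause
        initial_active_cause = event.active_cause
    raise ValueError("chain ends before an unproblematic cause was reached")


chain = [
    CausingEvent("person's head hits floor", "person is falling", True),
    # Assumed intermediate event, simplified for this sketch.
    CausingEvent("rolling pipes hit person", "rolling pipes", True),
    CausingEvent("pipe loaded onto pipe stack", "pipe being loaded", False),
]

initial_event, initial_cause = trace_upstream(chain)
print(initial_event.description)  # pipe loaded onto pipe stack
print(initial_cause)              # rolling pipes
```

The same loop could be reused at the systemic level by starting from the necessary condition instead of the prime event. At that level, the stopping point is the necessary condition cause.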
Multiple Root Causes or Single Root Condition?
Let’s also go back to my initial two questions:
- Is there one root cause or are there multiple (root) causes?
- Should we speak of the root cause or the root condition?
On the first question, my answer would be that we should speak of one rather than many. If we rigorously follow the presented logic, each event has either an active cause (agent or target) or, when we arrive at the initial event, a condition related to a standard. As for the condition, we have either an unavailable or incorrect standard or, if not, an unavailable or ineffective standard for detection. In any case, I always arrive at one.
When teams speak of multiple root causes, very often they have jumped to conclusions and interpreted downstream conditions as causes. Sometimes, I also witness a lack of rigor in the problem-solving process, whereby brainstorming about hypothetical causes takes the place of rigorous logic and verification of the evidence. A list of all possible causes may quickly lead to the idea of multiple causes being present at the same time and interacting to produce the undesirable outcome.
On the second question, I think we can safely conclude that a root cause, in fact, is a root condition. We have seen that a cause is an active agent acting upon a target, or an active target exposing itself to an agent in a causing event. But when we search for systemic problems – problems in our system of work – we are trying to identify situations or states that allow causing events to happen. The root “cause” (or better: condition) was defined as (1) a (partially) unavailable or inadequate (in the sense of incorrect) standard related to the occurrence of the necessary condition cause, or (2) a (partially) unavailable or inadequate (in the sense of ineffective) standard related to the non-detection of the necessary condition cause when there is a standard for non-occurrence that is both available and adequate (i.e., correct).
Implications for Problem-Solving Teams and Coaches
Although this series of posts will most probably be seen by many as quite theoretical, I sincerely hope that enough people have invested their time in reading the whole series and have tried to follow the thinking behind it. If you did, I welcome your feedback and thoughts to further improve our thinking.
As said, I am convinced that logic and rigorous verification of evidence are two key ingredients of effective and efficient problem-solving. Managers and coaches who guide problem-solving teams should, therefore, be very much aware of the concepts and the logic underlying RCA. It will help teams be more effective as well as more efficient in eradicating their problems. And it will help managers and coaches to become better managers and coaches.