Remember the title of this blog has the word ‘BASICS’ in it, so we are sticking to the fundamentals like blocking and tackling. Everyone has their own way to use such cause-and-effect tree expressions and I am just conveying my preferred one which is consistent with the PROACT RCA Methodology.
Cause-and-Effect Logic: from level to level, it represents a cause-and-effect relationship. This does not have to be a linear relationship as there may be multiple causes that have to occur at the same time, in order to create that effect. We just need to know that we are simply creating a graphical expression of logic to reflect the facts that occurred, to cause an undesirable outcome.
Event: The reason you care! What brought this incident to your attention? Many believe that we do RCA on incidents themselves, but I believe we do RCA on their consequences. Think about it at your place, there is usually a business level reason we do RCA…injury/fatality, certain $ production loss exceeded, certain $ maintenance cost exceeded, regulatory violation and the like (often called triggers). Sound familiar? These are the known FACTS.
Failure Mode: These are the typical things we normally start an RCA with, like Pump Failure, Injury, Loss of Production, Environmental Excursion, etc. These led to the Event. These too are FACTS.
Hypotheses: Just like in high school, these are ‘educated guesses’. These are potential causes of the preceding nodes. The initial questioning after the Failure Mode is ‘How could the preceding node have occurred?’
Verifications: These are the ways in which we proved, with sound evidence, that Hypotheses were True or Not True. Fun fact…hearsay is NOT a valid verification technique.
Physical Root Causes: These are where the physics of failure root out. These are observable, tangible things we can see. These are usually the immediate consequences of decision errors.
Human Root Causes: THIS IS NOT ‘THE WHO DUNNIT’! This is the act of decision-making. These are usually errors of omission and/or commission. We did something we were not supposed to, or we were supposed to do something, and didn’t. The key here is to NOT BLAME and take the opportunity to understand human reasoning.
Latent/Systemic Root Causes: These are the organizational systems, cultural norms and sociotechnical factors that influence and contribute to our decision-making. Unfortunately, our ‘systems’ are far from perfect and are always works in progress. They include, but are not limited to, our policies, procedures, training systems, purchasing systems, HR systems, compliance systems and the like.
Contributing Factors: These are items which did not directly lead to the failure but created vulnerabilities that allowed the failure to occur. These are usually conditions that we don’t have control over, but we can often compensate for them (if we are aware of them). For instance, some failures may only occur when it’s freezing outside. This is a condition that we can’t change, but we can compensate for it in order to mitigate its potential consequences.
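For those who like to see the building blocks written down, here is a minimal sketch of how the node types above might be held in a simple data structure. The class names, field names and the example labels are my own assumptions for illustration; they are not part of the PROACT methodology or its software. Verifications are modeled here as fields on each node rather than as separate nodes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NodeKind(Enum):
    """Node labels taken from the definitions above."""
    EVENT = "Event"
    FAILURE_MODE = "Failure Mode"
    HYPOTHESIS = "Hypothesis"
    PHYSICAL_ROOT = "Physical Root Cause"
    HUMAN_ROOT = "Human Root Cause"
    LATENT_ROOT = "Latent/Systemic Root Cause"
    CONTRIBUTING_FACTOR = "Contributing Factor"


@dataclass
class Node:
    label: str
    kind: NodeKind
    # None = not yet verified; True / False = proven with sound evidence
    verified: Optional[bool] = None
    # How the node was verified (hearsay does not count)
    evidence: str = ""
    # Lower-level causes; several may be needed at once to create the effect
    causes: List["Node"] = field(default_factory=list)

    def add_cause(self, child: "Node") -> "Node":
        """Attach a lower-level cause and return it for chaining."""
        self.causes.append(child)
        return child


# Hypothetical example: the Event and Mode are facts, the hypothesis is open.
event = Node("Production loss exceeded trigger", NodeKind.EVENT,
             verified=True, evidence="production report")
mode = event.add_cause(Node("Pump Failure", NodeKind.FAILURE_MODE,
                            verified=True, evidence="work order history"))
hypothesis = mode.add_cause(Node("Bearing Failure", NodeKind.HYPOTHESIS))
```

Notice that the Event and Failure Mode start out verified because they must be facts, while a Hypothesis stays unverified until evidence says otherwise.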
OK, now let’s put these pieces of the puzzle together and show how to reconstruct an undesirable outcome!
In Figure 1, as mentioned earlier, the EVENT is the reason we care enough to commission an RCA. In our example, the event is ‘Unexpected Downtime Due to Pump-235 Failure’. Now the MODES are going to be how we have experienced such failures in the recent past. Most of our CMMSs can produce these high-level modes. In this case, such downtime due to this pump failure has been attributed to failed shafts, bearings and motors. Our data can also tell us the annual cost of each of these modes (hopefully downtime $ + labor $ + materials $). In our case, we know that bearing failures on this pump represent the highest annual cost. To make a quick business case for your failure, try out this Free Chronic Failure Calculator (CFC).
Figure 1: Event + Modes = Top Box [All must be FACTS]
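The annual cost comparison behind the Top Box can be sketched in a few lines. The dollar figures below are made up purely for illustration (they are not from our pump example), and the dictionary layout is just one assumed way to hold what a CMMS export might give you.

```python
# Hypothetical annual figures per failure mode, in dollars.
# These numbers are illustrative only and do not come from a real CMMS.
annual_costs = {
    "Shaft":   {"downtime": 40_000,  "labor": 6_000,  "materials": 9_000},
    "Bearing": {"downtime": 120_000, "labor": 15_000, "materials": 12_000},
    "Motor":   {"downtime": 60_000,  "labor": 8_000,  "materials": 20_000},
}


def total_cost(parts):
    """Annual cost of one mode = downtime $ + labor $ + materials $."""
    return parts["downtime"] + parts["labor"] + parts["materials"]


# Total annual cost for each mode
totals = {mode: total_cost(parts) for mode, parts in annual_costs.items()}

# The mode with the highest annual cost is where we start drilling down
worst_mode = max(totals, key=totals.get)

print(totals)
print("Start with:", worst_mode)
```

With these assumed numbers the bearing mode comes out on top, which mirrors the choice made in our example.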
The Mode level is what we consider our FACT LINE to start. If we start with facts, and provide our hypotheses with sound validations, we will end with facts. Keep in mind we are traveling down the path of the physics of the failure, so we will continually ask the same question, ‘How Could’.
As you use a logic tree to explore the physics of failure, imagine you have the luxury of a video recorder in your head and you are watching the event as it’s played in reverse. In our case, ‘be the bearing’. Ask yourself, ‘How could I have just failed?’. Move back in short increments of time. It takes some getting used to this type of thinking, but that is the beauty of the logic tree: it guides us without any biases. This tool, when used properly, should be non-personal and non-threatening. We are interested in valid hypotheses, possibilities…that’s it. Then we will use evidence to demonstrate which hypotheses were true and not true. We will only continue drilling down on the ones that are true.
In our case, based on the SMEs (Subject Matter Experts) on our team, we conclude there are only four (4) ways in which a component can fail: Erosion, Corrosion, Fatigue and Overload. So, we list them as shown in Figure 2.
Figure 2: Hypothesizing and Validation
In our example, we have our on-staff metallurgist visually inspect the failed bearing. They determine with certainty, from a visual review, that the bearing failed due to Fatigue. No additional exhaustive testing like scanning electron microscopy is needed. This makes the other hypotheses NOT TRUE.
Same questioning, ‘How could we have fatigue of the failed bearing?’. SMEs indicate it could be from either Thermal or Mechanical Fatigue. The metallurgist confirms Mechanical Fatigue.
Our team now asks, ‘How can we have Mechanical Fatigue?’ The prevailing opinion is a sole hypothesis of High Vibration. A review of our PM histories demonstrates this hypothesis to be true.
Questions only beget more questions, as that’s what effective RCA analysts do for a living; they ask the right questions. So, ‘How could we have had High Vibration?’. Our RCA team members collectively come up with: Resonance, Misalignment, Imbalance and Looseness. Evidence is pooled together to validate these hypotheses, and the team determines that only Misalignment is valid. The journey continues!
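The drill-down so far can be sketched as a small tree walk: list the hypotheses, mark each one true or not true based on evidence, and only keep digging under the true ones. The hypotheses and outcomes below come from our pump example; the nested dictionary layout and the function name are simply my assumptions for illustration.

```python
# Each entry maps a node label to (verified_true, lower-level hypotheses).
tree = {
    "Bearing Failure": (True, {
        "Erosion": (False, {}),
        "Corrosion": (False, {}),
        "Fatigue": (True, {
            "Thermal Fatigue": (False, {}),
            "Mechanical Fatigue": (True, {
                "High Vibration": (True, {
                    "Resonance": (False, {}),
                    "Misalignment": (True, {}),
                    "Imbalance": (False, {}),
                    "Looseness": (False, {}),
                }),
            }),
        }),
        "Overload": (False, {}),
    }),
}


def verified_paths(nodes, prefix=()):
    """Yield each chain of verified-true nodes, from the top to the deepest true node."""
    for label, (is_true, children) in nodes.items():
        if not is_true:
            continue  # never drill down on a hypothesis proven not true
        path = prefix + (label,)
        if any(child_true for child_true, _ in children.values()):
            yield from verified_paths(children, path)
        else:
            yield path  # nothing true below this node yet


paths = list(verified_paths(tree))
for path in paths:
    print(" -> ".join(path))
```

Only one chain survives the evidence, and it ends at Misalignment, which is exactly where the next round of questioning picks up.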
Figure 3: Continued Hypothesizing and Root Labeling
How could we have ended up with misalignment? This is where we are now crossing over from the physics of failure to the human and systems side of failure (or the social sciences). Either someone misaligned the pump during initial installation or repair, OR it was aligned correctly and then became misaligned in operation. Vibration histories demonstrate that since the last installation, this pump has chronically had vibration issues. Notice here that we switched the label on the High Vibration node from a Hypothesis to a Physical Root. This is because this is the first visible consequence after the triggering decision.
Notice that after the decision point, everything is triggered on its own, as cause-and-effect linkages go into play. If there are no human interventions to break the error chain, then it will play out and contribute to the undesirable outcome (Event).
We are at a pivotal point in our Logic Tree at this time. Why? Because we have uncovered a decision point. The mechanic in our case chose to align the way they did, on that day. A ‘decision’ point is our cue to identify a human root, and to switch our questioning to ‘Why’ instead of ‘How Could’. We are not interested in learning the infinite reasons the human ‘could have’ made the decision; we are interested in ‘why’ they did. This is also the point where the logic tree switches from deductive reasoning to inductive.
So let’s drill down further and see if we can figure out what was going on in the mechanic’s mind that day!
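The switch in questioning can be written down as a simple rule. This is only a sketch; the function name and the wording of the questions are my own assumptions, not an official PROACT script.

```python
def next_question(node_label, is_decision_point):
    """Return the question to ask beneath a verified node.

    Above a decision point we explore the physics of failure ('How could').
    At a decision point we ask about the actual reasoning ('Why').
    """
    if is_decision_point:
        return f"Why was this decision made: {node_label}?"
    return f"How could {node_label} have occurred?"


# Physics of failure: still asking 'How could'
print(next_question("High Vibration", is_decision_point=False))

# Human root: the questioning switches to 'Why'
print(next_question("Pump aligned the way it was", is_decision_point=True))
```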
Figure 4: Continued Root Labeling
So, in Figure 4, after interviewing our mechanic (using human performance interviewing techniques), we uncover many things that we did not know.
Chances are, all of these ‘Systemic Roots’ have contributed to other failures as well, individually or in combination. This is because most systems are put in place for a multitude of people to use, under a variety of conditions. This particular combination of system flaws converged on this day to influence the well-intended mechanic’s decision.
Free suggestion: if it was happening to this mechanic, we should check the skill level of others who may be victims of their own systems as well. This is when we determine if the recommendation/corrective action is isolated to a single case or more universal.
Lastly, I put a lingering Contributing Factor in our logic tree to make a point. When do we stop digging?
My personal rule-of-thumb to answer this question is, ‘When the solution is obvious’! I tend to not drive down to see who set up the flawed system because I don’t care. It is not value-added, because what benefit do I get from finding that out (especially if they are not there anymore)? Also, if drilling deeper gets into issues outside our fences, is it of value? In most cases we will not have control of things like changing regulations. However, in many cases, if we see an OEM design flaw, we may opt to have our engineering department pursue that path with the OEM. But as the analyst in the field, I can hand that part off and move on to my next RCA.
Is everything covered in this blog about ‘RCA’…absolutely not. This is why my title includes the word BASICS. I will post some links at the end of this blog where you can learn more about holistic RCA, but I wanted to get you started with the basics.
Figures 5, 6 and 7 are presented as simple job aids to help those starting out in the field as RCA analysts learn the fundamentals of constructing a logic tree. I hope you find these job aids of value.
Remember, ‘We NEVER seem to have the time and budget to do things right, but we ALWAYS seem to have the time and budget to do things again!’. Let’s do RCA right the first time so we don’t have to analyze the same Event again😊!
Figure 5: PROACT Logic Tree Reference Guide (Part I)
Figure 6: PROACT Logic Tree Reference Guide (Part II)
Figure 7: PROACT Logic Tree Reference Guide (Part III)
About the Author
Robert (Bob) J. Latino is CEO of Reliability Center, Inc., a company that helps teams and companies do RCAs with excellence. Bob has been facilitating RCA and FMEA analyses with his clientele around the world for over 35 years and has taught over 10,000 students in the PROACT® methodology.
Bob is co-author of numerous articles and has led seminars and workshops on FMEA, Opportunity Analysis and RCA, as well as co-designer of the award winning PROACT® Investigation Management Software solution. He has authored or co-authored six (6) books related to RCA and Reliability in both manufacturing and in healthcare and is a frequent speaker on the topic at domestic and international trade conferences.
Bob has applied the PROACT® methodology to a diverse set of problems and industries, including a published paper in the field of Counter Terrorism entitled, "The Application of PROACT® RCA to Terrorism/Counter Terrorism Related Events."