In the original post, A Mechanic's Story: Basic Component Fatigue, we took a detailed journey through the physical side of a shaft failure RCA. We stopped at the physical cause of that failure: parallel misalignment. However, stopping at the component level of failure does not constitute a credible and thorough RCA. In fact, stopping at this level is more along the lines of a Shallow Cause Analysis (SCA). So let's explore what makes the difference between a Shallow Cause Analysis and a Root Cause Analysis (RCA).
In the previous post we stopped at parallel misalignment. We will continue drilling down from that point. We ask, 'How could we have had parallel misalignment?' Our team of subject matter experts (SMEs) hypothesizes that either 1) the pump was misaligned at installation or 2) it became misaligned during operations.
As with all the other hypotheses in the logic tree, these have to be validated. Vibration histories show a trend of vibration-related issues on this pump since its inception. Further investigation finds that the technician who did the alignment was not necessarily qualified to perform such a task. So in this case the evidence points to an issue with how we are aligning. We cross off the hypotheses that are not true and follow the evidence to the ones that are true.
At this point we switch from dealing with 'hard' issues (component related) to dealing with 'soft' issues (human related). The logic tree thinking shifts as well, from deductive logic on the component or physical side (general to specific) to inductive logic on the human and systems side (specific to general). The key point for an analyst to understand here is that the questioning shifts from "how could?" to "why?" at this point. THIS IS A CRITICAL POINT IN THE CONSTRUCTION OF A PROPER LOGIC TREE AND AN EFFECTIVE RCA.
So we continue on with 'Why would the technician have aligned the pump in the manner he did (improperly)?' Now we are in the mind of the decision-maker, exploring their rationale for why they made the decision they did, at the time they did. We have to take into consideration all the factors affecting their decision (e.g., environment, time pressures, training, communications, systems).
Our SME team groups the possibilities into two main hypotheses: 1) they didn't have the proper tools to do the job correctly and/or 2) they didn't know how to align properly in the first place. So now we have to explore the evidence to support or refute these hypotheses.
An inspection of the alignment tools reveals they were less than adequate (LTA) and not kept up to OEM and company standards. So even if the technician knew how to align properly, they could not do so with the equipment they were provided.
On the knowledge side, how would we be able to see if the technician was qualified to align or not? In my experience, if someone knows you are watching their work, that is the time they will surely do their best! What can we conclude if they are not aligning properly, when someone is watching? Chances are, they do not know how to align properly.
So in our case, both of these hypotheses are true. Therefore we continue to drill down past the decision-maker, and into their reasoning for why they made the decisions they did, at the time they did.
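The logic tree built so far can be sketched as a small data structure. This is only an illustration, not part of PROACT or its software: the names `Node`, `Status`, and `verified_paths` are hypothetical, and marking the "during operations" hypothesis as refuted is an assumption based on the evidence pointing to the alignment itself. Each node records the question it answers, so the switch from "how could?" to "why?" is visible in the tree.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    OPEN = "open"          # not yet checked against evidence
    VERIFIED = "verified"  # evidence supports the hypothesis
    REFUTED = "refuted"    # evidence rules it out, so it is crossed off


@dataclass
class Node:
    statement: str
    question: str  # the question this node answers: "how could?" or "why?"
    status: Status = Status.OPEN
    children: List["Node"] = field(default_factory=list)


def verified_paths(node: Node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every chain of verified nodes, starting at the top of the tree.

    A chain stops where a verified node has no verified children.
    Open and refuted hypotheses are never followed.
    """
    if node.status is not Status.VERIFIED:
        return []
    path = prefix + (node.statement,)
    deeper = [p for child in node.children for p in verified_paths(child, path)]
    return deeper or [path]


# Hypothetical encoding of the pump example from this post.
tree = Node("Parallel misalignment", "how could?", Status.VERIFIED, [
    Node("Misaligned at installation", "how could?", Status.VERIFIED, [
        Node("Technician aligned the pump improperly", "how could?", Status.VERIFIED, [
            # Below the decision, the questioning switches to "why?".
            Node("Alignment tools were less than adequate", "why?", Status.VERIFIED),
            Node("Technician did not know how to align properly", "why?", Status.VERIFIED),
        ]),
    ]),
    # Assumed crossed off for this sketch.
    Node("Became misaligned during operations", "how could?", Status.REFUTED),
])

for path in verified_paths(tree):
    print(" -> ".join(path))
```

Running it prints the two chains we keep drilling on, one ending at the tools and one ending at the technician's knowledge. The refuted branch never shows up, which is the "cross off and follow the evidence" step in code form.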
Let's start with the tools/equipment side. Why would technicians not be provided adequate tools to do their jobs? A review of related systems reveals that there were no annual reviews and inspections of such tools/equipment. So no system existed to ensure, on an annual basis, that the front lines had the proper tools to do their jobs the best they could. An important point as we jump from a decision to the systems that influenced it is that we may have uncovered a system deficiency that affects more than the failure we are looking at.
Think about it. If the systems are non-existent for proper review and inspection of alignment tools/equipment, chances are we better check our other tools/equipment. A good analyst will always seek to see if the system deficiency is isolated to the case they are analyzing, or if the issue is more universal. In this case, when we get to recommendations, they will likely be applied universally.
Let's look at the knowledge and skill side of this human element. Why would the technician align improperly? Remember, we cannot let hearsay and emotion drive an analysis. We have to let sound evidence and intellect drive the analysis. So how can we prove why someone is not aligning properly?
In this case, we find the technician who normally did the alignments (and was trained and qualified to do so) had retired. A void was left behind for this skill, and the newest hire was then given the task to conduct these alignments. No formal training was provided. This new technician had assisted the retiree in the past, so it was felt he had enough OJT (on-the-job training) to do the job correctly. Unfortunately that was not the case.
Also, from a human standpoint, a young new technician is typically unlikely to tell their bosses they don't know how to do something. They feel lucky to have a good-paying job and don't want to jeopardize their good fortune. Plus, those who practice 'Just Culture' are few and far between these days. Most organizations do not have an open environment where technicians feel safe telling their bosses they don't know how to do something.
For improper alignment to continue, there has to be someone misaligning as well as someone allowing the poor work to continue. Where was the management oversight? Shouldn't someone have recognized they had a person in a position who was not qualified? It is rare that RCAs include this introspection because it is hard for leadership to look in the mirror and acknowledge they are part of the problem! As mentioned before, is this system flaw related only to this failure, or is it likely to be happening in other areas of the facility/organization? I think you know the answer :-)
Lastly, interviews with the technician reveal he did learn a lot from his predecessor, but the reality was that production time pressures forced him to skip steps in order to expedite the alignment process. Rather than be seen as the one holding up production, he took shortcuts. Again, this scenario is not uncommon, as you well know.
So now we are at the age-old question of "Where do we stop?" If one wanted, they could take such an analysis back to the beginning of time. That would not be a good use of our time. So how deep do we go? There is no canned answer here, as each facility will be different in its practices and culture. My rule of thumb is to drill down as far as I can, and identify where I have control and/or influence to do something about it (in terms of appropriate countermeasures).
I may be a mechanic who only has influence and control over the physics of the failure. At that point I may turn over the analysis to people above my pay grade who can properly deal with the human and systems issues. How deep we end up going will be the difference between a shallow cause analysis and a root cause analysis.
In this graphic the blue areas represent the upper level of the logic tree or the physics of the failure/component level (Failure Cause Analysis). Then there is a decision involved (Decision Analysis). Dr. Dekker refers to this point as 'Sensemaking'. Beneath that decision are the reasons for the decision. So how deep do we want to go? At what point is it not value-added anymore to drill down at the plant?
Typically we can have most influence and control at the latent root/system flaws and cultural norm levels. Sociotechnical factors are normally outside the control of the facility and far removed from the front lines. These are issues like where our regulations come from, how we are insured, how our legal policies affect behavior on the floor, etc. These are typically external influences on the organization.
Such factors are value-added, but only to someone who has the clout to make changes regarding these broad based factors.
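The stopping rule above can be sketched in a few lines. This is a rough illustration under stated assumptions: the level keys, the `split_findings` helper, and the idea of describing an analyst by a set of levels are all hypothetical, and the findings are simply the ones uncovered in this post, tagged by level.

```python
from typing import Iterable, List, Set, Tuple

# Levels of the logic tree, top to bottom, as named in this post.
LEVELS = [
    "physical",        # Failure Cause Analysis (component level)
    "decision",        # Decision Analysis
    "latent_system",   # latent root / system flaws
    "cultural_norm",   # cultural norms
    "sociotechnical",  # regulations, insurance, legal policy, etc.
]

# Findings from the pump example, tagged with a level (tagging is illustrative).
FINDINGS = [
    ("physical", "Parallel misalignment of the shaft"),
    ("decision", "Technician skipped alignment steps under production time pressure"),
    ("latent_system", "No annual review and inspection of alignment tools"),
    ("latent_system", "No formal alignment training after the qualified technician retired"),
    ("cultural_norm", "Technicians do not feel safe saying they lack a skill"),
]


def split_findings(
    findings: Iterable[Tuple[str, str]], influence: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split findings into those the analyst can act on and those to hand off."""
    act, hand_off = [], []
    for level, text in findings:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        (act if level in influence else hand_off).append(text)
    return act, hand_off


# A mechanic with influence only over the physics of the failure.
act, hand_off = split_findings(FINDINGS, {"physical"})
print("Act on:", act)
print("Hand off:", hand_off)
```

Passing a wider set, for example `{"latent_system", "cultural_norm"}`, models someone with the clout to address the system and cultural findings. The analysis itself does not change; only who owns the countermeasures does.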
Sorry for the length of this one, but I wanted to carry you through the complete thought process of an RCA, using the approach we have been using for decades, PROACT. I understand that not all 'RCA' is the same and that the term itself is misleading.
Thanks for your time, patience and interest!
For additional resources related to this topic (not salesy stuff, but free useful stuff):
PROACTOnDemand Demo Accounts (used to manage investigations)
About the Author
Robert (Bob) J. Latino is CEO of Reliability Center, Inc., a company that helps teams and companies do RCAs with excellence. Bob has been facilitating RCA and FMEA analyses with his clientele around the world for over 35 years and has taught over 10,000 students in the PROACT® methodology.
Bob is co-author of numerous articles, has led seminars and workshops on FMEA, Opportunity Analysis and RCA, and is co-designer of the award-winning PROACT® Investigation Management Software solution. He has authored or co-authored six (6) books related to RCA and Reliability in both manufacturing and healthcare, and is a frequent speaker on the topic at domestic and international trade conferences.
Bob has applied the PROACT® methodology to a diverse set of problems and industries, including a published paper in the field of Counter Terrorism entitled, "The Application of PROACT® RCA to Terrorism/Counter Terrorism Related Events."