Does procedural and/or regulatory compliance with RCA guidelines ensure Operational Reliability? Does it ensure improved Safety? Operational Reliability involves the aggregation of Equipment, Process and Human Reliability methods and techniques.
What is the difference between troubleshooting, problem solving and ‘RCA’? Are the outcomes different when we use The 5-Whys, The Fishbone or a Logic Tree/Causal Factor Type Tree?
Can deficiencies in our approach to RCA increase the risk of excessive downtime? These questions will be discussed in depth and contrasted using a common example to determine if we are applying a form of Root Cause Analysis or Shallow Cause Analysis.
“Cause and effect, means and ends, seed and fruit, cannot be severed; for the effect already blooms in the cause, the end preexists in the means, the fruit in the seed.” - Ralph Waldo Emerson
We will start this discussion with a quote from an article in Quality Digest:
“Is the healthcare industry in denial when it comes to practicing Six Sigma? The answer, unfortunately, is yes. Although the industry is slowly adopting the methodology, the majority of these initiatives aren’t designed to improve the quality of the medical treatment offered to patients. Instead, most health organizations focus on improving care from the administrative side. As a result, patients aren’t getting the quality improvements to which they are entitled. The real issues facing healthcare are ignored due to medical practitioners who are afraid to admit that the lack of quality care is a result of their own errors and inefficiencies.”
While this article focused on the specific application of Six Sigma in healthcare, it could just as well have been written about Root Cause Analysis (RCA) in industry. The driving force behind statements like the one above is that regulatory compliance is being confused with Operational Reliability. We are being led to believe that if our RCA efforts are compliant, then the operation is more reliable (and thus safe). As the quote above illustrates, we can be compliant and still not affect the Reliability of our operations. That defeats the intent of the applicable regulations. If it does not, then the regulation itself has loopholes. The question boils down to this: if we pass a regulatory audit of our investigative practices, does that ensure the operation is any more reliable or safe? NO.
Let’s take the concept of ISO-9000 compliance. The usual mantra is “write what you do, do what you write”. This does not mean that what you wrote was correct. However, if you follow an incorrect procedure, you are compliant. Is your operation any more reliable or safe, as a result?
Many who read this paper will be able to reflect on their own experiences under such conditions. They will read, think back and realize that success was tied to passing the audit rather than to how the ‘RCA’ effort made the operation more reliable and safer. As a result, the concept of true Root Cause Analysis has been replaced with the concept of Shallow Cause Analysis.
Shallow Cause Analysis (SCA) represents a less disciplined approach to Operational Reliability than holistic Root Cause Analysis (RCA). Many of the tools on the market today that are referred to as Root Cause Analysis fall short of the essential elements of an RCA. Typical tools in this category are the 5-Whys, the fishbone diagram and many form-based RCA checklists. Many of these tools came from the Quality initiatives that flourished in the 1970s and 1980s and remain ingrained in American corporations today.
We refer to these as tools and, just as with the tools in a toolbox, we must use the right tool for the right project. Therefore, we must have a clear understanding of the scope of the project before deciding which tool is most appropriate.
When determining the breadth and depth of analysis required, we must explore the magnitude and severity of the undesirable event at hand. Typically, we would not conduct formal RCA on events, but rather their consequences. If we have an event occur, an undesirable outcome of some sort, then its priority is usually proportional to the severity of its consequences.
When is it appropriate to use brainstorming versus troubleshooting versus problem solving versus RCA? While a hundred definitions likely exist for each of these terms, we choose to use the following ones:
Brainstorming: A technique teams use to generate ideas on a particular subject. Each person in the team is asked to think creatively and write down as many ideas as possible. The ideas are not discussed or reviewed until after the brainstorming session.
Troubleshooting: To identify the source of a problem and apply a solution to “fix it.”
Problem Solving: The act of defining a problem; determining the cause of the problem; identifying, prioritizing and selecting alternatives for a solution; and implementing a solution.
Root Cause Analysis: The establishing of logically complete, evidence based, tightly-coupled chains of factors from the least acceptable consequences to the deepest significant underlying causes.
In order to recognize what is Root Cause Analysis and what is NOT Root Cause Analysis (Shallow Cause Analysis), we would have to define what criteria must be met in order for a process and its tools to qualify as Root Cause Analysis. The following are what we consider the essential elements of a true Root Cause Analysis process:
Brainstorming. This is traditionally where a collection of experts throws out ideas as to the potential causes of a particular event. Usually, such sessions are not structured in a manner that explores cause-and-effect relationships. Rather, people just express their opinions and come to a consensus on solutions. When comparing this approach to the essential elements of RCA, brainstorming falls short of the criteria to be called RCA and therefore falls into the Shallow Cause Analysis category.
Troubleshooting. This is usually a “band-aid” type of approach to fixing a situation quickly and restoring the status quo. Typically, troubleshooting is done by individuals as opposed to teams and requires little to no proof or evidence to back up assumptions. This off-the-cuff process is often referred to as RCA, but clearly falls short of the criteria to qualify as RCA.
Problem Solving. This comes the closest to meeting the RCA criteria. Problem solving is usually team-based and uses structured tools. Some of these tools may be cause-and-effect based; some may not be. Problem solving oftentimes falls short of the RCA criteria because it does not require evidence to back up what the team members hypothesize.
‘When assumption is permitted to fly as fact in a process, it is not RCA.’
Figure 1: Comparison of Analytical Processes to RCA Essential Elements
The goal of this description is not to teach how to use these tools properly, but to demonstrate how they can lack breadth and depth of approach.
Analytical tools are only as good as their users or, put another way, ‘an analysis can only be as good as the analyst’.
Used properly, any of these tools can be used comprehensively to produce desired results. However, experience shows the attractiveness of these tools is actually their drawback as well. These tools are typically attractive because they are quick to produce a result, require few resources and are inexpensive. These are the very same reasons they often lack breadth and depth.
The 5-Whys. Let’s start here. While there are varying forms of this simplistic approach, the most common understanding is that the analyst asks the question “WHY?” five times and, by doing so, uncovers the root cause.
This approach may take the following form:
Figure 2: The 5-Why’s Analytical Tool
There is a reason we do not see NTSB investigators showing the 5-Why approach at a press conference after an accident. The main flaw with this concept is that failure does not always occur in a linear pattern (rarely, if ever, based on my 35+ years in the business). Multiple factors combine in parallel, then converge at some point in time to allow the undesirable outcomes to occur.
‘Also, there is almost never a single root cause, and this is a misleading aspect of this approach.’
People tend to use this tool by themselves and not in a team, and rarely back up their assertions with sound evidence.
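The difference between a linear chain and converging factors can be sketched in a few lines of Python. This is a minimal illustration only: the cause map and the names `five_whys` and `all_deepest_causes` are hypothetical and are not taken from any RCA tool. It shows how following one answer per ‘Why’ stops at a single cause, while following every answer reaches all of them.

```python
def five_whys(cause_of, event, depth=5):
    """Follow a single answer per 'Why', as the 5-Whys typically does."""
    chain = [event]
    current = event
    for _ in range(depth):
        answers = cause_of.get(current, [])
        if not answers:
            break
        current = answers[0]  # only the first answer is pursued
        chain.append(current)
    return chain


def all_deepest_causes(cause_of, event):
    """Follow every answer and return all causes that have no deeper cause."""
    roots, stack, seen = set(), [event], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        answers = cause_of.get(node, [])
        if not answers and node != event:
            roots.add(node)
        stack.extend(answers)
    return roots


# Hypothetical cause map: each key lists the factors that contributed to it.
cause_of = {
    "event": ["factor A", "factor B"],
    "factor A": ["factor A1"],
    "factor B": ["factor B1", "factor B2"],
}

linear = five_whys(cause_of, "event")
parallel = all_deepest_causes(cause_of, "event")
print(linear)            # one path, one 'root cause'
print(sorted(parallel))  # every converging path
```

In this made-up map the linear walk ends at one factor and never visits the two factors on the parallel branch.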
The fishbone diagram is also one of the most popular analytical Quality tools on the market. This approach gets its name from its form, which is the shape of a fish. The spine of the fish typically represents the sequence of events leading to the undesirable outcome. The fish bones themselves represent cause categories that should be evaluated as to having been a potential contributor to the sequence of events. These categories change from user to user. The most popular categories tend to be:
Figure 3: The Fishbone Diagram Sample
The fishbone is often a tool used for brainstorming. Team members decide on the categories and continue to ask what factors within the category caused the event to occur. Once these factors are identified then they ask why the factors occurred and so on.
As a brainstorming technique this tool is less likely to depend on evidence to support hypotheses and more likely to let hearsay fly as fact. This process is typically also not cause-and-effect based, but cause-category based. The users must pick the category set they wish to use and suggest ideas within that category. If the correct categories for the event at hand were not selected, key root causes could be missed.
The PROACT® Logic Tree is representative of a tool specifically designed for use within RCA. The logic tree is an expression of cause-and-effect relationships that queued up in a particular sequence, at a particular time, to cause an undesirable outcome to occur. These cause-and-effect relationships are validated with hard evidence as opposed to hearsay. The evidence leads the analysis, not the loudest expert in the room.
A logic tree starts off with a description of the facts associated with an event. These facts will comprise what is called the Top Box (the Event and the Modes). Modes are the manifestations of the failure and the Event is “the least acceptable consequences” that triggered the need for an RCA. While we may know what the Modes are, we do not know how they were permitted to occur. So, we proceed with the questioning of ‘How Could’ the Mode have occurred?
How Could vs Why. Many have been conditioned to ask the question ‘Why’ during such analyses. However, using this methodology the initial question used is ‘How could’ when exploring the physical aspects of the failure. When looking at the differences between these two questions we find that when simply asking ‘Why’ we are connoting a singular answer and to a point, an opinion. When asking ‘How Could’ we are seeking all the possibilities (not only the most likely) and evidence to back up what did and did not occur.
This questioning process is iterative as we follow the cause-and-effect chain backwards. Simply ask the questions, answer them with hypotheses and use evidence to back them up.
Human Roots. This holds true until we uncover the Human Roots or the points in which a human made a decision error. Human Roots represent errors of omission or commission by the human being. Either we did something we should not have, or we did not do something we should have done. At this point we are exploring the reasoning of ‘Why’ someone made the decision they did.
This is an important point in the analysis because we are seeking to understand why someone thought the decision they made was the correct one at the time. At this point in the analysis, we do switch the questioning to ‘Why’ because we are exploring a set of answers particular to an individual or group. We are seeking their reasoning.
Latent Roots. The answers are what we call Latent Root Causes: the organizational systems that are in place to help us make better decisions. The Latent Roots represent the rationale for the decision that triggered the consequences. They are called latent because they are always there, lying dormant. They require a human action to trigger them and, once triggered, they set off a sequence of Physical Root Causes. This error chain continues, if unbroken, until it results in an adverse outcome that requires an immediate response.
As this description shows, the logic tree approach is cause-and-effect based, requires evidence to back up what people say and requires depth: an understanding of the flaws in the systems that contributed to poor decisions.
The failure of a process to achieve its designed objective has to do with the design of the linkages between steps in the process: how the steps relate to one another – the hand-offs. It is the interrelationships that are themselves prone to failure and that propagate the effects of a failure to other parts of the process, often in ways that are unexpected (side effects) or not immediately evident (long-term effects). The logic tree’s strict adherence to graphically representing these tightly coupled relationships makes it more accurate than the other tools described.
Figure 4: The PROACT® Logic Tree
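The structure described above (Event, Mode, then physical, human and latent levels, with each hypothesis accepted or rejected on evidence) can be sketched as a small data structure. This is a rough illustration under stated assumptions, not the PROACT® or EasyRCA implementation: the `Node` class, its field names and the two helper functions are hypothetical, and every label marked "Placeholder" is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    level: str  # "event", "mode", "physical", "human" or "latent"
    verdict: Optional[bool] = None  # True/False once evidence is in, None if unverified
    evidence: str = ""
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def latent_roots(node: Node) -> List[str]:
    """Collect latent roots reached only through hypotheses proven TRUE."""
    if node.verdict is not True:
        return []
    found = [node.label] if node.level == "latent" else []
    for child in node.children:
        found.extend(latent_roots(child))
    return found


def unverified(node: Node) -> List[str]:
    """List every hypothesis that still has no evidence behind it."""
    found = [node.label] if node.verdict is None else []
    for child in node.children:
        found.extend(unverified(child))
    return found


# The Top Box holds facts, so its verdicts are True from the start.
top = Node("Repeated customer complaints", "event", True, "triggered the RCA")
mode = top.add(Node("Black specks in the solvent shipment", "mode", True,
                    "validated by the customer"))
# 'How could' hypotheses; everything labelled Placeholder is made up.
mode.add(Node("Tank truck loading operations", "physical", False,
              "no evidence of contamination"))
step = mode.add(Node("Placeholder process step", "physical", True,
                     "placeholder evidence"))
decision = step.add(Node("Placeholder decision error", "human", True,
                         "placeholder evidence"))
decision.add(Node("Placeholder system weakness", "latent", True,
                  "placeholder evidence"))
decision.add(Node("Placeholder unchecked idea", "latent"))  # no evidence yet

print(latent_roots(top))
print(unverified(top))
```

The design choice worth noting is that a hypothesis without a verdict never counts as a root cause; it is only reported as work still to be done, which mirrors the rule that assumption is not permitted to fly as fact.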
In addition to the commonly used approaches described above, many simply use form-based Root Cause Analysis. This is basically a one-size-fits-all mentality. It is root cause ‘by-the-numbers’, similar to painting-by-the-numbers. The same questions are asked no matter the incident, and opinions are often accepted as evidence. Checklists are often provided, which give people the false sense that the correct answer must be within the listed items.
No “pick-list” RCA process can ever be comprehensive enough to consider all the possibilities that could exist in every working environment.
However, the innate human tendency to follow the path of least resistance makes using picklists very attractive.
As the noted author Eli Goldratt says:
“An expert is not someone that gives you the answer, it is someone that asks you the right question”.
That is exactly what RCA is all about.
Many people choose to use form-based RCA systems because the regulatory authority seeking compliance provides them free of charge and suggests they be used. The paradigm is that “we are using their forms so we will have a better chance of complying if we use them”. This may indeed be true, but it does not mean the analysis was comprehensive enough to ensure the undesirable outcome will not recur. Hence, once again, compliance does not necessarily ensure operational Reliability or Safety!
All the aforementioned tools can either be applied manually using a paper-based system, or automated using some form or fashion of software. One point we need to make clear is that software IS NOT a panacea for any analysis. We liken this to Microsoft Word®: if you do not know English, it is of little value. The same holds true for RCA software: if the analyst does not understand proper investigative methodology and technique, software will be of little value.
Experience shows that most of the time such analyses are conducted using paper-based approaches (easel pad and sticky notes). This leads to a double handling of data and a time lag. After the team meeting, some poor soul must then re-input the data from the easel pads and sticky notes into an appropriate program (e.g., a word processor, graphics program or spreadsheet program). Then, usually about a week later, the information is disseminated to the team members for them to review and conduct their assigned tasks.
Once paper-based analyses are completed, they are presented, distributed and put into a flat file somewhere. One of the greatest advantages any organization can get from RCA is to raise the knowledge and skills of its workforce regarding how failures have occurred in the past. This is often referred to as lessons learned in the nuclear industry.
The primary value of software is to efficiently document and disseminate information. Technology is more effective than humans in enhancing process consistency and in receiving, storing, and processing information. Technology does not take shortcuts. It is not influenced by emotion. And it has the advantage of being a long-term improvement in contrast to risk-reduction strategies that, say, focus on staff retraining.
Reduction of Re-Work. Software can eliminate the double handling of data related to any analysis. Experience shows that this cuts the analysis time in half (on average), simply by conducting the analysis in a more efficient manner, getting information to people more quickly and reducing the amount of team member time required per analysis.
Institutionalizing Knowledge. Software also provides great flexibility in the storage of analyses. All analyses can be stored in a single database that can be mined for lessons learned (often called data mining). For instance, if we would like to search the database for all analyses conducted on motor failures on the digester in the wood yard, we can easily do so to see how others have approached a problem similar to the one we may be experiencing. Effective use of this sharing is often referred to as knowledge management or corporate memory.
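As a rough sketch of what such a lessons-learned search could look like, the example below stores a few analyses in an in-memory SQLite database using Python's standard `sqlite3` module. The table name, column names and the `find_analyses` helper are assumptions made for illustration, and every row is placeholder data; no particular RCA product stores its analyses this way.

```python
import sqlite3

# In-memory database standing in for the organization's analysis archive.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE analyses (
        id        INTEGER PRIMARY KEY,
        area      TEXT NOT NULL,
        asset     TEXT NOT NULL,
        component TEXT NOT NULL,
        summary   TEXT NOT NULL
    )
    """
)

# Placeholder rows; real entries would come from completed analyses.
rows = [
    ("wood yard", "digester", "motor", "placeholder summary 1"),
    ("wood yard", "digester", "pump", "placeholder summary 2"),
    ("pulp mill", "refiner", "motor", "placeholder summary 3"),
    ("wood yard", "digester", "motor", "placeholder summary 4"),
]
conn.executemany(
    "INSERT INTO analyses (area, asset, component, summary) VALUES (?, ?, ?, ?)",
    rows,
)


def find_analyses(conn, area, asset, component):
    """Return (id, summary) for past analyses matching the given equipment."""
    cursor = conn.execute(
        "SELECT id, summary FROM analyses "
        "WHERE area = ? AND asset = ? AND component = ? ORDER BY id",
        (area, asset, component),
    )
    return cursor.fetchall()


matches = find_analyses(conn, "wood yard", "digester", "motor")
for analysis_id, summary in matches:
    print(analysis_id, summary)
```

The query parameters are passed separately from the SQL text, so a search term typed by an analyst cannot alter the query itself.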
Potential Technology Disadvantages. However, as with all advantages there come some disadvantages. Technology itself can intimidate people and create a resistance to their using it. We tend to trust humans as opposed to machines. For instance, “pilots tend to listen to the air traffic controller (as opposed to messages they receive from a machine) because they trust a human being and know that a person wants to keep them safe.”
The Tool Is Only as Good as the Craftsman. No matter the analytical process used, the tools employed in the execution or the technology used, if the craftsman [analyst] using the tool is not educated properly, the tool will not function to its fullest capability. Analysts must have a complete understanding of the difference between a Shallow Cause Analysis and a Root Cause Analysis. Without knowing the differences, how can they be sure they are being credible and thorough? If they are not sure they have captured all of the contributing causes, they cannot ensure the undesirable outcome will not happen again. Analysts must also have the desire and the will to find the whole truth and settle for nothing less. The problem with this purist approach is that many in the organization do not want to know the truth – that is another paper!
Case Study Background: XYZ Company was receiving numerous complaints from a particular customer about contamination of their delivered product (solvent), which had visible black ‘specks’. This was unacceptable and the delivery was refused and returned by the client.
Let’s review this case and apply the 5-Whys, Fishbone and Logic Tree Approaches. This was actually done as a test with this particular client using 3 different teams. These are the results.
In this case ‘Why’ is asked five times after the Event, and the analysis concludes with a single cause of ‘Perceived as Not Required’.
Figure 5: 5-Whys of Customer Complaint Case
In this case the team applied the 6-M version of the Fishbone Diagram. Under the categories chosen, the following findings were concluded about the case…even though the team did not ask for evidence to support these conclusions. Also, it should be noted, that all of the teams were afforded the opportunity to ask for more evidence if they felt it would help their analysis to be more comprehensive.
Figure 6: Fishbone Diagram of Customer Complaint Case
The PROACT® Logic Tree (Using EasyRCA Software Solution)
In this case, the team applied the PROACT® RCA Methodology to the same case.
I will describe their thought process simply using the consistent questioning process of this approach.
EVENT: Repeated Customer Complaints (This triggered the need for the RCA)
MODE: Black Specks in the Solvent Shipment (Validated fact by the customer)
1st Level of Hypothesis Questioning: How could black specks have gotten into the solvent shipment?
The Identified Potential Hypotheses:
These are the only four (4) steps of the operation where the product could have been contaminated.
Figure 7: Logic Tree of Customer Complaint Case (Part 1)
The evidence the team requested for these hypotheses reveals that contamination was actually occurring at several steps of the process flow. However, there was no evidence of contamination entering the product from the tank truck loading operations, so that possibility is found to be NOT TRUE.
Now let’s take each of the hypotheses found to be TRUE (using sound evidence) and continue drilling down with the ‘How Could’ questioning.
Figure 8: Logic Tree of Customer Complaint Case (Part 2)
Again, let’s take each of the hypotheses found to be TRUE (using sound evidence) and continue drilling down with the ‘How Could’ questioning.
Figure 9: Logic Tree of Customer Complaint Case (Part 3)
Now let’s move on and explore how we could be getting specks in the solvent via the tank trucks that transport the product to the customer. Here we will continue drilling down with the ‘How Could’ questioning.
Figure 10: Logic Tree of Customer Complaint Case (Part 4)
Based on the above examples of the various tools applied to the same situation, we could construct a “filter” of what tools identified which ‘root causes’ by the respective teams. This demonstrates root causes and contributing factors that could be missed by not using the most appropriate tool for the magnitude of the event being analyzed.
Figure 11: Analysis Approach Comparison of Results
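A comparison “filter” of this kind amounts to a small matrix of causes against tools. The sketch below shows one way to build it. The function name `comparison_filter` is hypothetical and the cause labels are placeholders, not the actual findings of the three teams.

```python
def comparison_filter(findings):
    """Build a cause-by-tool matrix and list the causes each tool missed."""
    all_causes = sorted(set().union(*findings.values()))
    matrix = {
        cause: {tool: cause in found for tool, found in findings.items()}
        for cause in all_causes
    }
    missed = {
        tool: [cause for cause in all_causes if cause not in found]
        for tool, found in findings.items()
    }
    return matrix, missed


# Placeholder findings; the real entries would come from each team's analysis.
findings = {
    "5-Whys": {"cause 1"},
    "Fishbone": {"cause 1", "cause 2"},
    "Logic Tree": {"cause 1", "cause 2", "cause 3"},
}

matrix, missed = comparison_filter(findings)
for tool, causes in missed.items():
    print(tool, "missed:", causes)
```

Each row of `matrix` answers the question of which tools surfaced a given cause, and `missed` shows what each team would have left uncorrected.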
The use of the 5-Whys leads users to believe only one (1) root cause exists. Since evidence is not normally required to validate this string of logic, that one (1) cause may well be correct, but it is unlikely to be the only root cause involved.
The fishbone, while more exploratory than the 5-Whys, is a brainstorming technique that relies solely on the input of the team to serve as fact. Because it is not strictly cause-and-effect based, but category based, a path to failure is murky at best. Because hearsay is the primary source of evidence, the limited causes identified could also very well be wrong (similar to trial-and-error) and/or not comprehensive enough.
The PROACT® Logic Tree is more comprehensive because it attempts to “rewind the video” of the event happening. It starts with facts and reels backwards from that point. Evidence collected, not hearsay, determines what did and did not occur. The logic tree drills past the physical and human levels to uncover the systems issues, or latent root causes, that influenced decision-making. Without correcting the systemic issues, we run the risk of recurrence of the event somewhere, sometime. By correcting the systems issues, we correct the undesirable behaviors (decision-making processes) that triggered the physical consequences and, eventually, the undesirable outcome.
When evaluating which RCA processes are best for your organization, be sure not to let factors such as cost, minimum compliance, time and ease of analysis trump the important characteristics of value, comprehensiveness, operational Reliability, safety and efficiency. Otherwise, as the old adage goes, we will face the “pay me now or pay me later” scenario, and this is dangerous when lives are at stake.
About the Author
Robert (Bob) J. Latino is CEO of Reliability Center, Inc., a company that helps teams and companies do RCAs with excellence. Bob has been facilitating RCA and FMEA analyses with his clientele around the world for over 35 years and has taught over 10,000 students in the PROACT® methodology.
Bob is co-author of numerous articles, has led seminars and workshops on FMEA, Opportunity Analysis and RCA, and is co-designer of the award-winning PROACT® Investigation Management Software solution. He has authored or co-authored six (6) books related to RCA and Reliability in both manufacturing and healthcare and is a frequent speaker on the topic at domestic and international trade conferences.
Bob has applied the PROACT® methodology to a diverse set of problems and industries, including a published paper in the field of Counter Terrorism entitled, "The Application of PROACT® RCA to Terrorism/Counter Terrorism Related Events."