DIGGING DEEPER, WITH PRECISION
When a machine breaks down, a significant portion of the maintenance process involves learning why the failure occurred so that it can be avoided in the future. One effective practice in solving failures is root cause failure analysis (RCFA).
According to Mark Latino, president of North American operations for Reliability Center, Inc. (Hopewell, VA), RCFA helps “figure out the underlying cause of a failure, rather than just the symptoms. Then it solves the problem. Once the failure mechanism is identified by the finished analysis, the roots of the problem fall into one of three categories: physical, human, and latent.”
Sounds good, but here’s the catch: RCFA has been a buzzword in maintenance for over ten years. Over that time, maintenance professionals have developed a myriad of analytical methods and logic, each claiming to be the most effective way to eliminate failures. So what is the best method for performing RCFA?
“In a nutshell, the one that solves the problem,” says Latino. “More specifically, focus first on the problems that bring the highest return to the bottom line. Once a problem that needs solving is identified, the method chosen must preserve the failure data.” This means that people must be interviewed, parts must be collected for inspection, testing must be conducted, positions of people and parts during the event in question must be observed, and paper information – maintenance histories, operating procedures, etc. – must be readily available.
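One way to read Latino's advice to start with the problems that bring the highest return is to rank recurring failures by what they cost in a year. The sketch below is only an illustration of that idea: the `FailureEvent` class, the failure names, and every number in it are hypothetical and do not come from Reliability Center. It assumes annual loss can be approximated as occurrences per year multiplied by cost per occurrence.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    """A recurring failure and its estimated cost (all values hypothetical)."""
    name: str
    occurrences_per_year: float
    cost_per_occurrence: float  # lost production plus repair, in dollars

    @property
    def annual_loss(self) -> float:
        # Assumed model: yearly loss = frequency x cost of each occurrence.
        return self.occurrences_per_year * self.cost_per_occurrence


def rank_by_annual_loss(events):
    """Return events sorted from highest to lowest estimated annual loss."""
    return sorted(events, key=lambda e: e.annual_loss, reverse=True)


# Made-up example data for illustration only.
events = [
    FailureEvent("conveyor jam", 40, 500.0),
    FailureEvent("pump seal leak", 6, 8000.0),
    FailureEvent("press hydraulic failure", 1, 30000.0),
]

for event in rank_by_annual_loss(events):
    print(f"{event.name}: ${event.annual_loss:,.0f} per year")
```

In this made-up data the frequent but cheap conveyor jam ranks last, which is the point of weighing frequency and cost together rather than either one alone.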
Latino explains that a diverse team should be used during the data collection process, hypotheses generation, verification of hypotheses, and testing. The method analyzes all the data and the team generates a “logic tree” that starts with the problem definition, which is called the top box (see Figure 1).
“The top box is comprised of the event that occurred and the facts left behind as a result of the occurrence. Once it is all fact and solid, meaning if the modes are solved the event will no longer occur, all the possibilities of occurrence must be brainstormed with the team and any experts available,” explains Latino. “Hypotheses about the problem are created and the data collected are used to test the hypotheses. As hypotheses are proven or eliminated, the roots that caused the problem will be exposed. At this point it can be determined whether or not human error also contributed to the problem. Then it must be ascertained why the person or people made the decision that led to the problem so they can avoid making the same mistake twice.”
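The logic tree described here can be pictured as a simple data structure: a top box, hypotheses beneath it, and a status on each hypothesis once the collected data have been used to test it. The following is a minimal sketch under stated assumptions, not the method used by any RCFA software. The `Node` class, its field names, and the bearing-failure example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One box in a hypothetical logic tree."""
    statement: str
    status: str = "unverified"        # "verified", "eliminated", or "unverified"
    root_type: Optional[str] = None   # "physical", "human", or "latent"
    children: List["Node"] = field(default_factory=list)


def verified_roots(node):
    """Collect (root_type, statement) pairs along verified branches only.

    Eliminated or untested hypotheses stop the descent, mirroring the idea
    that only proven hypotheses lead to the next level of the tree.
    """
    if node.status != "verified":
        return []
    found = [(node.root_type, node.statement)] if node.root_type else []
    for child in node.children:
        found.extend(verified_roots(child))
    return found


# Made-up example: the top box is the event, backed by facts.
latent = Node("Alignment procedure never reviewed", "verified", "latent")
human = Node("Coupling misaligned at installation", "verified", "human", [latent])
fatigue = Node("Bearing failed from fatigue", "verified", "physical", [human])
corrosion = Node("Bearing failed from corrosion", "eliminated", "physical")
top_box = Node("Pump bearing failure", "verified", None, [fatigue, corrosion])

for root_type, statement in verified_roots(top_box):
    print(f"{root_type}: {statement}")
```

Walking the tree this way surfaces the physical, human, and latent roots in order, while the eliminated corrosion hypothesis drops out.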
Latino notes that, during this stage of the process, latent roots or system flaws should become apparent: accountability or proper supervision was not enforced within the company, necessary review systems were not enacted, procedures were not kept up to date and checked, and so on. “Usually these latent systems are necessary to help guide human behavior and reduce human error,” states Latino.
Finally, the best method is the one that allows the results and recommendations from the analysis to be reported to the proper authorities for final approval and implementation. “True root cause analysis cannot be attained without vigorous verification and a combination of people, parts, position, paper, and paradigm data,” adds Latino.
While all of this may sound like an unnecessarily complicated procedure, Latino emphasizes that, if applied properly, this work will deliver a huge payback. “When root causes are discovered, their correction justified and their elimination finalized, the facilities are raised to unprecedented high reliability that leads to reduced unscheduled downtime and greater productivity,” says Latino. “Unscheduled downtime on all equipment can and should be eliminated. RCFA remains the strongest tool in the reliability toolset for this job. It is dubbed ‘the learning tool’ and is the gateway from preventive and predictive technologies to a more precision-based environment.”
Failed part analysis is a critical component of RCFA. To get to the bottom of a failure, one must understand what role the physical failure of parts played. This is precisely where data preservation comes into play – the parts tell a story of the forces at play at the time the component failed. “Having this information allows the RCFA facilitator to either prove or disprove hypotheses related to parts, which enables movement to the next level in the logic tree,” explains Latino. “The failed parts can be examined by either internal or external experts that expose overpowering of material, fatigue, misalignment, erosion, corrosion, and a host of other problems.”
Oftentimes, a component fails due to simple human error, such as installing the wrong part or installing a part backwards. The best way to prevent components from failing is to understand how and why they have failed in the past, which is the job of RCFA. “Once the latent level is reached, future failures are prevented because system-wide problems are identified. These can range anywhere from inaccurate procedures to bad habits and cutting corners,” remarks Latino. “The best way to prevent failures is to ensure that all involved have the proper job training, the correct tools, enough time and the correct procedures to do the job, as well as a review and verification process for these factors.”
Procedure effectiveness analysis plays an important role in RCFA, because procedural steps must be correct at all times and written in such a way that a new employee can understand them and a long-time veteran can be comfortable with them. “In many cases, for those workers who are set in their ways a checklist may be the best way to ensure all procedures are properly followed to reduce failures,” states Latino. “Procedure analysis should be a normal business practice of review and verification at some interval – typically every two to three years – just like preventive maintenance. The RCFA process will identify procedures that failed to work when actually used prior to an undesirable event.”
Most importantly, the procedures involved in performing RCFA are being simplified by technology. “RCFA software speeds up the process tremendously and helps identify and calculate the value of losses that occur from failures,” claims Latino. “This has given companies the ability to trend root causes over numerous completed analyses in order to detect patterns.” Another advancement in RCFA is failure event templates that allow the RCFA facilitator to easily research similar failure events and view the logic tree for these events. These help the facilitator take into account other possibilities that the team may not have discovered, speed up the analysis and increase accuracy.
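Trending root causes across completed analyses amounts, at its simplest, to counting how often the same root turns up. The sketch below shows that idea using Python's standard `collections.Counter`. It is an illustration only and does not represent how any commercial RCFA package works; the record layout, the `trend_latent_roots` function, and the sample data are hypothetical.

```python
from collections import Counter

# Made-up records of completed analyses, for illustration only.
completed_analyses = [
    {"event": "pump bearing failure",
     "latent_roots": ["outdated procedure", "no review process"]},
    {"event": "gearbox overheating",
     "latent_roots": ["outdated procedure"]},
    {"event": "valve installed backwards",
     "latent_roots": ["insufficient training", "outdated procedure"]},
    {"event": "motor burnout",
     "latent_roots": ["no review process"]},
]


def trend_latent_roots(analyses):
    """Count in how many analyses each latent root appears, most common first."""
    counts = Counter()
    for analysis in analyses:
        # set() ensures a root is counted once per analysis, even if listed twice.
        counts.update(set(analysis["latent_roots"]))
    return counts.most_common()


for root, count in trend_latent_roots(completed_analyses):
    print(f"{root}: found in {count} analyses")
```

A root that recurs across unrelated events, such as the outdated procedure in this sample, is the kind of pattern that points to a system-wide problem rather than a one-off mistake.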
BEYOND RCFA? ANOTHER VIEW
RCFA is, without question, an effective failure-prevention approach. But is there more than one way to skin this cat?
An interview with C. Robert Nelms, founder and president of Failsafe Network, Inc. (Montebello, VA) explores failure prevention from a completely different perspective:
F&M: What role does RCFA play in maintenance? What is the best method available for performing RCFA with its associated costs?
Bob Nelms: I do not use the phrase “RCFA” and I have learned to avoid the use of root cause because it has become a nebulous, diluted, even meaningless phrase that adds confusion rather than clarity. At one time, I used the acronym “RCA” (deliberately omitting the term failure because it is normally associated with and limited to equipment issues). In fact, I no longer use “RCA” anymore, either.
Instead, I have learned to use “LCA” (Latent Cause Analysis) as a mindset that can be applied to anything that goes wrong, whether at work or home, and ends with two questions: “What is it about the way we are?” and “What is it about the way I am?” that contributed to this event?
As for the best method and the associated costs – I’m afraid I’ll be giving you a biased answer because I have developed a unique and effective organizational approach to learning from things that go wrong. My process involves the whole organization.
F&M: Let’s exchange “RCFA” for “LCA.” Does LCA usually produce the desired result? How effective is it in preventing unscheduled downtime, and other equipment maintenance-related problems?
Bob Nelms: Good question! LCA defines Physical Causes, Human Causes, and Latent Causes. Physical Causes are easily “fixed.” Human Causes are never “fixed.” Instead, Latent Causes are addressed. But Latent Causes answer the question, “What is it about the way I am that contributed to this incident, and what am I going to do about it?”
The effectiveness of the LCA method is somewhat, but not totally, dependent upon the people involved. Although LCA can certainly be performed on the causes of “unscheduled downtime, etc.,” the effectiveness of the exercise is somewhat limited to those people involved. They must desire to “change the way they see things” in order to change whatever needs to be changed.
F&M: What role does failed part analysis play in LCA and, in broader terms, lean maintenance? When it comes to equipment problems, how often is a failed part the culprit? How can one try to prevent parts from failing?
Bob Nelms: I’m sorry, but my experience with lean maintenance has not been good. I don’t like the word lean and what it implies. I have certainly been educated on what it is and how it works, and I suppose lean has its place. But LCA goes far, far beyond lean, into a deep and philosophical journey that prompts people to start wondering “Why do things go wrong?” In this journey, everything begins to be questioned – including things like “lean.”
As for failed part analysis, it always plays an important and vital role in LCA. Even more, all of the evidence plays a vital role. People know things, the physical manifestation tells us things, and our history tells us things. Failed parts, however, are never the culprit. Parts fail for reasons – and those reasons can always be traced to people. Ultimately, people cause everything that goes wrong.
In order to prevent parts from failing, we must know what people do that causes them to fail, then address those human aspects of the failure, such as “What is it about the way I am that contributed to this incident?” and “What am I going to do about it?”
F&M: What role does procedural effectiveness analysis play in LCA? How does one identify which procedures should undergo analysis?
Bob Nelms: When something goes wrong, we know it has gone wrong because it manifested itself in some way. The way something manifests itself is what we call evidence. An intense scrutiny of evidence always points to the problem areas of a business – whether it’s a procedure, policy, checklist, or more likely a “way of thinking.” We have a motto in our work: “Let the evidence guide you!!”
F&M: What are the most recent trends/advances in these fields?
Bob Nelms: Most RCA efforts have avoided the human being. That trend is changing – and changing fast. Human beings cause problems . . . not machines, nor even systems. The most progressive of RCA efforts focus relentlessly on helping human beings see themselves as part of their problems in a constructive and caring manner.
Kyndall Brown is the assistant editor of Fabricating & Metalworking magazine.
Reliability Center Incorporated, 501 Westover Avenue, Hopewell, VA 23860, 804-458-0645 Fax: 804-452-2119, www.Reliability.com.
Failsafe Network, Inc., PO Box 119, Montebello, VA 24464, 540-377-2010, Fax: 540-377-2009, www.failsafe-network.com.








