Today, it is rare to find a world-class manufacturing organization that doesn’t have a good handle on its problems. Reliable data in the CMMS and distributed control systems provides the evidence needed to analyze problems and eliminate them. Most organizations have a root cause analysis program in place and apply structured methods to events exceeding their threshold criteria.
An opportunity exists to step away from the smaller, day-to-day incidents that normally garner the majority of the organization’s attention and to elevate the scope and focus of problem-solving efforts onto a higher-value plateau: identifying and eliminating systemic causes.
Once we begin to expose systemic causes, their elimination will:
· Prevent recurrence of past problems
· Prevent future problems (Focusing on and eliminating systemic causes makes root cause analysis efforts proactive.)
· Yield massive savings and improvements (in many cases across multiple functions of the organization)
What Are Systemic Causes?
In the realm of problem-solving, a systemic cause is a deeper-seated cause, one that is less obvious and normally not identified in the typical root cause analysis. It is, as its name implies, rooted in the systems of the organization. Because systemic causes originate in an organization’s core infrastructure (physical, procedural and cultural), they are the less visible causes responsible for creating many other causes that are normally more visible to the organization. Left unchecked, systemic causes will always generate new problems, sustaining a higher level of ongoing, systemic risk within the organization.
Systemic causes drive people to make decisions, both successful and flawed. When people make mistakes, a systemic cause is frequently the reason for their apparently flawed decision. For example, if an organization has a “hero cookie” culture where the perceived reward for success (by taking a shortcut or whatever means necessary) is greater than the perceived penalty for failure, employees will routinely take shortcuts. Some shortcuts will succeed, but many will not, and some failures will be catastrophic.
For rigorous root cause analysis, if you want to understand what needs to be done to change behaviors, you need to understand the systemic causes. Familiar examples include the Space Shuttle Columbia accident investigation or the BP Texas City refinery explosion. In both investigations, the RCA teams explored the areas where systemic causes were at play, including the organizational cultures during the times leading up to the incidents.
Where Are Systemic Causes Found?
A systemic cause is normally prevalent across the entire organization and will show up far to the right on the cause and effect chart (in charts that move back in time to the right). Unfortunately, very few analyses ever drill deep enough to include systemic causes, yet they are contributors to all problems.
Systemic causes can be found in:
· HR systems
· Organizational culture
· Accepted practices
· Reward systems
· Work processes
· Management systems
For example, let’s say we examine a root cause analysis of a bearing failure caused by contaminated oil. The oil was contaminated by water, and the water escaped detection because there was no oil sampling. There was also no regularly scheduled replacement of the oil.
Once we start digging for causes, we’ll quickly find the causes for how the water entered the oil reservoir. This is normally the sole focus. But, in a systemic cause analysis, these will be intermediate causes. We need to drive beyond these causes to find out more about the entire oil monitoring program, among other things. As we dig deeper, the causes will now start to become more systemic. Frequently, in cases where we find “deficient oil monitoring programs,” if we continue to seek causes, we often find an “insufficient PM/PdM program” caused by “insufficient budgeting” and “low awareness of reliability program value.”
Normally in a root cause analysis, people stop looking for causes once they get to “contaminated oil.” The systemic cause of “low awareness of reliability program value” not only creates the problem of no oil sampling, but will also cause many other problems, such as “poor planning/scheduling,” “excessive stores inventory” and “high reactive workloads.”
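The bearing example can be written down as a small cause and effect chart and measured for depth. The Python sketch below is illustrative only: the dictionary layout and the function name cause_depths are assumptions made for this example, and the exact branching between causes is one plausible reading of the scenario rather than a chart from an actual analysis. The cause labels come from the text above.

```python
from collections import deque

# Each effect maps to the causes directly to its right on the chart.
# Labels follow the bearing example; the branching is an assumption.
CHART = {
    "bearing failure": ["contaminated oil"],
    "contaminated oil": ["water in oil", "no scheduled oil replacement"],
    "water in oil": ["no oil sampling"],
    "no oil sampling": ["deficient oil monitoring program"],
    "deficient oil monitoring program": ["insufficient PM/PdM program"],
    "insufficient PM/PdM program": [
        "insufficient budgeting",
        "low awareness of reliability program value",
    ],
}


def cause_depths(chart, focal_point):
    """Return how many levels to the right of the focal point each cause sits."""
    depths = {focal_point: 0}
    queue = deque([focal_point])
    while queue:
        effect = queue.popleft()
        for cause in chart.get(effect, []):
            if cause not in depths:  # keep the shallowest level found
                depths[cause] = depths[effect] + 1
                queue.append(cause)
    return depths


depths = cause_depths(CHART, "bearing failure")
for cause, depth in sorted(depths.items(), key=lambda item: item[1]):
    print(depth, cause)
```

In this sketch an analysis that stops at “contaminated oil” stops at level 1, while the systemic causes sit five levels further to the right.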
Let’s look at another common bearing failure scenario — bearings that routinely fail on a highly loaded process fan. Normally, we see organizations explore only the first- or second-level causes for the bearing failure.
The bearing manufacturer’s analysis determines there was an extremely high radial load on the failed bearing given the speed of the fan. The high radial load is caused by extremely high belt tensions on a four-belt sheave. The four-belt sheave is required to carry the process load, and the belts are maintained at a very high tension to prevent slippage. All components are at the limit of their capacity.
Normally at this point, the root cause analysis would stop and solutions would focus on beefing up the bearings, increasing the precision on the belt tensioning or installing alternative drive configurations. While all these solutions are good ideas and may solve the problem in the short term, a greater opportunity is being missed.
If we keep digging deeper for causes, we will find the causes for the high fan load, the seemingly undersized bearings, and so on. Asking “why” further could point out, for example, that no engineering review was done when the fan was installed. Perhaps it was pulled out of reclamation and installed as part of an ad-hoc, rush project where conventional reviews and protocol were skipped? Or maybe there was a significant change in the process operating conditions that greatly increased the fan loading, but no management of change (MOC) review was performed?
While changes in operational conditions are common, the fact that the “protective systems” like MOC and engineering reviews are cast aside when seemingly higher priority issues come up is a systemic problem (cause). When you uncover causes like “no engineering reviews,” or “no MOC,” it is common that the management systems to support these activities are lacking or inconsistently applied.
While there could be any number of potential systemic causes in the above example, the fact remains that there are more deeply rooted causes originating within the organization’s infrastructure that, left unchecked, will reproduce more problems of a similar nature in the future. (Incidentally, investigations of catastrophic events commonly uncover MOC reviews or management system checks that were ignored, not performed or non-existent.)
By uncovering deeper-seated systemic causes, you begin to see what is causing the common causes like “overloaded bearings” or “belts too tight.” In the previous scenario, implementing systemic solutions such as deploying an MOC process or requiring projects to “undergo an engineering review” when operating conditions change will be far more effective (and proactive) at eliminating these same types of failures than just fixing the individual problem at hand.
Why Do Systemic Causes Get Overlooked?
The three most common reasons why systemic causes never emerge from a root cause analysis are:
1. They remain unrecognized
2. The root cause analysis stops too soon — other, smaller causes are identified for solutions
3. People are afraid of including them because of political reasons
Why would someone be afraid to include a cause in a root cause analysis for political reasons? As you spend more time working with systemic causes, you will begin to see a pattern. Systemic causes are often created or enabled by:
· Internal politics
· Leadership strategies
As you can imagine, broaching the above issues could draw fire from your leadership if not handled delicately. They may view your analysis as “sticking your nose in their business” or an attack on their performance. Accordingly, you need to proceed cautiously.
Be prepared to back up your claim of a systemic cause with substantial data and evidence before you attempt to sway any leader’s opinion. As we will see later, this can be done more easily than you may expect by dusting off and pulling out a root cause analysis done in the past.
Ask leaders to serve on a secondary root cause analysis team, one that reviews the previously completed RCA and augments it, for the purpose of understanding if there are any systemic causes involved. If the leaders are part of the problem-solving team, they will be more open to the results that uncover the systemic causes.
Which Systemic Causes Do You Tackle and How Do You Secure Your Leaders’ Support?
Once you decide to pursue a systemic cause analysis, where do you start and what is the easiest way to determine which systemic causes to target? The answer may be easier than you think. The following process will not only help you identify high-value systemic causes to tackle, but will also help generate the leadership support needed for approval of resources necessary to solve the problem.
Steps to Identify Systemic Causes Using Root Cause Analysis
1. Select a high-level goal related to your functional area. This will become your “focal point” (starting point for your cause and effect chart).
For example, the goal might be to eliminate or reduce recent statistics like:
· “20% gap in equipment availability”
· “$1,500,000 in unplanned maintenance expenses for 2008”
· “15 OSHA recordable injuries in 2008”
· “23,000,000 pounds of off-spec product”
1a. (Variation on the above.) If you are using an enterprise root cause analysis application that identifies common causes for you, pull up the “Common Cause Report” and then skip to step 5 below.
2. Review all root cause analyses that contribute to the gap in item 1 above and group them together under the focal point. Don’t worry if the root cause analyses are for completely different events. You are looking for common causes that show up over and over in many different types of events. (If you don’t have the files electronically, you can still follow the same process with Post-It notes and flip chart paper. However, you will need to allocate more time for the analysis.)
3. Do a complete scan of the cause and effect charts that contribute to the focal point. Look for and highlight the common causes; they will be there.
4. Pareto your common causes and focus on the top three common causes found.
5. Commission a team of subject-matter experts (SMEs), a root cause analysis master facilitator (highly skilled facilitator) and one or two people from leadership. The goal of this team is to reactivate the root cause analysis process and drill deeper to understand what is causing the common causes to exist and how to mitigate them.
· The root cause analysis master facilitator will use “soft skills” as well as knowledge of the RCA process to effectively lead the group.
· The SMEs will provide the technical expertise.
· The leaders understand the details of the organization’s systems and will contribute the causes related to them. Leadership’s participation will also assure their personal ownership in the results, which will be critical in gaining support for the solutions.
6. Focus on the top three common causes by drilling further to the right for each of them. Target three more levels of causation beyond the common cause. The new causes added will be increasingly systemic the further to the right on the chart you go.
7. Identify solutions for the systemic causes.
8. Implement the solutions.
Implementing solutions for systemic causes not only eliminates existing problems but also prevents future ones. Left unchecked, systemic causes will continue to generate problems.
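Steps 2 through 4 above (group the analyses, scan for common causes, Pareto the top three) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any enterprise root cause analysis application: the event names, the cause labels and the function name top_common_causes are all hypothetical, and each completed analysis is reduced to a simple set of cause labels.

```python
from collections import Counter

# Hypothetical data: each completed RCA is reduced to the set of cause
# labels found on its cause and effect chart. All names are invented.
rcas = {
    "pump seal leak": {"no PM scheduled", "procedure not followed", "seal worn"},
    "fan bearing failure": {"no PM scheduled", "no MOC review", "belt overtensioned"},
    "tank overflow": {"level switch failed", "no PM scheduled", "procedure not followed"},
    "conveyor jam": {"procedure not followed", "no MOC review", "guard removed"},
}


def top_common_causes(rcas, n=3):
    """Count how many RCAs contain each cause and return the n most common."""
    counts = Counter()
    for causes in rcas.values():
        counts.update(causes)  # causes is a set, so each RCA counts a cause once
    # A cause is "common" only if it appears in more than one analysis.
    common = [(cause, k) for cause, k in counts.items() if k > 1]
    # Sort by frequency (descending), then alphabetically for a stable order.
    common.sort(key=lambda item: (-item[1], item[0]))
    return common[:n]


top_three = top_common_causes(rcas)
for cause, k in top_three:
    print(f"{cause}: found in {k} of {len(rcas)} analyses")
```

The causes returned by a ranking like this are the starting points for step 6. They are common causes, not yet systemic ones, and the team still has to drill to the right of each.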
Importance of “Near-Miss” Events in Systemic Cause Analysis
Many organizations underestimate the importance of “near-miss” events in determining common systemic causes. Often, it takes a large or even catastrophic event to get the attention of the organization. However, because systemic causes are woven into the organizational fabric, they often play a contributing role in all events – regardless of the actual consequences realized.
Near misses rarely get the same level of attention as do those events of higher value. They are considered to be dodged bullets – an opportunity to be thankful for a positive, lucky outcome. At best, they cause a brief, unsustainable change in behavior.
However, the reality is that near misses share many of the same causes as their catastrophic counterparts. Often, the only difference is in the timing or distance of the event. The bearing in the main pump failed while the secondary pump was unavailable. Or the speed sensor failed while the airplane was in flight. Maybe the spark was generated in the vicinity of a leaking flammable. However, almost all of the non-transitory (conditional) causes remain the same.
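The overlap between a near miss and its catastrophic counterpart can be illustrated by comparing their causes as sets. The cause labels in this Python sketch are hypothetical, loosely based on the pump example above, and are not taken from a real investigation.

```python
# Hypothetical cause sets for two events on the same pump system.
near_miss = {
    "no oil sampling",
    "run-to-failure strategy",
    "main pump bearing failed",
}
actual_event = {
    "no oil sampling",
    "run-to-failure strategy",
    "main pump bearing failed",
    "secondary pump unavailable",  # transitory cause: a matter of timing
}

# Causes present in both events: candidates for common and systemic causes.
shared = near_miss & actual_event
# Causes present only in the event that had real consequences.
only_in_event = actual_event - near_miss

print("Shared causes:", sorted(shared))
print("Only in the actual event:", sorted(only_in_event))
```

In this sketch the two events differ by a single transitory cause, and every remaining cause could have been found by analyzing the near miss alone.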
Following are three case studies of systemic cause analysis.
Case Study 1
A manufacturer of specialized parts experienced a repetitive series of incidents where parts would unexpectedly fly out of a lathe while machining work was being performed. These parts weighed anywhere from 10 to 100 pounds and were produced in short runs, often in small volumes. The nature of the parts required extremely tight tolerances. The cost ranged from $10,000 to $100,000 per part.
When a part is released from the lathe, it can be relatively benign and simply fall to the ground. Or it can fly through the air – landing up to 100 feet from the machine. Either way, there is a danger to employees, as well as to the part.
In this case, the team examined only five of the many previous events that had occurred over the years. (Interestingly, if you are doing a systemic cause analysis on a repetitive event — unlike doing a systemic analysis for a wide variety of events as described in the previous process — you do not have to do a deep dive examination on every adverse event. You can pull in a smaller sample of events to analyze.)
Since the parts were easily damaged, operators would use the lowest possible chuck pressure required to hold them in the lathe. In four out of five cases, evidence showed that the pressure actually applied to the part differed widely from the reading on the pressure gauge of the pneumatic chuck system. This was caused by a lack of maintenance on the machines, which led to a build-up of dirt and grime in the chucks.
The systemic cause analysis found that the chucks were not being maintained because the lathes were built for heavy-duty use in the automotive industry and this company used the machines for a much lighter-duty task. The company assumed this meant it could safely extend the maintenance interval.
Case Study 2
A specialty chemical company completed a Pareto analysis on exception codes on their quality incidents. The most frequently occurring exception code was “Test Result Out of Spec.” This related to the final quality check before shipment. Since their goal was to reduce the total number as well as relative severity of the most frequently occurring exception codes, they decided to examine a sample of these events to find out if they shared common systemic causes.
This case study took a different turn. In short, the company was not ready to perform a systemic cause analysis. Upon examination, the team recognized the relative weakness of their investigations; they didn’t know enough about their problems to proceed. None of the analyses went far enough to identify even the most basic causes, let alone common causes. There was little hope of finding systemic causes until more rigorous RCAs were completed.
The only common systemic issue identified was an incomplete/underdeveloped root cause analysis process. (By extension, it can be safely assumed that this problem likely extended into other analyses as well, not just quality.) The recommended course of action was to examine why the analyses were lacking detail. The answer was twofold:
1) Analysts had a great deal of technical experience in their respective areas of responsibility, but they had little formal investigation experience. They undervalued the importance of a formal causal analysis methodology and overestimated their own abilities.
2) Analysts were subject to confirmation bias. Once they chose the exception code that they felt best described each event, they tended to include information that supported their conclusion and exclude information that did not. This is an inherent risk of any system that relies on individuals to categorize exceptions or causes.
This team jumped the gun and wasn’t ready to perform a systemic cause analysis. The solution in the short term is to build proficiency in their root cause analysis program to the point where solid root cause analyses are in place that can be used in the future to do systemic cause analysis.
Case Study 3
A commodity chemical company wanted to understand the systemic causes related to the spills encountered at their site. A systemic analysis using the process described above was performed by consolidating 26 different root cause analyses on a wide variety of spills, with a cumulative total of over 1,400 causes. The goal of the analysis was to identify the systemic causes for the spills. The results of the systemic analysis would then be used to develop the specific site environmental goal for the following year.
Four systemic cause themes emerged:
1) Reliability
2) Acceptable risk/business decision
3) Operating discipline
4) Engineering design/controls (This did not contribute as many systemic causes as expected. Further, the common causes found were low in frequency and significance. The company has a strong engineering culture where lessons learned are shared effectively and institutionalized.)
The most frequently found cause was hidden failures on level switches and transmitters. These occurred in approximately one-third of all spills analyzed in the systemic cause analysis. Most of the hidden failures were in low-severity service, so essentially none had a scheduled PM or online diagnostics to check for hidden failures. Whether it was by design or omission, a run-to-failure maintenance strategy existed for most of these hidden failures, which was caused by the systemic cause of “cost optimization.”
The acceptable risk/business decision category could have easily been integrated with the reliability category. It involved essentially the same themes. The exception was that different leaders were responsible for setting the direction in each respective area. By separating the categories, a different focus would hopefully be created. Included in this category were leaks generated by failed piping, vessels and equipment. Most of these leaks were in low-severity service, and a run-to-failure strategy was cited as the cause for no failure-detecting PMs. Like the level devices, the systemic cause for no PMs was “cost optimization.”
The data from the systemic analysis now provided a basis for leaders to re-evaluate the run-to-failure strategy on the level devices and process containment systems, since both were contributing greatly to the undesirable environmental performance against goals. The frequency of the common causes necessitated a second look that would probably not have been given had the systemic analysis not been performed.
The results of the systemic analysis were used to develop the site environmental goal for the following year. In the first year after the systemic cause analysis was performed, the frequency rate of leaks was reduced by 20 percent.
Systemic causes are constants in the causal equation, regardless of outcome. They represent an elevated level of systemic risk. Systemic causes can be found by identifying common causes in actual events (or they can be found in most near-miss events as well) and drilling deeper. Systemic causes always lie to the right of common causes. Elimination of systemic causes prevents future incidents before they occur.
As an organization matures in its capability to conduct systemic causal analysis, it will make the shift from reactive to proactive root cause analysis. It will recognize the importance of examining past events for common, systemic causes. By identifying and mitigating these systemic causes, the organization will in effect reduce its total systemic risk. This reduction leads to a real reduction in both the frequency and severity of adverse events in the future, supporting more competitive long-term performance.
We previously published this article in the Reliable Plant 2016 Conference Proceedings.
By Chris Eckert, Sologic, and Bill Lyons, Holcim