RCA in IT: Root Cause Analysis for IT Environments

principles of root cause problem solving using fault diagnostics for troubleshooting

In the world of technology and software development, you are always trying out something new—only to test it again. Engineers learn from their mistakes and use them to grow their skillsets and improve processes. But some mistakes, like a major network or infrastructure failure, are less forgiving. The result of these unintended problems is a thing of nightmares.

Fortunately, a systematic approach is available to help engineers and developers trace a problem back to its origin and discover what went wrong: root cause analysis (RCA).

RCA also integrates seamlessly with AIOps, enhancing predictive analysis and automated resolution in IT environments.

In this article, we’ll look at RCA in IT environments, including:

  • Defining RCA and why it might be necessary
  • Exploring RCA strategies, including the 5 whys
  • Understanding the many benefits of RCA

What is root cause analysis?

Root cause analysis (RCA) is a systematic process for finding and identifying the root cause of a problem or event.

RCA is based on the basic idea that having a truly effective system means more than just putting out fires all day. That’s why RCA starts with figuring out how, where, and why the issue appeared. Then it goes further: RCA strives to respond to that answer—in order to prevent it from happening again.

Originating in the field of aeronautical engineering, this method is now applied in virtually every industry, but with particular focus and benefits in software development. Finding the root cause of a software or infrastructure problem is a highly effective, quality engineering technique that many industries already mandate in their governance.

Root cause analysis is considered a reactive management approach. In the ITIL® framework for service management, for instance, incident management is a reactive move where you’re responding to a critical incident. Problem management, on the other hand, is a proactive approach wherein you’re seeking out problems to address. (Learn more in Incident Management vs Problem Management.)

Why is root cause analysis necessary?

RCA delivers a wide range of advantages (detailed below), but it is especially beneficial in the continuous atmosphere of software development and information technology for several reasons:

  • Focuses on cause, not symptoms: RCA pinpoints the factors that contribute to the problem or event, helping to find the actual cause of the problem as opposed to just fixing resulting symptoms. Its depth also helps to avoid singling out one issue over others for a quick fix.
  • Reduces cost and time: By catching problems early, RCA significantly reduces cost and time spent, enabling developers to maintain an agile development environment and drive process improvement.
  • Improves system reliability: Delving into the root cause of issues enhances the reliability and performance of IT systems.
  • Promotes proactive problem management: RCA identifies potential issues before they escalate, allowing for proactive measures and reducing the likelihood of major disruptions.
  • Encourages a culture of learning: It fosters a culture of continuous improvement and learning within IT teams, improving their problem-solving skills.
  • Enhances customer satisfaction: Stable and reliable services, achieved through effective RCA, lead to higher customer satisfaction and trust.
  • Optimizes resource utilization: By avoiding repetitive troubleshooting and temporary fixes, RCA is cost-effective in the long run.
  • Facilitates risk mitigation: RCA aids in identifying and mitigating risks, thereby reducing the severity and frequency of incidents or system failures.

Although performing root cause analysis might feel time-consuming, the opportunity to eliminate or mitigate risks and root causes is undeniably worthwhile.

RCA principles

Some of the basic principles of RCA can help organizations ensure they are following the correct methodology:

  • Focusing on corrective measures of root causes is more effective than simply treating the symptoms of a problem or event.
  • Effective RCA is accomplished through a systematic process with evidence-backed conclusions.
  • There is usually more than one root cause for a problem or event.
  • The focus of RCA, via problem identification, is WHY the event occurred—not who made the error.

How to perform root cause analysis

The specific map of root cause analysis may look slightly different across organizations and industries. But here are the most common steps, in order, to perform RCA:

Let’s look at these steps in detail.

  • Define the problem. When a problem or event arises, your first move is to contain or isolate all suspected parts of the problem, which keeps it from spreading while you investigate.
  • Gather data. Once you find the problem, compile all data and evidence related to the specific issue to begin understanding what might be the cause.
  • Identify any contributing issues. You might have hands-on experience or stories from others that indicate any additional issues.
  • Determine root cause. Here’s where your root cause analysis really occurs. You can use a variety of RCA techniques (detailed below). Each technique helps you search for small clues that may reveal the root cause, allowing the person or team to correctly identify what went wrong.
  • Implement the solution. Determining the root cause will likely indicate one or several solutions. You might be able to implement the solution right away. Or, the solution might require some additional work. Either way, RCA isn’t done until you’ve implemented a solution.
  • Document actions taken. After you’ve identified and solved the root problem, document the problem and the overall resolution so that future engineers can use it as a resource.
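
The steps above can be sketched as a small record that travels with an investigation. This is only an illustration: the class name `RcaRecord`, its field names, and the sample incident are assumptions made for this example, not part of any standard or tool.

```python
from dataclasses import dataclass, field


@dataclass
class RcaRecord:
    """One RCA write-up; field names are illustrative, not a standard."""
    problem: str                                                    # step 1: define the problem
    evidence: list[str] = field(default_factory=list)              # step 2: gather data
    contributing_issues: list[str] = field(default_factory=list)   # step 3: contributing issues
    root_causes: list[str] = field(default_factory=list)           # step 4: determine root cause
    solutions: list[str] = field(default_factory=list)             # step 5: implement the solution
    documented: bool = False                                        # step 6: document actions taken

    def is_complete(self) -> bool:
        """Done only when a root cause is found, a fix is in place, and it is written up."""
        return bool(self.root_causes) and bool(self.solutions) and self.documented


# Hypothetical incident used only to exercise the record.
record = RcaRecord(problem="Checkout API returned errors for 20 minutes")
record.evidence.append("Error-rate spike in gateway logs")
record.root_causes.append("Connection pool not resized after traffic growth")
print(record.is_complete())  # False: nothing implemented or documented yet

record.solutions.append("Raise pool size and add a saturation alert")
record.documented = True
print(record.is_complete())  # True
```

Keeping the completion check on the record mirrors the point made in the steps: the analysis is not finished when the cause is found, only when the fix is in and written down.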

Even if you don’t expect the problem to occur again, plan as if it will.

Remember, for RCA to be effective, the team must recognize that processes cause problems, not people. Pointing fingers and placing blame on specific workers will not solve anything.

(Learn more about the importance of a blameless culture when performing an incident postmortem, the final step of your root cause analysis.)

Methods for root cause analysis

You can perform RCA using a variety of techniques. We highlight several well-known RCA techniques below; use the one that fits your specific situation. Here’s a simple distinction among four of them:

  • Five whys analysis is good for initial troubleshooting.
  • Ishikawa (fishbone) diagrams are helpful for identifying all possible root causes for a situation.
  • Pareto charts help you prioritize which root causes should be addressed first, based on how often each identified root cause occurs.
  • Scatter plots are helpful in situations where you can identify and collect data on fluctuating variables that are related to the problem you are studying.

Take a look at these options and consider which might be best for your situation:

The 5 whys analysis

One of the easiest and most common tools for conducting a root cause analysis is the “five whys” method. Mimicking curious children, the five whys method literally suggests that you ask “why?” five times in a row in order to identify the root cause of any process or problem.

Five whys analysis is effective because it is easy to use, and it works best for problems where there is a single root cause.

Although the method seems explicit enough, this approach is still meant to be flexible depending on the scenario. Sometimes five whys will be enough. Other times, you’ll need to ask why a few more times. You could also use additional techniques to identify the root cause.

To begin this method, follow this outline:

  • Write down the specific problem that needs to be fixed, describing it completely.
  • Ask why the problem happened. Write the answer below.
  • If your first question did not find the root cause, ask why again and write that answer down.
  • Continue this process until the team agrees you’ve identified the root cause of the problem.

(See the five whys in action with a simple RCA example, below.)

Pareto charts

Pareto charts identify the most significant factor among a large set of factors causing a problem or event. A Pareto chart is a combined bar and line chart, where the factors are plotted as bars arranged in descending order. The chart is accompanied by a line graph showing the cumulative totals of each factor, left to right.
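
The numbers behind a Pareto chart are simple to compute before any plotting. The sketch below sorts causes by count and adds the cumulative percentage that the line graph would show. The function name `pareto_table` and the incident counts are made up for illustration.

```python
def pareto_table(counts):
    """Return (cause, count, cumulative_percent) rows, largest count first."""
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Hypothetical incident counts per root cause (made-up numbers).
incident_counts = {
    "expired certificate": 4,
    "config drift": 42,
    "disk full": 18,
    "bad deploy": 30,
    "DNS failure": 6,
}

for cause, count, cumulative in pareto_table(incident_counts):
    print(f"{cause:<20} {count:>3} {cumulative:>6}%")
```

With these sample counts, the first two causes account for 72% of incidents, which is the kind of concentration the chart is meant to expose.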

Ishikawa diagrams

You might know the Ishikawa diagram by other names: the fishbone, the herringbone, the cause-and-effect, and, our favorite, the Fishikawa diagram.

Ishikawa Diagram

The Ishikawa diagram is a great visualization tool for brainstorming and discovering multiple root causes. It is shaped like a fish skeleton, with the head on the right and the possible causes shown as fishbones to the left.

Scatter diagrams (plots)

Scatter diagrams, or scatter plots, graph pairs of numerical data to reveal relationships between variables, often with regression analysis to quantify the trend. This is helpful to identify problems and events that occur because of fluctuating measurements, such as capacity issues that happen when server traffic increases.
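
Before drawing a scatter plot, it can help to check numerically whether two measurements move together. The following sketch computes a Pearson correlation and a least-squares line in plain Python. The function names and the traffic and response-time samples are hypothetical, chosen only to mirror the server-traffic example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Hypothetical samples: requests per second vs. response time in ms.
requests_per_sec = [100, 200, 300, 400, 500]
response_ms = [120, 135, 160, 210, 290]

r = pearson_r(requests_per_sec, response_ms)
slope, intercept = fit_line(requests_per_sec, response_ms)
print(f"r = {r:.2f}, slope = {slope:.3f} ms per request/s")
```

A correlation close to 1 suggests response time rises with traffic, but it does not prove causation on its own; it only tells you where to look next.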

(Learn how to create your own scatter plots using Matplotlib.)

FMEA (failure mode and effects analysis)

FMEA is a systematic, step-by-step approach for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. It’s particularly valuable for preemptively addressing potential failures and enhancing reliability. FMEA focuses on identifying failure modes along with their causes and effects, enabling teams to prioritize the risks and implement effective control measures. This process is instrumental in improving safety, increasing customer satisfaction, and reducing costs by catching issues early in the development cycle.
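
One common way FMEA worksheets prioritize risks is a risk priority number (RPN): the product of severity, occurrence, and detection ratings, each usually scored from 1 to 10. The sketch below ranks failure modes by RPN. The class name, the failure modes, and every rating are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (always caught) to 10 (never caught)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes and ratings for a web service.
failure_modes = [
    FailureMode("log disk fills up", 5, 6, 3),
    FailureMode("TLS certificate expires", 8, 3, 2),
    FailureMode("database failover does not trigger", 9, 2, 6),
]

# Highest RPN first: these are the failure modes to address first.
for mode in sorted(failure_modes, key=lambda m: m.rpn, reverse=True):
    print(f"{mode.rpn:>4}  {mode.name}")
```

RPN is a ranking aid rather than a measurement, so teams often also review any failure mode with a very high severity rating regardless of its product.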

Fault tree analysis (FTA)

FTA is a top-down, deductive analytical method used to identify and analyze the potential causes of system failures. Starting with a known problem (or “top event”), FTA uses logic diagrams to map out the various intersecting paths that could lead to the failure. This approach is essential for understanding complex systems, where multiple factors may interact to cause a failure. FTA is widely employed in safety engineering and reliability engineering to anticipate potential problems, thus aiding in the development of more reliable and safer systems.
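
The logic diagrams used in FTA combine basic events through AND and OR gates. As a minimal sketch, a fault tree can be written as nested tuples and evaluated recursively. The tree, the event names, and the `evaluate` helper below are all hypothetical and exist only to show the idea.

```python
def evaluate(node, basic_events):
    """Return True if the event described by `node` occurs.

    A node is either a basic-event name (str) or a tuple
    (gate, child, child, ...) where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return basic_events[node]
    gate, *children = node
    results = [evaluate(child, basic_events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate}")


# Hypothetical tree: the site is down if the load balancer fails,
# or if both application servers fail at the same time.
site_down = ("OR", "load_balancer_failed", ("AND", "app1_failed", "app2_failed"))

state = {"load_balancer_failed": False, "app1_failed": True, "app2_failed": False}
print(evaluate(site_down, state))  # False: one app server is still up
```

Writing the tree as data makes it easy to test different combinations of basic events and see which ones are enough to trigger the top event.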

RCA example using five whys analysis

Here is a simple five whys analysis where we try to determine why a computer is not turning on. At each step, we ask why the computer is not turned on. We gather data as we follow the power flow, until we finally determine that the power strip the computer is plugged into is turned off.

Here’s what the user has reported: Their desktop computer is not turning on. The monitor is turned on, but the user does not hear the computer fan running, and there are no power lights.

Root cause analysis using the 5 Whys to troubleshoot a computer that won’t turn on

Resolution: Technician turned on the surge protector and the computer came back on again.

Benefits of root cause analysis

The main benefit of root cause analysis is obvious: identifying problems so you can solve them. RCA offers plenty more benefits that help to solidify its usefulness and importance in the tech environment.

Solve real-world problems

When the right employees get the right RCA and resolution training, you’ll execute correct processes and solve common business problems.

Lower costs

When you catch problems quickly, you reduce the likelihood that those problems will turn into major incidents, especially when RCA is used to support an agile environment. RCA saves valuable employee time and helps ensure the organization doesn’t incur fines or other compromises.

Make the workplace safer

Employee safety is vital, and root cause analysis provides added peace of mind. By quickly and effectively investigating any safety incidents, you can put solutions into place to prevent anything similar from happening again down the line.

Implement effective, long-lasting solutions

When you follow RCA all the way through to final documentation, you focus on long-term prevention. It also shows that your organization prioritizes solutions, not speedy workarounds.

This forward thinking enables companies to become proactive and productive.

Resolve technical debt, strengthen code base

An RCA may show the problem is broken code due to technical debt. If the problem occurred due to changed business requirements, code development compromises, poor coding practices, or software entropy, the real solution may be refactoring rather than patching. Refactoring realigns your code with desired business outcomes, eliminates technical debt, and brings it up to current standards for future agile deployments.

Limitations of RCA

To effectively implement RCA, it’s essential to understand its limitations. While RCA is a powerful tool for problem-solving and prevention in IT and other industries, it’s important to recognize its boundaries and the challenges that can arise. This awareness not only helps in applying RCA more effectively, but also in integrating it with other strategies for a more comprehensive approach to problem-solving.

RCA is primarily a diagnostic process. It focuses on identifying the underlying causes of problems or incidents to prevent their recurrence. However, like any diagnostic method, it has its constraints:

  • Time-consuming: Comprehensive RCA can be a lengthy process, delaying immediate corrective actions.
  • Complexity in large systems: In highly complex systems, identifying a single root cause can be challenging.
  • Subject to bias: RCA outcomes can be influenced by individual or team biases.
  • Not a panacea: RCA may not address systemic issues or external factors beyond the organization’s control.
  • Resource-intensive: Effective RCA often requires significant resources such as tools and skilled personnel.

By understanding these limitations, IT professionals can better navigate the RCA process, ensuring a more balanced and effective approach to problem-solving. It’s also crucial to combine RCA with other methodologies and insights, forming a multifaceted strategy that addresses not just the “what” and “why” of problems, but also the “how” of future prevention and improvement.

Effective RCA saves more than money

Creating a robust root cause analysis process takes time and effort in the initial stages, but it is an investment whose returns extend far beyond the expense. The skills learned during the RCA process can be carried over to almost every other problem or field and initiate an attitude of continuous improvement, and even innovation.

This culture will surely permeate your organization for the better.

Root cause analysis examples by industry

RCA is a versatile tool applied across various industries, each with unique challenges and requirements. Its adaptability and effectiveness in identifying the underlying causes of issues make it an invaluable technique in diverse settings. From healthcare to retail, RCA provides critical insights that drive improvements, enhance efficiency, and prevent future problems.

Each industry is faced with unique challenges:

  • Healthcare: In healthcare, RCA is used to understand patient safety incidents or equipment failures.
  • Financial services: Financial institutions use RCA to analyze system failures or security breaches impacting transactions.
  • Manufacturing: RCA in manufacturing often focuses on production line errors or equipment malfunctions.
  • IT: In IT, RCA helps in troubleshooting network outages, software bugs, or security incidents.
  • Retail: Retailers use RCA to address supply chain disruptions or customer service issues.

By examining root cause analysis examples from different sectors, we can gain a deeper appreciation of its versatility and effectiveness in solving complex problems and driving continuous improvement in diverse organizational contexts.

Root cause analysis and AIOps

RCA in the context of AIOps represents a powerful combination for IT environments, offering advanced tools for predictive analysis and automated issue resolution.

AIOps leverages artificial intelligence and machine learning (AI/ML) and big data analytics to enhance IT operations, and when combined with RCA, it provides a powerful mechanism for identifying, as well as predicting and preventing, IT issues.

AIOps enhances RCA by automating data collection and analysis, which allows for faster identification of root causes in real time. This integration leads to more proactive and predictive IT management. For instance, AIOps can analyze patterns and anomalies across vast datasets, detecting potential issues before they escalate into major problems. This predictive capability is crucial for maintaining system health and ensuring uninterrupted service delivery.
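
To make the idea of anomaly detection concrete, here is a deliberately simple sketch that flags metric samples far from the mean using a z-score. It is not how any particular AIOps product works; real platforms use far richer models. The function name, the threshold, and the latency samples are assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=3.0):
    """Return indexes of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all samples identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical latency samples in ms, with one obvious spike at index 9.
latency_ms = [101, 99, 100, 102, 98, 100, 101, 99, 100, 250, 100, 100]

print(find_anomalies(latency_ms, threshold=2.5))  # [9]
```

A flagged index is only a starting point for RCA: it tells you when something unusual happened, and the techniques above help explain why.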

Moreover, AIOps enables more sophisticated RCA by handling complex, multi-layered IT environments where traditional RCA might struggle. It sifts through the noise of vast data sets to pinpoint accurate root causes, reducing the time IT teams spend on troubleshooting and increasing their efficiency.

The combination of RCA and AIOps represents a shift from reactive to proactive and predictive IT management. It helps resolve current issues while also anticipating and preventing future disruptions. This proactive approach is essential for businesses that rely on IT infrastructure for their critical operations, as it minimizes downtime and ensures a more stable and reliable IT environment.

By integrating RCA with AIOps, organizations can harness the power of advanced analytics and AI to transform their IT operations, making them more resilient, efficient, and aligned with business objectives.

About the author.

Laura Shiff

Laura Shiff is a researcher and technical writer based in the Twin Cities. She specializes in software, technology, and medicine. You can reach Laura at [email protected]

Joe Hertvik

Joe Hertvik PMP owns Hertvik & Associates, an IT infrastructure and marketing management consultancy. Joe provides contract services for IT environments including Project Management, Data Center, network, Infrastructure, and IBM i management.

His company also provides Marketing, content strategy, and content production services for B2B IT industry companies. Joe has produced over 1,000 articles and IT-related content for various publications and tech companies over the last 15 years.

Joe can be reached via email at [email protected] or LinkedIn.

7 Powerful Root Cause Analysis Tools and Techniques

By Sebastian Traeger

Updated: April 21, 2024

1. The Ishikawa Fishbone Diagram (IFD)
2. Pareto Chart
3. The 5 Whys
4. Failure Mode and Effects Analysis (FMEA)
5. PROACT® RCA Method
6. Affinity Diagram
7. Fault Tree Analysis (FTA)

With over two decades in business – spanning strategy consulting, tech startups and executive leadership – I am committed to helping your organization thrive. At Reliability, we’re on a mission to help enhance strategic decision-making and operational excellence through the power of Root Cause Analysis, and I hope this article will be helpful! Our goal is to help you better understand these root cause analysis techniques by offering insights and practical tips based on years of experience. Whether you’re new to doing RCAs or a seasoned pro, we trust this will be useful in your journey towards working hard and working smart.

Root Cause Analysis (RCA) shines as a pivotal process that helps organizations identify the underlying reasons for problems, failures, and inefficiencies. The goal is simple: find the cause, fix it, and prevent it from happening again. But the process can be complex, and that’s where various RCA techniques come into play. 

Let’s dive into seven widely utilized RCA techniques and explore how they can empower your team’s problem-solving efforts.

1. The Ishikawa Fishbone Diagram (IFD)

Named after Japanese quality control statistician Kaoru Ishikawa, the Fishbone Diagram is a visual tool designed for group discussions. It helps teams track back to the potential root causes of a problem by sorting and relating them in a structured way. The diagram resembles a fishbone, with the problem at the head and the causes branching off the spine like bones. This visualization aids in categorizing potential causes and studying their complex interrelationships.

2. Pareto Chart

The Pareto Chart, rooted in the Pareto Principle, is a visual tool that helps teams identify the most significant factors in a set of data. The principle holds that roughly 80% of problems can often be traced back to about 20% of causes. By arranging bars from tallest to shortest, teams can prioritize the most significant factors and focus their improvement efforts where they can have the most impact.

3. The 5 Whys

The 5 Whys method is the epitome of simplicity in getting to the bottom of a problem. By repeatedly asking ‘why’ (typically five times), you can delve beneath the surface-level symptoms of a problem to unearth the root cause. This iterative interrogation is most effective when answers are grounded in factual evidence.

4. Failure Mode and Effects Analysis (FMEA)

When prevention is better than cure, Failure Mode and Effects Analysis (FMEA) steps in. This systematic, proactive method helps teams identify where and how a process might fail. By predicting and examining potential process breakdowns and their impacts, teams can rectify issues before they turn into failures. FMEA is a three-step process that involves identifying potential failures, analyzing their effects, and prioritizing them based on severity, occurrence, and detection ratings.

5. PROACT® RCA Method

The PROACT® RCA technique is a robust process designed to drive significant business results. Notably used to identify and analyze ‘chronic failures,’ which can otherwise be overlooked, this method is defined by its name:

PReserving Evidence and Acquiring Data: Initial evidence collection step based on the 5-P’s – Parts, Position, People, Paper, and Paradigms.

Order Your Analysis Team and Assign Resources: Assembling an unbiased team to analyze a specific failure.

Analyze the Event: Reconstructing the event using a logic tree to identify Physical, Human, and Latent Root Causes.

Communicate Findings and Recommendations: Developing and implementing solutions to prevent root cause recurrence.

Track and Measure Impact for Bottom Line Results: Tracking the success of implemented recommendations and correlating the RCA’s effectiveness with ROI.

PROACT® RCA excels in mitigating risk, optimizing cost, and boosting performance, making it a valuable addition to any RCA toolkit.

6. Affinity Diagram

The Affinity Diagram is a powerful tool for dealing with large amounts of data. It organizes a broad range of information into groups based on their natural relationships, creating a clear, visual representation of complex situations. It’s particularly beneficial for condensing feedback from brainstorming sessions into manageable categories, fostering a better understanding of the broader picture.

7. Fault Tree Analysis (FTA)

Fault Tree Analysis (FTA) is a top-down, deductive failure analysis that explores the causes of faults or problems. It involves graphically mapping multiple causal chains to track back to possible root causes, using a tree-like diagram. FTA is particularly useful in high-risk industries, such as aerospace and nuclear power, where preventing failure is crucial.

Each RCA technique provides a unique approach for viewing and understanding problems, helping you pinpoint the root cause more effectively. The key is to understand when and how to use each tool, which can significantly enhance your team’s problem-solving capabilities.

Technique | What it offers | When to use it
Ishikawa Fishbone Diagram | Visual representation of complex relationships | When there are many possible causes to a problem
Pareto Chart | Prioritizes problem areas based on impact | When trying to identify the most significant causes
5 Whys | Simple, iterative problem-solving technique | When the problem is straightforward and the solution is not immediately apparent
FMEA | Proactive, preventative approach | When addressing complex processes that could lead to serious consequences if failed
PROACT® RCA Method | Comprehensive, result-driven approach | When dealing with chronic, recurrent failures
Affinity Diagram | Groups large data into manageable categories | When trying to find patterns and connections in large amounts of data
Fault Tree Analysis (FTA) | Visual mapping of causal chains | When working in high-risk industries where prevention is crucial

In conclusion, the techniques presented offer a diverse set of tools to help organizations address problems and inefficiencies effectively. From visual representations like the Ishikawa Fishbone Diagram and Pareto Chart to more proactive approaches such as the 5 Whys and Failure Mode and Effects Analysis (FMEA), each technique provides a unique perspective on identifying and mitigating root causes.

The PROACT® RCA Method stands out for its comprehensive process, particularly suited for chronic failures. Additionally, the Affinity Diagram and Fault Tree Analysis (FTA) contribute valuable insights by organizing data and exploring causal chains, respectively. Leveraging these techniques strategically enhances a team’s problem-solving capabilities, enabling them to make informed decisions and drive continuous improvement.

I hope you found these 7 techniques insightful and actionable! Stay tuned for more thought-provoking articles as we continue to share our knowledge. Success is rooted in a thorough understanding and consistent application, and we hope this article was a step in unlocking the full potential of Root Cause Analysis for your organization. Reliability runs initiatives such as an online learning center focused on the proprietary PROACT® RCA methodology and EasyRCA.com software. For additional resources, visit our Reliability Resources.

Status.net

Root Cause Analysis (RCA) Methods for Effective Problem Solving

By Status.net Editorial Team on May 8, 2023 — 7 minutes to read

Imagine facing a problem in your organization that keeps recurring despite your best efforts to solve it. You might be addressing the symptoms, but not the underlying cause. This is where root cause analysis (RCA) comes into play. RCA is a systematic approach to identifying the root cause of problems or events, understanding how to fix or compensate for them, and applying the knowledge gained to prevent future issues or replicate successes. In this comprehensive guide to root cause analysis, you’ll learn various methods and techniques for conducting an RCA. You’ll understand how to gather and manage evidence, investigate the people, processes, and systems involved, and determine the key factors leading to the problem or event.

Root Cause Analysis Fundamentals

Root Cause Analysis (RCA) is a systematic approach to identify the underlying cause of a problem. By focusing on the root cause, you can effectively address the issue and prevent recurrence. Generally, RCA is used to investigate incidents, eliminate defects, and enhance systems or processes.

RCA aims to achieve the following objectives:

  • Determine the root cause of a problem or issue, not just its symptoms.
  • Identify and implement solutions that address the root cause and prevent its recurrence.
  • Improve understanding of the systems, processes, or components involved to avoid similar issues in the future.
  • Foster a proactive and continuous improvement mindset within your organization.

The RCA Process

Problem Identification

To effectively utilize Root Cause Analysis (RCA), first identify the problem at hand. Determine the specific issue, incident, or failure that needs to be investigated. Clearly define the problem and its impact on your organization’s operations in order to establish a focused and valuable analysis.

Data Collection

Gather relevant data about the problem, including when and where it occurred, who was involved, what processes and systems were affected, and any other important context. Be thorough and systematic in your data collection, and make use of any available documentation, interviews, or observations to build a comprehensive understanding.

Cause Identification

Analyze the collected data to pinpoint potential causes of the problem. This could start with brainstorming and then using tools such as cause-and-effect diagrams or the “5 Whys” technique to delve deeper into the issue. Determine the causes that are most likely to have contributed to the problem and classify them as either root causes or contributing factors.

Solution Implementation

Once you have identified the root cause(s) of the problem, develop and execute an action plan to address the issue. Design solutions that specifically target the root cause(s) to eliminate them from your processes, rather than simply addressing the symptoms of the problem. Implement the appropriate changes to your processes or systems and ensure that all stakeholders are aware of these changes.

Follow-up and Monitoring

After implementing the solutions, monitor the results to ensure they are effective in addressing the root cause(s) and preventing the problem from reoccurring. Collect and analyze data regularly to evaluate the impact of the implemented solutions on your organization’s performance. Adjust and refine the solutions if necessary, and maintain ongoing vigilance in order to identify any future problems that may arise from the same root cause(s).

RCA Techniques

5 Whys

The 5 Whys technique is a straightforward method for identifying the root cause of a problem. To employ this approach, you simply ask “why” five times, with each question delving deeper into the issue. The process helps trace the problem to its origin by examining each level of cause and effect.

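  • Why did the machine stop working? The fuse blew.
  • Why did the fuse blow? The motor overheated.
  • Why did the motor overheat? There was insufficient lubrication on the motor.
  • Why was there insufficient lubrication on the motor? The lubrication schedule was not followed.
  • Why was the lubrication schedule not followed?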

In this case, the root cause is the failure to adhere to the lubrication schedule.
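The chain of questions above can be recorded as data so that each answer visibly feeds the next question. The following is a minimal sketch rather than a standard tool: the `Why` class and `root_cause` function are hypothetical names, and the answers are the ones implied by the machine example.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step of a 5 Whys chain: a question and the answer found for it."""
    question: str
    answer: str


def root_cause(chain: list[Why]) -> str:
    """Return the deepest answer in the chain, treated here as the root cause."""
    if not chain:
        raise ValueError("the chain needs at least one answered why")
    return chain[-1].answer


# Answers are inferred from the machine example; each one prompts the next question.
chain = [
    Why("Why did the machine stop working?", "The fuse blew."),
    Why("Why did the fuse blow?", "The motor overheated."),
    Why("Why did the motor overheat?", "The motor had insufficient lubrication."),
    Why("Why was there insufficient lubrication?", "The lubrication schedule was not followed."),
]

for step in chain:
    print(f"{step.question} -> {step.answer}")
print("Root cause:", root_cause(chain))
```

Keeping the chain as a list makes it easy to stop early or continue past five questions, since the number five is a guideline rather than a rule.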

Fishbone Diagram

The Fishbone Diagram, also known as the Ishikawa Diagram or cause-and-effect diagram, is a visual tool that helps you organize and sort potential root causes. To create a Fishbone Diagram:

  • Write down the problem statement at the head of the fishbone structure.
  • Identify major categories of causes, such as people, process, equipment, and environment. Draw lines connecting them to the problem statement.
  • Assign specific causes under each category and draw smaller lines connecting them to the respective major categories.
  • Analyze the diagram to find trends, patterns, or potential areas of focus.
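The steps above can also be sketched as a small data structure, which is handy when the diagram is kept in a ticket or a script instead of on a whiteboard. This is an illustrative sketch only: the function names, the problem statement and the individual causes are all hypothetical, while the four categories come from the text.

```python
from collections import defaultdict


def build_fishbone(problem, causes):
    """Group (category, cause) pairs under a problem statement."""
    bones = defaultdict(list)
    for category, cause in causes:
        bones[category].append(cause)
    return {"problem": problem, "bones": dict(bones)}


def largest_category(fishbone):
    """Return the category holding the most causes, a possible area of focus."""
    bones = fishbone["bones"]
    return max(bones, key=lambda category: len(bones[category]))


# Hypothetical causes for an invented problem statement.
fishbone = build_fishbone(
    "Nightly deployment fails",
    [
        ("people", "New engineer unfamiliar with the release checklist"),
        ("process", "No review step for configuration changes"),
        ("process", "Rollback procedure not documented"),
        ("process", "Release window overlaps with backups"),
        ("equipment", "Build server low on disk space"),
        ("environment", "Staging differs from production"),
    ],
)

for category, items in fishbone["bones"].items():
    print(category, "->", items)
print("Focus area:", largest_category(fishbone))
```

Counting causes per category is only a rough proxy for where to look first; the team still has to judge which causes actually matter.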

Failure Modes and Effects Analysis (FMEA)

Failure Modes and Effects Analysis (FMEA) is a systematic approach to identifying potential failures and evaluating their consequences. An FMEA typically involves these steps:

  • Identify potential failure modes, which are the ways something could go wrong.
  • Determine the potential effects of each failure mode, and how it could impact the overall system or process.
  • Assign a risk priority number (RPN) to each failure mode, considering factors such as likelihood, severity, and detectability.
  • Develop actions and strategies to mitigate high-risk failure modes.

By using FMEA, you can proactively address possible issues before they escalate, and maintain a more reliable process or system.
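A common convention is to compute the risk priority number as the product of the three ratings, RPN = severity × likelihood × detectability, with each rated on a 1 to 10 scale. The sketch below assumes that convention; the `FailureMode` class, the failure modes and their ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int       # 1 (negligible) to 10 (catastrophic)
    likelihood: int     # 1 (rare) to 10 (almost certain)
    detectability: int  # 1 (almost always detected) to 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: the product of the three ratings."""
        return self.severity * self.likelihood * self.detectability


def rank_by_risk(modes):
    """Sort failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical failure modes and ratings, for illustration only.
modes = [
    FailureMode("Disk fills up", severity=7, likelihood=5, detectability=3),
    FailureMode("Certificate expires", severity=8, likelihood=3, detectability=6),
    FailureMode("Configuration typo", severity=5, likelihood=6, detectability=2),
]

ranked = rank_by_risk(modes)
for mode in ranked:
    print(f"{mode.name}: RPN {mode.rpn}")
```

Note that a higher detectability rating means the failure is harder to detect, which is why a hard-to-spot problem can outrank a more frequent one.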

Barrier Analysis

Barrier Analysis focuses on preventing problems by examining the barriers in place to control risks. The objective is to identify vulnerabilities in these barriers and develop strategies for improvement. The steps of Barrier Analysis include:

  • Identify hazards and risks associated with your system or process.
  • Define the barriers in place that protect against these hazards.
  • Evaluate the effectiveness, strength, and reliability of each barrier.
  • Identify gaps or weaknesses in the barriers.
  • Develop and implement improvements to strengthen the barriers.

This method provides a clear understanding of how existing safety measures perform and how they can be improved to better protect against potential issues.

Benefits of Root Cause Analysis

Quality Improvement

Root cause analysis can significantly enhance the quality of your products or services. By systematically identifying the root causes of issues and implementing corrective actions, you’ll prevent recurring problems and reduce the number of defects. In turn, this will help you maintain customer satisfaction, reduce costs associated with rework or returns, and improve your reputation in the market.

Risk Reduction

Reducing risk is another advantage of root cause analysis. When you identify the underlying causes of problems, you can take necessary measures to eliminate or mitigate those risks. This proactive approach can protect your business from potential losses or disruptions, such as regulatory penalties, customer dissatisfaction, or harm to employees or the environment. By addressing the sources of risk, you can maintain a safer and more profitable business.

Process Optimization

Root cause analysis supports continuous improvement by highlighting inefficiencies and areas for optimization in your operations. By examining your processes beyond the symptoms of a specific issue, you can uncover opportunities to streamline workflows, reduce waste or downtime, and better utilize resources. Implementing these improvements not only resolves the immediate problem but also enhances overall productivity and efficiency in your organization.

Challenges of Root Cause Analysis

Common Pitfalls

When conducting Root Cause Analysis (RCA), you might face common pitfalls that can reduce the effectiveness of your investigation. Some of these pitfalls include:

  • Rushing the process: It is important to allocate appropriate time and resources to conduct a thorough RCA.
  • Overlooking small details: Make sure to pay attention to all possible contributing factors when investigating a problem. Small details can often hold the key to the root cause.
  • Focusing on blame: RCA should focus on identifying systemic issues and providing solutions rather than blaming individuals or departments.

Addressing Human Factors

Human factors play a critical role in many problems. When conducting RCA, it is crucial to consider the human factors that may have contributed to the issue. Here are some tips to help you address human factors in your RCA:

  • Consider psychological factors: Assess the mental state of the people involved in the incident, including their level of stress, fatigue, and emotions.
  • Evaluate communication and collaboration: Analyze how effectively teams were communicating and working together at the time of the incident.
  • Assess training and competency: Determine if the people involved had the appropriate training and knowledge to handle the situation.

Root Cause Analysis (RCA): Definition, Process and Tools

Root Cause Analysis (RCA): this article explains the Root Cause Analysis or RCA in a practical way. The article starts with a general definition of this concept, followed by the five approaches to the RCA and a practical Root Cause Analysis example. This article also contains a Root Cause Analysis template. Enjoy reading!

What is a Root Cause Analysis (RCA)?

Root Cause Analysis (RCA) is a method of problem solving that aims at identifying the root causes of problems or incidents.

RCA is based on the principle that problems are best solved by correcting their root causes, as opposed to methods that focus only on treating the symptoms.

Through corrective actions, the underlying causes are addressed so that recurrence of the problem can be minimized. It is utopian to think that a single corrective action will completely prevent recurrence of the problem. This is why root cause analysis is often considered to be an iterative process.

This problem solving method is often used when something goes wrong, but is also used when it goes well. More on this proactive attitude to problem solving later.

Root cause analyses, as well as incident investigations and other forms of problem solving, are fundamentally linked to the following three questions:

  • What is currently the problem?
  • Why does this problem occur?
  • What can be done to prevent this problem from happening again?

What is the goal of the Root Cause Analysis?

Root Cause Analysis is used as a tool for continuous improvement. When an RCA is used for the first time, it is a reactive way of identifying and solving problems. This means that an analysis is performed after a problem or incident has occurred.

By executing this analysis before problems occur, its use changes from reactive to proactive, so that problems can be anticipated in time. RCA is not a strictly defined methodology. There are many different tools, processes and philosophies that have been developed based on Root Cause Analysis.

However, there are five approaches that can be identified in practice:

Safety-based

Its origin can mainly be found in accident analyses, safety and healthcare.

Production-based

Its origin can mainly be found in the area of quality control and industrial manufacturing.

Process-based

This approach follows on from the production-based approach and extends it to business processes.

Failure-based

Its origin can be found in engineering and maintenance.

Systems-based

Its origin can be found in the amalgamation of the approaches mentioned above and this is combined with ideas from change management, risk management and systems analysis.

Although there seems to be no clear definition of how the objectives differ among the various approaches, there are some common principles that can be considered universal. It is also possible to define a general process for performing a Root Cause Analysis.

Where is the Root Cause Analysis applied in practice?

The Root Cause Analysis is applied in many areas. Below are some examples where an RCA can make a difference.

When an industrial machine breaks down, an RCA can determine the cause of the defect.

If it turns out that a fuse has blown, the fuse can be replaced and the machine restarted, but then the machine will stop working again after a while.

By performing an RCA it is discovered that the problem lies with a pump in the automatic lubrication mechanism. By determining the root cause of the defect by means of an RCA, the same problem can be prevented after an appropriate response.

Information technology

RCA is also used in IT to track down the root causes of problems. An example of this is the computer security management process. It uses RCA to investigate security breaches.

RCA is also used in the field of safety and health. Think of diagnoses made in medicine, identifying the source of an epidemic, accident analysis and occupational health.

Root Cause Analysis: the basic process

The basic process consists of a number of steps. Followed in order, these steps lead to the true cause of the problem and to corrective measures.

Define the problem or the factual description of the incident

Use both qualitative and quantitative information (nature, size, location and timing) about the results in question to describe the problem as the starting point for finding the root cause.

Collect data and evidence and classify

Collect data and evidence and classify them along a time line of incidents until the eventual problem or incident is found. Each special deviation in the form of behaviour, condition, action and passivity must be recorded in the time line.
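A timeline like this can be kept as a simple list of dated records. The sketch below is one possible shape, not a prescribed format: the `Event` class, the `build_timeline` function and the sample events are hypothetical, and the `kind` values mirror the four deviation types named above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    when: datetime
    kind: str  # "behaviour", "condition", "action" or "inaction"
    note: str


def build_timeline(events, incident_time):
    """Return the events up to and including the incident, oldest first."""
    relevant = [event for event in events if event.when <= incident_time]
    return sorted(relevant, key=lambda event: event.when)


# Hypothetical evidence, recorded in the order it was collected.
events = [
    Event(datetime(2024, 1, 10, 9, 30), "condition", "Disk usage reaches 95%"),
    Event(datetime(2024, 1, 10, 9, 0), "action", "Log level raised to debug"),
    Event(datetime(2024, 1, 10, 10, 0), "inaction", "Disk alert not acknowledged"),
    Event(datetime(2024, 1, 10, 11, 0), "action", "Service restarted"),
]
incident_time = datetime(2024, 1, 10, 10, 15)

timeline = build_timeline(events, incident_time)
for event in timeline:
    print(event.when.isoformat(), event.kind, event.note)
```

Events after the incident, such as the restart, are left out here because they belong to the response rather than to the chain of causes.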

Ask the why’s

Always ask ‘why’ to identify the effects and record the causes associated with each step in the sequence toward the defined problem or incident.

Classify the causes

Classify the causes into causal factors that relate to a crucial moment in the sequence, including the underlying causes. If there are multiple causes, which is often the case, document them in order of sequence for later selection and dig deeper. Identify all other harmful and contributing factors.

Generate corrective actions / improvements

Think of corrective actions or improvement measures that will ensure prevention of recurrence with a sufficient degree of certainty.

Explore whether corrective actions or improvement measures can be simulated in advance so that the possible effects become noticeable, also with respect to the other underlying causes.

Think of effective solutions that can prevent recurrence of the causes and to which all involved colleagues and team members can agree. These solutions must comply with the intended goals and objectives and must not cause any new and unforeseen problems.

Implement solutions and monitor these

Implement the solutions (corrective actions) that have been made by consensus. Monitor the effectiveness of the solutions (corrective actions) closely and adjust if necessary.

Other methods for problem-solving and problem prevention may be useful. Identify and address any other causes that may be harmful factors in the process.

Please note: steps three, four and five are the most critical part of the process, because these have proved to be successful in practice.

Root cause analysis tools

Other well-known Root Cause Analysis techniques and tools are listed below:

Barrier analysis

This root cause analysis technique is often used in the industrial sector.

It was developed to identify energy flows and focus on possible blocks for those flows in order to determine how and why the obstacles cannot prevent the energy flows from causing damage.

Current Reality Tree

This complex but powerful method developed by Eliyahu M. Goldratt is based on representing causal factors in a tree structure. This method uses rules of logic. The method starts with a short list of the undesirable factors we see around us that will subsequently lead to one or more underlying causes.

Change analysis

This research methodology is often used for problems or accidents and demonstrates how the problem has presented itself from different perspectives.

5 times why

In the Japanese analysis method 5 whys, the question ‘why’ is asked five times. The 5 whys technique was originally developed by Sakichi Toyoda, and was used to trace the root cause of problems within the manufacturing process of Toyota Motors.

Fishbone diagram

This method is also known as the Ishikawa diagram. The Ishikawa diagram is a much preferred method of project managers to perform a Root Cause Analysis.

Kepner Tregoe method

The Kepner Tregoe Method is a fact-based method in which the possible causes are systematically excluded in order to find the real cause. This method also separates the problem from the decision.

RPR Problem Diagnosis

This is an ITIL aligned method designed to determine the root cause of IT problems.

Core Principles of Root Cause Analysis

While there are many different approaches to Root Cause Analyses, most of the methods boil down to the following five steps.

Identification and description

Problem statements and event descriptions are very helpful and often required to perform a proper Root Cause Analysis. An outage is an example of a problem where this is particularly important.

The Root Cause Analysis must establish a sequence of events or a timeline before the relationship between causal factors can be understood.

Differentiation

It is important to distinguish between root cause, causal factors and non-causal factors. This is done by correlating the sequence of events with the size, nature, and timing of the problem. One way to detect underlying causal factors is to use clustering and data mining.

Finally, from the sequences of events, researchers must create an additional set of events that actually caused the problem. This is then converted into a causal graph. To be effective, the Root Cause Analysis must be performed systematically.
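To make the idea of a causal graph concrete, the sketch below stores cause-and-effect links as pairs and walks backwards from the problem until it reaches causes that have no recorded cause of their own. The function name and the edges are illustrative assumptions, loosely based on the machine example earlier in this article; this is not a formal RCA algorithm.

```python
def root_causes(edges, problem):
    """Walk backwards from the problem and collect causes that have no cause recorded."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)

    roots, seen, stack = set(), set(), [problem]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        causes = parents.get(node, set())
        if not causes and node != problem:
            roots.add(node)  # nothing upstream of this node: a candidate root cause
        stack.extend(causes)
    return roots


# Each pair reads (cause, effect). The last pair is a non-causal factor:
# it happened, but it does not lead to the problem.
edges = [
    ("lubrication pump fault", "insufficient lubrication"),
    ("insufficient lubrication", "motor overheats"),
    ("motor overheats", "fuse blows"),
    ("fuse blows", "machine stops"),
    ("shift change", "logbook entry"),
]

print(root_causes(edges, "machine stops"))
```

Factors that never connect to the problem, such as the shift change, are ignored by the walk, which mirrors the distinction between causal and non-causal factors.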

This form of problem solving is often a team effort. Think of the analysis of aircraft accidents. For this, the conclusions of researchers and identified causes must be supported by documented evidence.

Correcting measures

Taking corrective action is not formally part of the RCA, as the goal of the analysis is to identify the root cause of a problem. Still, it is an important step that is added to virtually all Root Cause Analyses. This step therefore adds long-term corrective actions so that the problem does not develop in the same way as before.

Root Cause Analysis training

There are various forms of training for managers and other people for whom it is important to carry out an RCA correctly. These courses are ideal for people who need to understand Root Cause Analysis terminology and process for professional use. Participating in such training courses helps to understand the importance of identifying the root cause of a problem to ensure it does not recur. In addition, courses help to identify common barriers and problems in conducting an RCA.

Root Cause Analysis summary

A Root Cause Analysis (RCA) is a method for identifying the root causes of various problems. There are several methods and techniques that are used for this purpose: Fishbone Diagram, 5 whys method, Barrier Analysis and the Kepner Tregoe Method.

Although they all differ slightly from each other, the operation of the method can be summarized in three questions: what is the problem, why does this problem occur, and what can be done to prevent this problem from happening again? In practice, an RCA is used in production facilities, in information technology and in the health and safety industry.

Five elements are important in performing the RCA and always come back. First, it is imperative that there is a description and explanation of the events leading up to the identification of the problem. In addition, it is important to establish the correct chronology of these events. Subsequently, it must be possible to clearly distinguish between the root cause, causal factors and non-causal factors.

After this, researchers need to determine the sequence of events that almost certainly led to the problem. The final step usually consists of taking corrective action. While not a formal part of the Root Cause Analysis (RCA), this step is very important to ensure that the problem does not develop in the same way in the future as it did before.

Root Cause Analysis template

Start with the cause and effect analysis and identify the causes of problems with this ready to use Root Cause Analysis template.

How to cite this article: Van Vliet, V. (2010). Root Cause Analysis (RCA) . Retrieved [insert date] from Toolshero: https://www.toolshero.com/problem-solving/root-cause-analysis-rca/

Original publication date: 08/15/2010 | Last update: 06/12/2024

Software Testing Help

Guide To Root Cause Analysis – Steps, Techniques & Examples

This Tutorial Explains What is Root Cause Analysis and Different Root Cause Analysis Techniques like Fishbone Analysis and 5 Whys Technique:

RCA (Root Cause Analysis) is a structured and effective process to find the root cause of issues in a Software Project team. If performed systematically, it can improve the performance and quality of the deliverables and the processes, not only at the team level but also across the organization.

This tutorial will help you define and streamline the Root Cause Analysis process in your team or organization.

This tutorial is intended for Delivery Managers, Scrum Masters, Project Managers, Quality Managers, Development Team, Test Team, Information Management Team, Quality Team, Support Team, etc. to understand the basics of Root Cause Analysis and provides templates and examples of it.

What Is Root Cause Analysis?

RCA (Root Cause Analysis) is a mechanism for analyzing defects to identify their cause. We brainstorm, read, and dig into the defect to identify whether it was due to a “testing miss”, a “development miss”, or a “requirement or design miss”.

When RCA is done accurately, it helps to prevent defects in later releases or phases. If we find that a defect was due to a design miss, we can review the design documents and take appropriate measures. Similarly, if we find that a defect was due to a testing miss, we can review our test cases or metrics and update them accordingly.

RCA should not be limited to defects found in testing. We can do RCA on production defects as well. Based on the outcome of the RCA, we can enhance our test bed and include those production tickets as regression test cases. This will ensure that the defect, or similar kinds of defects, is not repeated.

Root Cause Analysis Process

Introduction

RCA is not only used for defects reported from a customer site, but also for UAT defects, Unit Testing defects, Business, and Operational process-level problems, day-to-day life problems, etc. Hence it is used in multiple industries like Software Sector, Manufacturing, Health, Banking Sector, etc.

Conducting Root Cause Analysis is similar to the work of the doctor who treats a patient. The doctor will first understand the symptoms. Then he will refer to laboratory tests to analyze the root cause of the disease.

If the root cause of the disease is still unknown, the doctor will refer for scan tests to understand further. He will continue the diagnosis and study until he narrows down to the root cause of the patient’s sickness. The same logic applies to Root Cause Analysis performed in any industry.

So, RCA is aimed at finding the root cause and not treating the symptom, by following a specific set of steps and associated tools. It is different from defect analysis, troubleshooting, and other problem-solving methods as these methods try to find the solution for the specific issue, but RCA tries to find the underlying cause.

Origin of the name Root Cause Analysis:

Leaves, trunk, and roots are the most important parts of a tree. The leaves [Symptom] and trunk [Problem] are above the ground and visible, but the roots [Cause] are under the ground and not visible, and they can grow deeper and spread further than we expect. Hence, the process of digging to the bottom of the issue is called Root Cause Analysis.

Advantages of Root Cause Analysis

Listed below are some of the benefits you will get:

  • Prevents the recurrence of the same problem in the future.
  • Eventually reduces the number of defects reported over time.
  • Reduces development costs and saves time.
  • Improves the software development process, aiding quick delivery to market.
  • Improves customer satisfaction.
  • Boosts productivity.
  • Finds hidden problems in the system.
  • Aids in continuous improvement.

Types of Root Causes

#1) Human Cause: Human-made error.

  • Underskilled.
  • Instructions not duly followed.
  • Performed an unnecessary operation.

#2) Organizational Cause: A process that people use to make decisions was not proper.

  • Vague instructions were given from Team Lead to team members.
  • Picking the wrong person for a task.
  • Monitoring tools not in place to assess the quality.

#3) Physical Cause: Any physical item failed in some way.

  • The computer keeps restarting.
  • The server is not booting up.
  • Strange or loud noises in the system.

A structured and logical approach is required for an effective root cause analysis. Hence, it’s necessary to follow a series of steps.

Steps to do Root Cause Analysis

#1) Form RCA Team

Every team should have a dedicated Root Cause Analysis Manager [RCA Manager] who will collect the details from the Support team and initiate the kick-off process for RCA. He will coordinate and allocate resources who need to attend RCA meetings depending on the stated problem.

Teams, who attend the meeting, should have personnel from each team [Requirement, Design, Testing, Documentation, Quality, Support & Maintenance] who are most familiar with the problem. The team should have people who are directly linked to the defect as well. For example, the Support engineer who gave an immediate fix to the customer.

Share the problem details with the team before the meeting so that they can do some initial analysis and come prepared. Team members also gather information related to the defect. Depending on the incident report, each team will trace what went wrong in this scenario in their respective phases. Being prepared will increase the efficiency of the upcoming discussion.

#2) Define The Problem

Collect the details of the problem, such as incident reports and problem evidence (screenshots, logs, reports, etc.), then study and analyze the problem by asking the questions below:

  • What is the problem?
  • What is the sequence of events that led to the problem?
  • What systems were involved?
  • How long did the problem exist?
  • What is the impact of the problem?
  • Who was involved, and who should be interviewed?

Use ‘SMART’ rules to define your problem:

  • MEASURABLE
  • ACTION-ORIENTED
  • TIME-BOUND

#3) Identify Root Cause

Conduct the BRAINSTORMING session within the RCA team formed to identify the causes. Use the Fishbone diagram or 5 Why Analysis method or both to arrive at the root cause/s.

RCA manager should moderate the meeting and set the rules for the Brainstorming session. For example, the rules can be:

  • Criticizing/blaming others should not be allowed.
  • Don’t judge others’ ideas. No idea is a bad idea; wild ideas are encouraged.
  • Build on the ideas of others. Think about how you can take someone else’s idea and make it better.
  • Give each participant due time to share their views.
  • Encourage out-of-the-box thinking.
  • Stay focused.

All ideas should be recorded. The RCA manager should assign a member to record the minutes of the meeting and update the RCA templates.

#4) Implement Root Cause Corrective Action (RCCA)

Corrective action involves fixing the problem at its real root cause. To facilitate this, a delivery manager has to be present who can decide in which versions the fix has to be implemented and what the delivery date should be.

RCCA should be implemented in such a way that this root cause will not occur again in the future. The fix given by the support team will be temporary, for the customer site where the issue was reported. When this fix is merged into an ongoing version, do a proper impact analysis to ensure no existing feature is broken.

Give the steps to validate the fix and monitor the implemented solution to check if the solution is effective.

#5) Implement Root Cause Preventive Action (RCPA)

The team needs to come up with a plan for how similar issues can be prevented in the future, for example by updating the instruction manual, improving skill sets, or updating the team assessment checklist. Document the preventive actions properly and monitor whether the team is adhering to them.

Please refer to this research paper on “Defect Analysis and Prevention for Software Process Quality Improvement” published in the International Journal of Software Engineering & Applications to get an idea of the types of defects reported in each software phase and suggested preventive actions for them.

The information gained from RCA can go as input into Failure Mode and Effect Analysis (FMEA) to identify points where the solution can fail.

Implement Pareto Analysis with the causes identified during RCA over a period, say half-yearly or quarterly which will help to identify the top causes which are contributing to the defects and focus on preventive action for them.
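As a rough illustration of Pareto Analysis, the sketch below counts how often each cause was recorded and keeps the most frequent causes until they cover a chosen share of all defects. The `pareto` function, the 80% threshold and the sample counts are assumptions made for the example; the cause labels reuse the defect categories from this tutorial.

```python
from collections import Counter


def pareto(causes, threshold=0.8):
    """Return the most frequent causes that together reach the threshold share."""
    counts = Counter(causes).most_common()
    total = sum(count for _, count in counts)
    top, running = [], 0
    for cause, count in counts:
        top.append((cause, count))
        running += count
        if running / total >= threshold:
            break
    return top


# Hypothetical root causes recorded for ten defects over one quarter.
recorded = (
    ["requirement miss"] * 6
    + ["development miss"] * 3
    + ["testing miss"] * 1
)

for cause, count in pareto(recorded):
    print(f"{cause}: {count} defects")
```

With these sample numbers, two of the three causes account for nine of the ten defects, so preventive action would focus there first.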

Root Cause Analysis Techniques

#1) Fishbone Analysis

The fishbone diagram is a visual root cause analysis tool for identifying the possible causes of an identified problem, and hence it’s also called the Cause and Effect diagram. It allows you to get down to the real root cause of the issue rather than solving its symptom.

It’s also called the Ishikawa Diagram as it was created by Dr. Kaoru Ishikawa [a Japanese quality control statistician]. It’s also known as the Herringbone or Fishikawa diagram.

Fishbone analysis is used in the Analyze phase of Six Sigma’s DMAIC approach to problem-solving. It’s one of the 7 basic tools of quality control.

Steps to create a Fishbone Diagram:

A fishbone diagram resembles the skeleton of a fish: the problem forms the head, and the causes form the spine and bones.

Follow these steps to create a fishbone diagram:

  • Write the problem at the head of the fish.
  • Identify the categories of causes and write one at the end of each bone (cause category 1, cause category 2, …, cause category N).
  • Identify the primary causes under each category and mark them as primary cause 1, primary cause 2, …, primary cause N.
  • Extend the causes to secondary, tertiary, and further levels as applicable.
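The steps above can also be captured as a small data structure before drawing anything. The sketch below is one possible representation, assuming a nested dictionary in which each key is a cause and each value holds its sub-causes. The function `render_fishbone` and the sample problem and causes are hypothetical.

```python
def render_fishbone(problem, categories, indent="  "):
    """Render a fishbone as indented text: the problem is the head,
    top-level keys are cause categories, nested keys are deeper causes."""
    lines = [f"Problem: {problem}"]

    def walk(causes, depth):
        for cause, sub_causes in causes.items():
            lines.append(f"{indent * depth}- {cause}")
            walk(sub_causes, depth + 1)

    walk(categories, 1)
    return "\n".join(lines)


# Hypothetical categories and causes for a software defect.
fishbone = {
    "Requirements": {
        "Ambiguous acceptance criteria": {"No review with the customer": {}},
    },
    "Testing": {"Missing regression test": {}},
    "Environment": {"Configuration differs from production": {}},
}

text = render_fishbone("Login fails after upgrade", fishbone)
print(text)
```

Keeping the causes in a structure like this makes it easy to add secondary and tertiary levels as the brainstorming session continues.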

[Figure: fishbone diagram template]

The following example shows a fishbone diagram applied to a software defect.

[Figure: fishbone diagram for a software defect]

Many free and paid tools are available for creating fishbone diagrams. The fishbone diagrams in this tutorial were created using the Creately online tool. Fishbone templates and tools are covered in more detail in our next tutorial.

The 5 Whys technique was developed by Sakichi Toyoda and was used at Toyota in its manufacturing processes. The technique is a series of questions in which each answer is met with another “Why?” question, much as a child questions a grown-up: based on the answer the grown-up gives, the child asks “Why?” again and again until satisfied.

The 5 Whys technique is used on its own or as part of fishbone analysis to drill down to the root cause of a problem. The number of steps is not limited to five; it can take fewer or more questions to reach a diagnosis. 5 Whys is a relatively simple and fast way to rule out symptoms and arrive at the root cause.

The success of the technique depends on the knowledge of the people involved. The same “Why?” question can have different answers, so choosing the right direction and keeping the meeting focused is important.

Steps to create 5 Whys diagram

Start the brainstorming discussion by defining the problem. Then follow with subsequent Why and their answers.
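A 5 Whys session can be recorded as a simple chain of questions and answers. The sketch below is only an illustration: the class `FiveWhys`, its methods, and the sample answers are invented for this example, and the last answer is treated as the current root-cause candidate rather than a proven cause.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    problem: str
    steps: list = field(default_factory=list)  # (question, answer) pairs

    def ask(self, answer):
        """Record the answer to "Why?" asked about the previous statement."""
        previous = self.steps[-1][1] if self.steps else self.problem
        self.steps.append((f"Why: {previous}?", answer))
        return self

    def root_cause(self):
        """Return the last answer, the current root-cause candidate."""
        return self.steps[-1][1] if self.steps else None


# Hypothetical session; the chain may be shorter or longer than five steps.
analysis = (
    FiveWhys("Checkout page returned errors")
    .ask("The payment service timed out")
    .ask("Its connection pool was exhausted")
    .ask("A new release opened connections without closing them")
    .ask("The code review checklist does not cover resource cleanup")
)

for question, answer in analysis.steps:
    print(question, "->", answer)
print("Root-cause candidate:", analysis.root_cause())
```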

[Figure: 5 Whys template]

The following example shows a 5 Whys diagram applied to a software defect:

[Figure: 5 Whys diagram for a software defect]

The 5 Whys template and images were drawn using the Creately online software.

Many factors can cause defects:

  • Unclear / Missing / Incorrect Requirements
  • Incorrect Design
  • Incorrect Coding
  • Insufficient Testing
  • Environment Issues (Hardware, Software or Configurations)

These factors should always be kept in mind while performing the RCA process.

RCA starts and proceeds with brainstorming on the defect. The only questions we ask ourselves while doing RCA are “WHY?” and “WHAT?” We can dig into each phase of the life cycle to track down where the defect originated.

Let’s start with the “WHY?” questions (the list is not exhaustive). You can start from the outer phase and move towards the inner phases of the SDLC.

  • “WHY” was the defect not caught during the sanity test in production?
  • “WHY” was the defect not caught during testing?
  • “WHY” was the defect not caught during the test case review?
  • “WHY” was the defect not caught during unit testing?
  • “WHY” was the defect not caught during the design review?
  • “WHY” was the defect not caught during the requirement phase?

The answers to these questions will point to the exact phase where the defect originated. Once you have identified the phase and the reason, the “WHAT” part comes next.

“WHAT” will you do to avoid this in the future?

The answer to this “WHAT” question, if implemented and followed through, will prevent the same defect, or the same kind of defect, from arising again. Take proper measures to improve the identified process so that the defect, or the reason for it, is not repeated.

Based on the results of RCA, you can determine which phases have problem areas.

For example, if you determine that most defects are due to requirement misses, you can improve the requirement gathering and understanding phase by introducing more reviews or walk-through sessions.

Similarly, if you find that most defects are due to testing misses, you need to improve the testing process. You can introduce measures such as a Requirement Traceability Matrix and test coverage metrics, or keep a check on the review process or any other step that you feel would improve the efficiency of testing.

It is the responsibility of the entire team to sit and analyze the defects and contribute to the product and process improvement.

Basic Principle of Root Cause Analysis

Root Cause Analysis (RCA) is one of the main problem-solving techniques. It helps organizations improve software quality, and thereby the organization itself, by reducing the number of defects in a system. It is a systematic process for identifying the major cause (root cause) of problems or defects and a useful way to resolve them. Eliminating the main cause of a defect reduces the chance that the defect recurs, which is where RCA concepts and tools are useful. The technique helps in the following ways:

  • It improves reliability.
  • It continuously challenges defects and problems that are costly.
  • It prevents defects from recurring.
  • It helps monitor and maximize the return on business investments.

Basic Principles of RCA:

  • The main aim of RCA is to determine the main cause of a problem so that corrective measures can be identified and taken to eliminate it. This prevents the defect from recurring.
  • Focus on identifying the corrective measures needed for a better and more effective result, rather than simply treating the symptoms of the defect.
  • Investigation is an essential and critical process. It requires appropriate procedures to be followed to get correct results, so perform the RCA process with full focus and follow those procedures.
  • Every event has a reason or cause. Defects do not arise without one: some mistake in the process, system, requirements, or elsewhere led to the defect. For every defect, therefore, there is at least one root cause. It may be difficult to find, but one should put in the effort and persistence needed to identify it.
  • To better understand the relationships among the contributing factors, the main cause, and the defect being defined, the analysis needs to develop a timeline or sequence of events.
  • The main focus should be on how and why the defect occurred and what its major cause is, rather than on who made the error.
  • Symptoms are important and should be treated, but focusing only on symptoms is not correct. Focus on finding corrective ways to resolve the defect rather than just on its symptoms.

Chapter 12 - Effective Troubleshooting

Effective Troubleshooting

Written by Chris Jones

“Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn't work.” (Brian Redman)
“Ways in which things go right are special cases of the ways in which things go wrong.” (John Allspaw)

Troubleshooting is a critical skill for anyone who operates distributed computing systems—especially SREs—but it’s often viewed as an innate skill that some people have and others don’t. One reason for this assumption is that, for those who troubleshoot often, it’s an ingrained process; explaining how to troubleshoot is difficult, much like explaining how to ride a bike. However, we believe that troubleshooting is both learnable and teachable.

Novices are often tripped up when troubleshooting because the exercise ideally depends upon two factors: an understanding of how to troubleshoot generically (i.e., without any particular system knowledge) and a solid knowledge of the system. While you can investigate a problem using only the generic process and derivation from first principles, 58 we usually find this approach to be less efficient and less effective than understanding how things are supposed to work. Knowledge of the system is typically what limits the effectiveness of an SRE new to a system; there’s little substitute for learning how the system is designed and built.

Let’s look at a general model of the troubleshooting process. Readers with expertise in troubleshooting may quibble with our definitions and process; if your method is effective for you, there’s no reason not to stick with it.

Formally, we can think of the troubleshooting process as an application of the hypothetico-deductive method: 59 given a set of observations about a system and a theoretical basis for understanding system behavior, we iteratively hypothesize potential causes for the failure and try to test those hypotheses.

In an idealized model such as that in Figure 12-1, we’d start with a problem report telling us that something is wrong with the system. Then we can look at the system’s telemetry 60 and logs to understand its current state. This information, combined with our knowledge of how the system is built, how it should operate, and its failure modes, enables us to identify some possible causes.

Figure 12-1. A process for troubleshooting.

We can then test our hypotheses in one of two ways. We can compare the observed state of the system against our theories to find confirming or disconfirming evidence. Or, in some cases, we can actively “treat” the system—that is, change the system in a controlled way—and observe the results. This second approach refines our understanding of the system’s state and possible cause(s) of the reported problems. Using either of these strategies, we repeatedly test until a root cause is identified, at which point we can then take corrective action to prevent a recurrence and write a postmortem. Of course, fixing the proximate cause(s) needn’t always wait for root-causing or postmortem writing.

In Practice

In practice, of course, troubleshooting is never as clean as our idealized model suggests it should be. There are some steps that can make the process less painful and more productive for both those experiencing system problems and those responding to them.

Problem Report

Every problem starts with a problem report, which might be an automated alert or one of your colleagues saying, “The system is slow.” An effective report should tell you the expected behavior, the actual behavior, and, if possible, how to reproduce the behavior. 65 Ideally, the reports should have a consistent form and be stored in a searchable location, such as a bug tracking system. Here, our teams often have customized forms or small web apps that ask for information that’s relevant to diagnosing the particular systems they support, which then automatically generate and route a bug. This may also be a good point at which to provide tools for problem reporters to try self-diagnosing or self-repairing common issues on their own.
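To make the shape of an effective report concrete, here is a minimal sketch of a report form that captures expected behavior, actual behavior, and optional reproduction steps. The class `ProblemReport`, its field names, and the text layout are assumptions for illustration; they do not describe any real bug tracker's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemReport:
    expected: str
    actual: str
    reproduction: Optional[str] = None  # "if possible", so optional

    def missing(self):
        """Names of required fields left blank."""
        return [
            name
            for name in ("expected", "actual")
            if not getattr(self, name).strip()
        ]

    def to_bug_text(self):
        """Format the report in a consistent, searchable layout."""
        steps = self.reproduction or "(not provided)"
        return (
            f"Expected: {self.expected}\n"
            f"Actual: {self.actual}\n"
            f"To reproduce: {steps}"
        )


# Hypothetical report from a colleague who said "the system is slow".
report = ProblemReport(
    expected="Search returns results in under 1 s",
    actual="Search takes about 30 s",
)
print(report.to_bug_text())
```

A form like this can reject incomplete reports early by checking `missing()` before a bug is filed.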

It’s common practice at Google to open a bug for every issue, even those received via email or instant messaging. Doing so creates a log of investigation and remediation activities that can be referenced in the future. Many teams discourage reporting problems directly to a person for several reasons: this practice introduces an additional step of transcribing the report into a bug, produces lower-quality reports that aren’t visible to other members of the team, and tends to concentrate the problem-solving load on a handful of team members that the reporters happen to know, rather than the person currently on duty (see also Dealing with Interrupts).

Once you receive a problem report, the next step is to figure out what to do about it. Problems can vary in severity: an issue might affect only one user under very specific circumstances (and might have a workaround), or it might entail a complete global outage for a service. Your response should be appropriate for the problem’s impact: it’s appropriate to declare an all-hands-on-deck emergency for the latter (see Managing Incidents), but doing so for the former is overkill. Assessing an issue’s severity requires an exercise of good engineering judgment and, often, a degree of calm under pressure.

Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct!

Instead, your course of action should be to make the system work as well as it can under the circumstances. This may entail emergency options, such as diverting traffic from a broken cluster to others that are still working, dropping traffic wholesale to prevent a cascading failure, or disabling subsystems to lighten the load. Stopping the bleeding should be your first priority; you aren’t helping your users if the system dies while you’re root-causing. Of course, an emphasis on rapid triage doesn’t preclude taking steps to preserve evidence of what’s going wrong, such as logs, to help with subsequent root-cause analysis.

Novice pilots are taught that their first responsibility in an emergency is to fly the airplane [Gaw09]; troubleshooting is secondary to getting the plane and everyone on it safely onto the ground. This approach is also applicable to computer systems: for example, if a bug is leading to possibly unrecoverable data corruption, freezing the system to prevent further failure may be better than letting this behavior continue.

This realization is often quite unsettling and counterintuitive for new SREs, particularly those whose prior experience was in product development organizations.

We need to be able to examine what each component in the system is doing in order to understand whether or not it’s behaving correctly.

Ideally, a monitoring system is recording metrics for your system as discussed in Practical Alerting from Time-Series Data. These metrics are a good place to start figuring out what’s wrong. Graphing time-series and operations on time-series can be an effective way to understand the behavior of specific pieces of a system and find correlations that might suggest where problems began. 66

Logging is another invaluable tool. Exporting information about each operation and about system state makes it possible to understand exactly what a process was doing at a given point in time. You may need to analyze system logs across one or many processes. Tracing requests through the whole stack using tools such as Dapper [Sig10] provides a very powerful way to understand how a distributed system is working, though varying use cases imply significantly different tracing designs [Sam14].

Exposing current state is the third trick in our toolbox. For example, Google servers have endpoints that show a sample of RPCs recently sent or received, so it’s possible to understand how any one server is communicating with others without referencing an architecture diagram. These endpoints also show histograms of error rates and latency for each type of RPC, so that it’s possible to quickly tell what’s unhealthy. Some systems have endpoints that show their current configuration or allow examination of their data; for instance, Google’s Borgmon servers (Practical Alerting from Time-Series Data) can show the monitoring rules they’re using, and even allow tracing a particular computation step-by-step to the source metrics from which a value is derived.

Finally, you may even need to instrument a client to experiment with, in order to discover what a component is returning in response to requests.

A thorough understanding of the system’s design is decidedly helpful for coming up with plausible hypotheses about what’s gone wrong, but there are also some generic practices that will help even without domain knowledge.

Simplify and reduce

Ideally, components in a system have well-defined interfaces and perform known transformations from their input to their output (in our example, given an input search text, a component might return output containing possible matches). It’s then possible to look at the connections between components—or, equivalently, at the data flowing between them—to determine whether a given component is working properly. Injecting known test data in order to check that the resulting output is expected (a form of black-box testing) at each step can be especially effective, as can injecting data intended to probe possible causes of errors. Having a solid reproducible test case makes debugging much faster, and it may be possible to use the case in a non-production environment where more invasive or riskier techniques are available than would be possible in production.

Dividing and conquering is a very useful general-purpose solution technique. In a multilayer system where work happens throughout a stack of components, it’s often best to start systematically from one end of the stack and work toward the other end, examining each component in turn. This strategy is also well-suited for use with data processing pipelines. In exceptionally large systems, proceeding linearly may be too slow; an alternative, bisection, splits the system in half and examines the communication paths between components on one side and the other. After determining whether one half seems to be working properly, repeat the process until you’re left with a possibly faulty component.
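Bisection over a linear pipeline can be sketched as a binary search. The example below assumes that you can probe the output of any stage with known test data, and that once one stage produces bad output every later stage does too. The function `first_bad_stage`, the stage names, and the probe results are hypothetical.

```python
def first_bad_stage(stages, output_ok):
    """Return the index of the first stage whose output is wrong,
    or None if the final output is fine.

    Assumes monotonicity: once a stage's output is wrong, the output
    of every later stage is wrong as well.
    """
    if not stages or output_ok(len(stages) - 1):
        return None
    # Invariant: the first bad stage lies in the half-open range [lo, hi).
    lo, hi = 0, len(stages)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if output_ok(mid - 1):  # everything before mid is healthy
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical pipeline; pretend probes show stages 0..2 produce good output.
stage_names = ["frontend", "auth", "search", "ranking", "render"]
healthy_through = 2


def output_ok(index):
    return index <= healthy_through


culprit = first_bad_stage(stage_names, output_ok)
print("Possibly faulty component:", stage_names[culprit])
```

With n stages this needs roughly log2(n) probes instead of up to n, which is the reason to prefer it in very large systems. When the monotonicity assumption does not hold, the result is only a starting point for further investigation.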

Ask "what," "where," and "why"

A malfunctioning system is often still trying to do something, just not the thing you want it to be doing. Finding out what it’s doing, then asking why it’s doing that and where its resources are being used or where its output is going can help you understand how things have gone wrong. 67

What touched it last

Systems have inertia: we’ve found that a working computer system tends to remain in motion until acted upon by an external force, such as a configuration change or a shift in the type of load served. Recent changes to a system can be a productive place to start identifying what’s going wrong. 69

Well-designed systems should have extensive production logging to track new version deployments and configuration changes at all layers of the stack, from the server binaries handling user traffic down to the packages installed on individual nodes in the cluster. Correlating changes in a system’s performance and behavior with other events in the system and environment can also be helpful in constructing monitoring dashboards; for example, you might annotate a graph showing the system’s error rates with the start and end times of a deployment of a new version, as seen in Figure 12-2.

Figure 12-2. Error rates graphed against deployment start and end times.
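When a change log is available, a first pass at "what touched it last" can be automated. The sketch below assumes change events are recorded as (timestamp, description) pairs and that you have an estimate of when the error rate began to rise. The function `recent_changes`, the window size, and the sample events are illustrative assumptions.

```python
from datetime import datetime, timedelta


def recent_changes(changes, onset, window=timedelta(hours=6)):
    """Return changes that landed within `window` before `onset`,
    most recent first."""
    candidates = [c for c in changes if onset - window <= c[0] <= onset]
    return sorted(candidates, key=lambda c: c[0], reverse=True)


# Hypothetical change log and onset time.
changes = [
    (datetime(2024, 3, 2, 9, 0), "config push: raise cache TTL"),
    (datetime(2024, 3, 2, 13, 30), "deploy: frontend v42"),
    (datetime(2024, 3, 2, 14, 10), "deploy: backend v17"),
    (datetime(2024, 3, 2, 15, 0), "package upgrade on node pool"),
]
onset = datetime(2024, 3, 2, 14, 20)

suspects = recent_changes(changes, onset, window=timedelta(hours=2))
for when, what in suspects:
    print(when.isoformat(), what)
```

Proximity in time only suggests where to look first; it does not show that a change caused the problem.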

Specific diagnoses

While the generic tools described previously are helpful across a broad range of problem domains, you will likely find it helpful to build tools and systems to help with diagnosing your particular services. Google SREs spend much of their time building such tools. While many of these tools are necessarily specific to a given system, be sure to look for commonalities between services and teams to avoid duplicating effort.

Test and Treat

Once you’ve come up with a short list of possible causes, it’s time to try to find which factor is at the root of the actual problem. Using the experimental method, we can try to rule in or rule out our hypotheses. For instance, suppose we think a problem is caused by either a network failure between an application logic server and a database server, or by the database refusing connections. Trying to connect to the database with the same credentials the application logic server uses can refute the second hypothesis, while pinging the database server may be able to refute the first, depending on network topology, firewall rules, and other factors. Following the code and trying to imitate the code flow, step-by-step, may point to exactly what’s going wrong.
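A small probe can help separate the two hypotheses in this example. The sketch below uses a plain TCP connection attempt instead of an ICMP ping, which is a substitution made here because ping usually needs elevated privileges. It does not test database credentials, and the function name `probe_tcp` and the result labels are assumptions for illustration.

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Classify a TCP connection attempt as 'connected', 'refused',
    or 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this port.
        return "refused"
    except OSError:
        # Timeout, no route to host, name resolution failure, and similar.
        return "unreachable"


if __name__ == "__main__":
    # Hypothetical database address; replace with a real host and port.
    print(probe_tcp("db.example.internal", 5432))
```

A result of "refused" points toward the database side, while "unreachable" points toward the network. Run the probe from the application logic server's machine where possible, because firewall rules may make results from a workstation misleading.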

There are a number of considerations to keep in mind when designing tests (which may be as simple as sending a ping or as complicated as removing traffic from a cluster and injecting specially formed requests to find a race condition):

  • An ideal test should have mutually exclusive alternatives, so that it can rule one group of hypotheses in and rule another set out. In practice, this may be difficult to achieve.
  • Consider the obvious first: perform the tests in decreasing order of likelihood, considering possible risks to the system from the test. It probably makes more sense to test for network connectivity problems between two machines before looking into whether a recent configuration change removed a user’s access to the second machine.
  • An experiment may provide misleading results due to confounding factors. For example, a firewall rule might permit access only from a specific IP address, which might make pinging the database from your workstation fail, even if pinging from the application logic server’s machine would have succeeded.
  • Active tests may have side effects that change future test results. For instance, allowing a process to use more CPUs may make operations faster, but might increase the likelihood of encountering data races. Similarly, turning on verbose logging might make a latency problem even worse and confuse your results: is the problem getting worse on its own, or because of the logging?
  • Some tests may not be definitive, only suggestive. It can be very difficult to make race conditions or deadlocks happen in a timely and reproducible manner, so you may have to settle for less certain evidence that these are the causes.
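The "consider the obvious first" guideline can be expressed as a simple ordering. In the sketch below, each candidate test carries a rough likelihood that its hypothesis is true and a rough risk to the system from running it. Both numbers are judgment-based estimates, and the function `order_tests` and the sample entries are hypothetical.

```python
def order_tests(tests):
    """Sort candidate tests by decreasing likelihood, then by increasing risk."""
    return sorted(tests, key=lambda t: (-t["likelihood"], t["risk"]))


# Hypothetical candidates with estimated likelihood and risk (0.0 to 1.0).
candidates = [
    {"name": "check whether a config change removed access", "likelihood": 0.2, "risk": 0.1},
    {"name": "test network connectivity between the machines", "likelihood": 0.6, "risk": 0.1},
    {"name": "restart the database with verbose logging", "likelihood": 0.6, "risk": 0.7},
]

plan = [t["name"] for t in order_tests(candidates)]
for step, name in enumerate(plan, start=1):
    print(step, name)
```

This ordering treats risk only as a tie-breaker. A team might reasonably weigh risk more heavily, for example by skipping any test above a risk limit during an active outage.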

Take clear notes of what ideas you had, which tests you ran, and the results you saw. 70 Particularly when you are dealing with more complicated and drawn-out cases, this documentation may be crucial in helping you remember exactly what happened and prevent having to repeat these steps. 71 If you performed active testing by changing a system—for instance by giving more resources to a process—making changes in a systematic and documented fashion will help you return the system to its pre-test setup, rather than running in an unknown hodge-podge configuration.

Negative Results Are Magic

Written by Randall Bosetti. Edited by Joan Wendt.

A "negative" result is an experimental outcome in which the expected effect is absent—that is, any experiment that doesn’t work out as planned. This includes new designs, heuristics, or human processes that fail to improve upon the systems they replace.

Negative results should not be ignored or discounted. Realizing you’re wrong has much value: a clear negative result can resolve some of the hardest design questions. Often a team has two seemingly reasonable designs but progress in one direction has to address vague and speculative questions about whether the other direction might be better.

Experiments with negative results are conclusive. They tell us something certain about production, or the design space, or the performance limits of an existing system. They can help others determine whether their own experiments or designs are worthwhile. For example, a given development team might decide against using a particular web server because it can handle only ~800 connections out of the needed 8,000 connections before failing due to lock contention. When a subsequent development team decides to evaluate web servers, instead of starting from scratch, they can use this already well-documented negative result as a starting point to decide quickly whether (a) they need fewer than 800 connections or (b) the lock contention problems have been resolved.

Even when negative results do not apply directly to someone else’s experiment, the supplementary data gathered can help others choose new experiments or avoid pitfalls in previous designs. Microbenchmarks, documented antipatterns, and project postmortems all fit this category. You should consider the scope of the negative result when designing an experiment, because a broad or especially robust negative result will help your peers even more.

Tools and methods can outlive the experiment and inform future work. As an example, benchmarking tools and load generators can result just as easily from a disconfirming experiment as a supporting one. Many webmasters have benefited from the difficult, detail-oriented work that produced Apache Bench, a web server loadtest, even though its first results were likely disappointing.

Building tools for repeatable experiments can have indirect benefits as well: although one application you build may not benefit from having its database on SSDs or from creating indices for dense keys, the next one just might. Writing a script that allows you to easily try out these configuration changes ensures you don’t forget or miss optimizations in your next project.

Publishing negative results improves our industry’s data-driven culture. Accounting for negative results and statistical insignificance reduces the bias in our metrics and provides an example to others of how to maturely accept uncertainty. By publishing everything, you encourage others to do the same, and everyone in the industry collectively learns much more quickly. SRE has already learned this lesson with high-quality postmortems, which have had a large positive effect on production stability.

Publish your results. If you are interested in an experiment’s results, there’s a good chance that other people are as well. When you publish the results, those people do not have to design and run a similar experiment themselves. It’s tempting and common to avoid reporting negative results because it’s easy to perceive that the experiment "failed." Some experiments are doomed, and they tend to be caught by review. Many more experiments are simply unreported because people mistakenly believe that negative results are not progress.

Do your part by telling everyone about the designs, algorithms, and team workflows you’ve ruled out. Encourage your peers by recognizing that negative results are part of thoughtful risk taking and that every well-designed experiment has merit. Be skeptical of any design document, performance review, or essay that doesn’t mention failure. Such a document is potentially either too heavily filtered, or the author was not rigorous in his or her methods.

Above all, publish the results you find surprising so that others—including your future self—aren’t surprised.

Ideally, you’ve now narrowed the set of possible causes to one. Next, we’d like to prove that it’s the actual cause. Definitively proving that a given factor caused a problem—by reproducing it at will—can be difficult to do in production systems; often, we can only find probable causal factors, for the following reasons:

  • Systems are complex. It’s quite likely that there are multiple factors, each of which individually is not the cause, but which taken jointly are causes. 72 Real systems are also often path-dependent, so that they must be in a specific state before a failure occurs.
  • Reproducing the problem in a live production system may not be an option, either because of the complexity of getting the system into a state where the failure can be triggered, or because further downtime may be unacceptable. Having a nonproduction environment can mitigate these challenges, though at the cost of having another copy of the system to run.

Once you’ve found the factors that caused the problem, it’s time to write up notes on what went wrong with the system, how you tracked down the problem, how you fixed the problem, and how to prevent it from happening again. In other words, you need to write a postmortem (although ideally, the system is alive at this point!).

App Engine, 73 part of Google’s Cloud Platform, is a platform-as-a-service product that allows developers to build services atop Google’s infrastructure. One of our internal customers filed a problem report indicating that they’d recently seen a dramatic increase in latency, CPU usage, and number of running processes needed to serve traffic for their app, a content-management system used to build documentation for developers. 74 The customer couldn’t find any recent changes to their code that correlated with the increase in resources, and there hadn’t been an increase in traffic to their app (see Figure 12-3 ), so they were wondering if a change in the App Engine service was responsible.

Our investigation discovered that latency had indeed increased by nearly an order of magnitude (as shown in Figure 12-4 ). Simultaneously, the amount of CPU time ( Figure 12-5 ) and number of serving processes ( Figure 12-6 ) had nearly quadrupled. Clearly something was wrong. It was time to start troubleshooting.

Figure 12-3. Application’s requests received per second, showing a brief spike and return to normal.

Typically a sudden increase in latency and resource usage indicates either an increase in traffic sent to the system or a change in system configuration. However, we could easily rule out both of these possible causes. First, while a spike in traffic to the app around 20:45 could explain a brief surge in resource usage, we’d expect resource usage to return to baseline fairly soon after request volume normalized. The elevated usage certainly shouldn’t have continued for multiple days, beginning when the app’s developers filed the report and we started looking into the problem. Second, the change in performance happened on Saturday, when no changes to either the app or the production environment were in flight. The service’s most recent code pushes and configuration pushes had completed days before. Furthermore, if the problem originated with the service, we’d expect to see similar effects on other apps using the same infrastructure. However, no other apps were experiencing similar effects.

We referred the problem report to our counterparts, App Engine’s developers, to investigate whether the customer was encountering any idiosyncrasies in the serving infrastructure. The developers weren’t able to find any oddities, either. However, a developer did notice a correlation between the latency increase and the increase of a specific data storage API call, merge_join , which often indicates suboptimal indexing when reading from the datastore. Adding a composite index on the properties the app uses to select objects from the datastore would speed those requests, and in principle, speed the application as a whole—but we’d need to figure out which properties needed indexing. A quick look at the application’s code didn’t reveal any obvious suspects.

It was time to pull out the heavy machinery in our toolkit: using Dapper [Sig10] , we traced the steps individual HTTP requests took—from their receipt by a frontend reverse proxy through to the point where the app’s code returned a response—and looked at the RPCs issued by each server involved in handling that request. Doing so would allow us to see which properties were included in requests to the datastore, then create the appropriate indices.

While investigating, we discovered that requests for static content such as images, which weren’t served from the datastore, were also much slower than expected. Looking at graphs with file-level granularity, we saw their responses had been much faster only a few days before. This implied that the observed correlation between merge_join and the latency increase was spurious and that our suboptimal-indexing theory was fatally flawed.

When we examined the unexpectedly slow requests for static content, we saw that most of the RPCs sent from the application were to a memcache service, so those RPCs should have been very fast, on the order of a few milliseconds. They did turn out to be very fast, so the problem didn’t seem to originate there. However, between the time the app started working on a request and when it made the first RPCs, there was about a 250 ms period where the app was doing…well, something. Because App Engine runs code provided by users, its SRE team does not profile or inspect app code, so we couldn’t tell what the app was doing in that interval; similarly, Dapper couldn’t help track down what was going on since it can only trace RPC calls, and none were made during that period.
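The gap described above, between the start of request handling and the first outgoing RPC, can be measured from trace data alone. The following is a minimal sketch, assuming a trace is available as a list of spans with start and end times in milliseconds; the `Span` type and `untraced_lead_time` function are hypothetical names for illustration and are not part of Dapper's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """One traced RPC (hypothetical shape, not a real tracing API)."""
    name: str
    start_ms: float
    end_ms: float


def untraced_lead_time(request_start_ms: float,
                       rpc_spans: List[Span]) -> Optional[float]:
    """Return the time between request start and the first RPC.

    Returns None when the request issued no RPCs at all, because the
    trace then says nothing about where the time went.
    """
    if not rpc_spans:
        return None
    first_rpc_start = min(span.start_ms for span in rpc_spans)
    return first_rpc_start - request_start_ms


# Illustrative values only: two fast memcache calls that begin
# 250 ms after the app starts working on the request.
spans = [
    Span("memcache.get", 250.0, 252.0),
    Span("memcache.get", 253.0, 254.5),
]
gap_ms = untraced_lead_time(0.0, spans)
```

A large value here, combined with fast RPCs, suggests the time is being spent in local computation that an RPC tracer cannot see, so a different tool (such as in-process instrumentation) is needed.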

Faced with what was, by this point, quite a mystery, we decided not to solve it… yet. The customer had a public launch scheduled for the following week, and we weren’t sure how soon we’d be able to identify the problem and fix it. Instead, we recommended that the customer increase the resources allocated to their app to the most CPU-rich instance type available. Doing so reduced the app’s latency to acceptable levels, though not as low as we’d prefer. We concluded that the latency mitigation was good enough that the team could conduct their launch successfully, then investigate at leisure. 75

At this point, we suspected that the app was a victim of yet another common cause of sudden increases in latency and resource usage: a change in the type of work. We’d seen an increase in writes to the datastore from the app, just before its latency increased, but because this increase wasn’t very large—nor was it sustained—we’d written it off as coincidental. However, this behavior did resemble a common pattern: an instance of the app is initialized by reading objects from the datastore, then storing them in the instance’s memory. By doing so, the instance avoids reading rarely changing configuration from the datastore on each request, and instead checks the in-memory objects. Then, the time it takes to handle requests will often scale with the amount of configuration data. 76 We couldn’t prove that this behavior was the root of the problem, but it’s a common antipattern.
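To illustrate why request time can scale with the amount of cached data, here is a minimal sketch contrasting a linear scan over in-memory objects with a hash-based lookup. The function names and the `(user, path)` tuple layout are assumptions made for this example; the case study does not show the app's actual code.

```python
from typing import FrozenSet, Iterable, List, Tuple

Entry = Tuple[str, str]  # hypothetical (user, path) pair


def is_cached_linear(entries: List[Entry], user: str, path: str) -> bool:
    """Check every cached object in turn: O(n) work per request."""
    for entry in entries:
        if entry == (user, path):
            return True
    return False


def build_index(entries: Iterable[Entry]) -> FrozenSet[Entry]:
    """Build a hash-based index once, when the cache is loaded."""
    return frozenset(entries)


def is_cached_indexed(index: FrozenSet[Entry], user: str, path: str) -> bool:
    """Average O(1) work per request, regardless of cache size."""
    return (user, path) in index


# Illustrative data only: a cache that has grown to thousands of objects.
entries = [("user%d" % i, "/docs/page%d" % i) for i in range(5000)]
index = build_index(entries)

found_linear = is_cached_linear(entries, "user4999", "/docs/page4999")
found_indexed = is_cached_indexed(index, "user4999", "/docs/page4999")
```

Both functions return the same answer; the difference is that the linear version does more work on every request as the cache grows, which matches the symptom of rising latency and CPU usage without any additional RPCs.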

The app developers added instrumentation to understand where the app was spending its time. They identified a method, called on every request, that checked whether a user had whitelisted access to a given path. The method used a caching layer that sought to minimize accesses to both the datastore and the memcache service, by holding whitelist objects in instances’ memory. As one of the app’s developers noted in the investigation, “I don’t know where the fire is yet, but I’m blinded by smoke coming from this whitelist cache.”
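The kind of instrumentation described above can be as simple as timing named sections of the request path and ranking them. This is a minimal sketch using only the Python standard library; the `SectionTimer` class and the section labels are hypothetical, and the text does not say what instrumentation the app's developers actually used.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulate wall-clock time per labeled code section."""

    def __init__(self):
        self.totals = defaultdict(float)  # label -> total seconds
        self.counts = defaultdict(int)    # label -> number of calls

    @contextmanager
    def section(self, label):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the section raised an exception.
            self.totals[label] += time.perf_counter() - start
            self.counts[label] += 1

    def slowest(self):
        """Return (label, total_seconds) pairs, slowest first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


timer = SectionTimer()

# Stand-ins for real request-handling steps.
with timer.section("access_check"):
    time.sleep(0.02)
with timer.section("render_response"):
    pass

ranking = timer.slowest()
```

Exporting these totals as metrics, rather than only logging them, makes it possible to see when a section's cost starts to grow.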

Some time later, the root cause was found: due to a long-standing bug in the app’s access control system, whenever one specific path was accessed, a whitelist object would be created and stored in the datastore. In the run-up to launch, an automated security scanner had been testing the app for vulnerabilities, and as a side effect, its scan produced thousands of whitelist objects over the course of half an hour. These superfluous whitelist objects then had to be checked on every request to the app, which led to pathologically slow responses—without causing any RPC calls from the app to other services. Fixing the bug and removing those objects returned the app’s performance to expected levels.

Making Troubleshooting Easier

There are many ways to simplify and speed troubleshooting. Perhaps the most fundamental are:

  • Building observability—with both white-box metrics and structured logs—into each component from the ground up
  • Designing systems with well-understood and observable interfaces between components

Ensuring that information is available in a consistent way throughout a system—for instance, using a unique request identifier throughout the span of RPCs generated by various components—reduces the need to figure out which log entry on an upstream component matches a log entry on a downstream component, speeding the time to diagnosis and recovery.
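One way to carry a unique request identifier through every log entry is sketched below, using Python's standard `logging` and `contextvars` modules. The logger names, the log format, and the `handle_request` function are assumptions for illustration; in a real system the identifier would also be forwarded in outgoing RPC metadata so that downstream components log the same value.

```python
import contextvars
import io
import logging
import uuid

# Holds the identifier of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def make_logger(name, stream):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s request_id=%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(frontend_log, backend_log, incoming_id=None):
    """Reuse an upstream ID if present; otherwise mint a new one."""
    rid = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(rid)
    try:
        frontend_log.info("received request")
        backend_log.info("datastore lookup done")
    finally:
        request_id_var.reset(token)
    return rid


stream = io.StringIO()
frontend = make_logger("frontend", stream)
backend = make_logger("backend", stream)
rid = handle_request(frontend, backend, incoming_id="req-123")
log_lines = stream.getvalue().splitlines()
```

With this in place, matching an upstream log entry to a downstream one becomes a search for a single identifier instead of a comparison of timestamps.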

Problems in correctly representing the state of reality in a code change or an environment change often lead to a need to troubleshoot. Simplifying, controlling, and logging such changes can reduce the need for troubleshooting, and make it easier when it happens.

We’ve looked at some steps you can take to make the troubleshooting process clear and understandable to novices, so that they, too, can become effective at solving problems. Adopting a systematic approach to troubleshooting—as opposed to relying on luck or experience—can help bound your services’ time to recovery, leading to a better experience for your users.

58 Indeed, using only first principles and troubleshooting skills is often an effective way to learn how a system works; see Accelerating SREs to On-Call and Beyond.

59 See https://en.wikipedia.org/wiki/Hypothetico-deductive_model.

60 For instance, exported variables as described in Practical Alerting from Time-Series Data.

61 Attributed to Theodore Woodward, of the University of Maryland School of Medicine, in the 1940s. See https://en.wikipedia.org/wiki/Zebra_(medicine). This works in some domains, but in some systems, entire classes of failures may be eliminable: for instance, using a well-designed cluster filesystem means that a latency problem is unlikely to be due to a single dead disk.

62 Occam’s Razor; see https://en.wikipedia.org/wiki/Occam%27s_razor. But remember that it may still be the case that there are multiple problems; in particular, it may be more likely that a system has a number of common low-grade problems that, taken together, explain all the symptoms rather than a single rare problem that causes them all. Cf. https://en.wikipedia.org/wiki/Hickam%27s_dictum.

63 Of course, see https://xkcd.com/552.

64 At least, we have no plausible theory to explain why the number of PhDs awarded in Computer Science in the US should be extremely well correlated (r² = 0.9416) with the per capita consumption of cheese, between 2000 and 2009: https://tylervigen.com/view_correlation?id=1099.

65 It may be useful to refer prospective bug reporters to [Tat99] to help them provide high-quality problem reports.

66 But beware false correlations that can lead you down wrong paths!

67 In many respects, this is similar to the “Five Whys” technique [Ohn88] introduced by Taiichi Ohno to understand the root causes of manufacturing errors.

68 In contrast to RE2, PCRE can require exponential time to evaluate some regular expressions. RE2 is available at https://github.com/google/re2.

69 [All15] observes this is a frequently used heuristic in resolving outages.

70 Using a shared document or real-time chat for notes provides a timestamp of when you did something, which is helpful for postmortems. It also shares that information with others, so they’re up to speed with the current state of the world and don’t need to interrupt your troubleshooting.

71 See also Negative Results Are Magic for more on this point.

72 See [Mea08] on how to think about systems, and also [Coo00] and [Dek14] on the limitations of finding a single root cause instead of examining the system and its environment for causative factors.

73 See https://cloud.google.com/appengine.

74 We have compressed and simplified this case study to aid understanding.

75 While launching with an unidentified bug isn’t ideal, it’s often impractical to eliminate all known bugs. Instead, sometimes we have to make do with second-best measures and mitigate risk as best we can, using good engineering judgment.

76 The datastore lookup can use an index to speed the comparison, but a frequent in-memory implementation is a simple for loop comparison across all the cached objects. If there are only a few objects, it won’t matter that this takes linear time—but this can cause a significant increase in latency and resource usage as the number of cached objects grows.


Copyright © 2017 Google, Inc. Published by O'Reilly Media, Inc. Licensed under CC BY-NC-ND 4.0

Root Cause Analysis

Troubleshooting Problems

  • First Online: 01 January 2013


  • Scott Cromar


Troubleshooting refers to the methods used to resolve problems. People who troubleshoot a lot come up with a set of habits, methods, and tools to help with the process. These provide a standard approach for gathering the necessary information to zero in on the cause of a problem. This standard approach is known as a methodology.




Copyright information

© 2013 Scott Cromar

About this chapter

Cromar, S. (2013). Root Cause Analysis. In: From Techie to Boss. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5933-6_9


DOI: https://doi.org/10.1007/978-1-4302-5933-6_9

Published: 15 April 2013

Publisher Name: Apress, Berkeley, CA

Print ISBN: 978-1-4302-5932-9

Online ISBN: 978-1-4302-5933-6

eBook Packages: Business and Economics, Apress Access Books, Business and Management (R0)




COMMENTS

  1. RCA in IT: Root Cause Analysis for IT Environments

    Root cause analysis (RCA) is a systematic process for finding and identifying the root cause of a problem or event. RCA is based on the basic idea that having a truly effective system means more than just putting out fires all day.

  2. 7 Powerful Root Cause Analysis Tools and Techniques

    Explore 7 powerful RCA techniques to enhance problem-solving. From Fishbone Diagrams to FMEA, unlock effective strategies for identifying root causes.

  3. Root Cause Analysis (RCA) Methods for Effective Problem Solving

    To effectively utilize Root Cause Analysis (RCA), first identify the problem at hand. Determine the specific issue, incident, or failure that needs to be investigated. Clearly define the problem and its impact on your organization’s operations in order to establish a focused and valuable analysis.

  4. Root Cause Analysis: Definition, Process and Tools - Toolshero

    Root Cause Analysis (RCA) is a method of problem solving that aims at identifying the root causes of problems or incidents. RCA is based on the principle that problems can best be solved by correcting their root causes as opposed to other methods that focus on addressing the symptoms of problems or treating the symptoms.

  5. Guide To Root Cause Analysis - Steps, Techniques & Examples

    RCA (Root Cause Analysis) is a structured and effective process to find the root cause of issues in a Software Project team. If performed systematically, it can improve the performance and quality of the deliverables and the processes, not only at the team level but also across the organization.

  6. Root Cause Analysis: Definition, Examples & Methods - Tableau

    Root cause analysis (RCA) is the process of discovering the root causes of problems in order to identify appropriate solutions. RCA assumes that it is much more effective to systematically prevent and solve for underlying issues rather than just treating ad hoc symptoms and putting out fires.

  7. Basic Principle of Root Cause Analysis - GeeksforGeeks

    Root Cause Analysis (RCA) is one of main problem-solving techniques. It basically helps organizations in improving software quality that leads to organizational improvement by reducing number of defects in system. It is a systematic process of identifying major cause (root cause) of problems or defects and a helpful way for resolving them.

  8. Root Cause Analysis - an overview | ScienceDirect Topics

    The goal of root-cause analysis (RCA) is to prevent the problem from reoccurring by finding and eliminating the root causes. In other words, RCA is the application of a set of problem-solving methods based on identifying and eliminating (or correcting) the root causes of a problem.

  9. Google SRE - Troubleshooting Methodology: A Learning Path

    Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct! Instead, your course of action should be to make the system work as well as it can under the circumstances.

  10. Root Cause Analysis - SpringerLink

    Root Cause Analysis. Troubleshooting Problems. Chapter. First Online: 01 January 2013. pp 135–161. From Techie to Boss. Scott Cromar. Abstract. Troubleshooting refers to the methods used to resolve problems.