Root Cause Analysis (RCA) is a method of IT problem-solving. It is typically applied to identifying the origin of problems in a network that are impacting the delivery of services.
RCA must be performed systematically and is therefore often thought of in terms of “The Five Whys.” The Five Whys concept drills down from the initial appearance of the problem to identifying the incident that created the problem in the first place. You keep asking “Why” until the root cause has been determined.
It’s like recreating a series of domino collisions to determine why the last domino fell. When you retrace the collisions, you’ll eventually find what triggered the first domino in the chain reaction.
Here’s a broad example:
People are reporting that a business portal isn’t working…
1. Why? The e-commerce service is down.
2. Why? The system is unresponsive to user entries.
3. Why? A large server in New York failed.
4. Why? It couldn’t handle a surge of data that occurred at 9 am.
5. Why? It was not properly serviced according to the maintenance procedure.
In reality, a big data RCA solution could have hundreds of "Whys," whereas more straightforward problems, like a server that needs to be rebooted, might be identified with just one or two.
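The five-step example above can be thought of as walking a chain of recorded causes until nothing further is known. What follows is only a minimal sketch: the `CAUSED_BY` mapping and the `ask_why` function are hypothetical names invented for illustration, and a real RCA tool would derive these links from events and topology rather than from a hand-written table.

```python
# Hypothetical cause map: each symptom points to the answer to "Why?".
CAUSED_BY = {
    "portal not working": "e-commerce service down",
    "e-commerce service down": "system unresponsive to user entries",
    "system unresponsive to user entries": "New York server failed",
    "New York server failed": "9 am data surge overwhelmed server",
    "9 am data surge overwhelmed server": "maintenance procedure not followed",
}


def ask_why(symptom, caused_by):
    """Follow the chain of causes until no further cause is recorded."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:  # guard against circular cause records
            break
        chain.append(cause)
        seen.add(cause)
    return chain


chain = ask_why("portal not working", CAUSED_BY)
root_cause = chain[-1]  # the last link is the deepest known cause
```

The number of "Whys" is simply the length of the chain minus the starting symptom, so a one-step problem and a hundred-step problem are handled by the same loop.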
The Importance of Root Cause Analysis
Naturally, your first "Why" might be, “Why is RCA important?” RCA is important in IT service operations because it helps identify—and subsequently resolve—problems that impact revenue-generating services.
The infrastructure underpinning today's business services generates tidal waves of events. In turn, monitoring and management systems create alerts that provide a window into all of the problems, roadblocks, challenges, and of course, reboots, that are impacting the delivery of your services.
These millions of raw events and associated alerts are inundating even the most well-equipped and staffed IT teams. As the deluge continues, it often becomes impossible to get a handle on ways to address the problem. IT managers need answers for questions such as, “Which business services and users are impacted—and to what degree? What is really causing this event? How quickly can we identify the source of the problem and fix it?”
The necessity of filtering out irrelevant events to reveal true root cause—and its true business impact—is the heart of the concept called Clean Signal. Let’s take a look at what we mean when we talk about Clean Signal.
Clean Signal Equals a Clear Understanding
The holy grail of IT service monitoring is to determine the true root cause and obtain a clear understanding of the affected business service. That’s the Clean Signal. But once you find the Clean Signal, what happens next—remediation and prevention—is how you actually derive value from your RCA technology.
Your monitoring system must know the network configuration. For instance, when an upstream device fails, the devices downstream of it are very likely to be affected as well, so their alerts point back to the upstream failure rather than to separate problems.
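One way to picture the upstream/downstream idea is to keep only the alerting devices that have no alerting device above them. This is a simplified sketch, not how any particular product works: the `UPSTREAM` map, the device names, and `root_cause_devices` are all assumptions made for illustration, and the topology is assumed to be a tree with no loops.

```python
# Hypothetical topology: each device maps to the device directly upstream.
# Assumed to be a tree (no cycles); None marks the top of the network.
UPSTREAM = {
    "core-router": None,
    "dist-switch": "core-router",
    "access-switch": "dist-switch",
    "web-server": "access-switch",
}


def root_cause_devices(alerting, upstream):
    """Return alerting devices that have no alerting ancestor upstream."""
    roots = set()
    for device in alerting:
        parent = upstream.get(device)
        # Climb until we hit another alerting device or run out of parents.
        while parent is not None and parent not in alerting:
            parent = upstream.get(parent)
        if parent is None:  # reached the top without meeting another alert
            roots.add(device)
    return roots


alerts = {"dist-switch", "access-switch", "web-server"}
roots = root_cause_devices(alerts, UPSTREAM)
```

In this sketch the alerts from `access-switch` and `web-server` are treated as side effects of the `dist-switch` failure, which is the only device left in `roots`.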
To provide true root cause analysis, your management and monitoring technology must also understand the applications being run and the dependencies between them. For example, email cannot be delivered without a functioning DNS service. But since DNS works behind the scenes, one may think the problem resides on the email server when it’s actually a DNS issue.
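The email and DNS case can be sketched as a lookup over a service dependency map. Everything named here (`DEPENDS_ON`, `failing_dependencies`, the `health` table) is hypothetical and only illustrates the idea of checking what a failing service relies on before blaming the service itself.

```python
# Hypothetical dependency map: service -> services it needs in order to work.
DEPENDS_ON = {
    "email": ["dns", "mail-server"],
    "mail-server": [],
    "dns": [],
}


def failing_dependencies(service, depends_on, is_healthy):
    """Return unhealthy services that `service` relies on, directly or not."""
    found = []
    visited = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep in visited:
            continue
        visited.add(dep)
        if not is_healthy(dep):
            found.append(dep)
        # Also inspect whatever this dependency itself relies on.
        stack.extend(depends_on.get(dep, []))
    return found


# Assumed health snapshot: email looks broken, but the mail server is fine.
health = {"email": False, "mail-server": True, "dns": False}
suspects = failing_dependencies("email", DEPENDS_ON, lambda s: health[s])
```

Here the mail server reports healthy while DNS does not, so the sketch points at DNS as the place to look first.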
The knowledge base that determines these correlations should also supply established threshold levels and other pertinent data to a rules-based event-processing environment.
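As a rough illustration of threshold levels in a rules-based setup, the sketch below checks an incoming event against a small rule list. The `Rule` class, the metric names, and the threshold values are made up for the example and do not come from any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # name of the metric the rule watches
    threshold: float  # value above which the rule fires
    severity: str     # label attached to the resulting alert


# Hypothetical rules; real thresholds would come from the knowledge base.
RULES = [
    Rule("cpu_percent", 90.0, "critical"),
    Rule("disk_percent", 80.0, "warning"),
]


def evaluate(event, rules):
    """Return the severity of every rule whose threshold the event exceeds."""
    return [
        rule.severity
        for rule in rules
        if rule.metric == event["metric"] and event["value"] > rule.threshold
    ]


fired = evaluate({"metric": "cpu_percent", "value": 97.5}, RULES)
quiet = evaluate({"metric": "disk_percent", "value": 42.0}, RULES)
```

Events that cross no threshold produce an empty list, which is one simple way to keep irrelevant events out of the operator's view.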
Automated Remediation to Reduce Operator Workload
Once you identify the root cause, you still need to take steps to correct the issue. And, as if you needed another challenge, you also want to make sure the remediation has minimal impact on the business service.
Automated remediation, which draws on a library of workflows at both the technical and the business level, is the crucial step in reducing operator workload. Many issues that operators deal with are resolved with a simple reset of the culprit service or a reboot of a given server. Automated remediation takes care of this, saving you time and money.
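A workflow library can be pictured as a lookup from a diagnosed root cause to a corrective action. The sketch below is illustrative only: the cause labels, the `WORKFLOWS` table, and `remediate` are assumed names, and the workflows merely describe the action as text instead of touching any real system.

```python
# Hypothetical workflow library: root-cause label -> function describing the fix.
# These only return a description; a real workflow would perform the action.
WORKFLOWS = {
    "service_hung": lambda target: f"restart service {target}",
    "server_unresponsive": lambda target: f"reboot server {target}",
}


def remediate(root_cause, target, workflows):
    """Return the planned action, or None when an operator must step in."""
    workflow = workflows.get(root_cause)
    if workflow is None:
        return None  # no automated fix known; escalate to a person
    return workflow(target)


action = remediate("service_hung", "checkout-api", WORKFLOWS)
escalated = remediate("disk_failure", "db-01", WORKFLOWS)
```

Returning `None` for unknown causes keeps the routine resets and reboots automated while leaving anything unfamiliar to an operator.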
The Clean Signal gets the right information to the right person and results in measurable gains in your service uptime.