Machine Learning for Automated Root Cause Analysis: Promise and Pain – The New Stack

Let’s envision a world where root causes are instantly identified the moment any system degradation occurs:

Maria, an e-commerce site reliability engineer, wakes up to an alert that the site’s checkout success rate has dropped 15% over the last 30 minutes as failure rates climb above normal. With traditional monitoring tools, troubleshooting this would take hours of manual analysis.

Instead, within seconds, Maria’s AIOps platform sends a notification showing the root cause: A dependency used by the payment microservice has been degraded, slowing transaction-processing times. The latest version of the payment service could not handle the load that the prior version had been serving.

The AIOps platform then details all affected components and APIs involved in this event. With this insight, Maria immediately knows both the blast radius and scope of the issue. She quickly resolves the problem by rolling back the last update made to the payment service, and checkout success rates are restored without any further customer impact. Going from alert to resolution took less than 5 minutes.

This level of automated root cause analysis delivers immense benefits:

This promise seems almost too good to be true. And indeed, multiple barriers obstruct the path to production-grade ML pipelines for root cause analysis.

To understand why, think about your production environment as if it were a car. You’re driving on the freeway when your engine starts rattling, sputtering and eventually stalling. If you were trying to replace your mechanic with an ML algorithm to identify the root cause, what are some of the challenges you might encounter?

Let’s explore the pitfalls that inhibit automated root cause analysis:

1. No machine-readable system topology

ML models can only spot patterns in data they can access. Without an existing topology mapping the thousands of interdependent services, containers, APIs and infrastructure elements, models have no pathway to traverse failures across domains.

Manually creating this topology is remarkably complex and sometimes impossible as production environments dynamically scale across hybrid cloud infrastructure.
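To make "machine-readable topology" concrete, here is a minimal Python sketch. The service names and the `DEPENDS_ON` map are hypothetical and hand-written for illustration; in a real environment this map would have to be discovered and kept current automatically, which is the hard part described above. The sketch only shows what becomes possible once such a map exists: walking it to find the blast radius of a degraded component.

```python
from collections import deque

# Hypothetical topology: each service maps to the components it depends on.
DEPENDS_ON = {
    "checkout": ["payment", "cart"],
    "payment": ["fraud-check", "payments-db"],
    "cart": ["cart-db"],
    "fraud-check": [],
    "payments-db": [],
    "cart-db": [],
}


def invert(depends_on):
    """Build the reverse map: component -> services that depend on it directly."""
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)
    return dependents


def blast_radius(depends_on, degraded):
    """Return every service that depends, directly or transitively, on `degraded`."""
    dependents = invert(depends_on)
    seen = set()
    queue = deque([degraded])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(blast_radius(DEPENDS_ON, "fraud-check")))  # ['checkout', 'payment']
```

Without the map, neither a model nor a script has any way to connect a degraded dependency to the checkout failures it causes.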

2. Root cause inference at scale

Even with a topology, searching it during an incident poses scalability issues. Off-the-shelf ML libraries are not designed for causality analysis at production scale.

To diagnose checkout failure, should we evaluate payment APIs or database clusters? Intuitively, an engineer would prioritize services tied to revenue delivery. But generic ML techniques lack this reasoning, forcing an exponential search across all topology layers — like holding a microphone to every inch of a car engine.

Advanced algorithms are needed to traverse topology graphs during incidents, weighing and filtering options based on business criticality. Both simple and intricate failure chains must be unpacked, all before revenue and trust disappear.
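One way to picture criticality-weighted traversal is a best-first search that starts at the service showing symptoms and always expands the most suspicious dependency next, under a fixed inspection budget. The sketch below is an illustration under stated assumptions, not the method of any particular AIOps product: the topology, the `CRITICALITY` and `ANOMALY` values, and the choice to score a node as criticality times anomaly are all hypothetical.

```python
import heapq

# Hypothetical topology and per-service signals, for illustration only.
DEPENDS_ON = {
    "checkout": ["payment", "cart"],
    "payment": ["fraud-check", "payments-db"],
    "cart": ["cart-db"],
    "fraud-check": [],
    "payments-db": [],
    "cart-db": [],
}
CRITICALITY = {"checkout": 1.0, "payment": 0.9, "cart": 0.4,
               "fraud-check": 0.7, "payments-db": 0.8, "cart-db": 0.3}
ANOMALY = {"checkout": 0.6, "payment": 0.8, "cart": 0.1,
           "fraud-check": 0.9, "payments-db": 0.2, "cart-db": 0.0}


def rank_suspects(start, depends_on, criticality, anomaly, budget=4):
    """Best-first walk from the symptomatic service, inspecting at most `budget` nodes.

    Assumed scoring: criticality * anomaly. heapq is a min-heap, so scores
    are negated to pop the highest-scoring node first.
    """
    heap = [(-criticality[start] * anomaly[start], start)]
    seen = {start}
    visited = []
    while heap and len(visited) < budget:
        neg_score, node = heapq.heappop(heap)
        visited.append((node, round(-neg_score, 2)))
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                heapq.heappush(heap, (-criticality[dep] * anomaly[dep], dep))
    return visited


for node, score in rank_suspects("checkout", DEPENDS_ON, CRITICALITY, ANOMALY):
    print(node, score)
```

With these made-up numbers the search inspects the payment path and never spends budget on the cart service. The ordering is a heuristic for where to look first; it does not by itself establish causation.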

3. Interpretability for humans

Finally, ML troubleshooting creates a new challenge: how to make inferences understandable to humans. Identifying patterns in metrics data reveals statistical correlations between events, but not causal priority chains:

But this diagnosis doesn’t answer the questions that provide actionable insights to engineers:

Solving this final-mile problem requires models that capture and visualize root-cause probability, business-impact sequencing, risk levels and mitigation recommendations.
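As a rough illustration of what a human-readable result could look like, the Python sketch below turns raw suspicion scores into a ranked report with a dependency path and a suggested action for each candidate. The `Finding` structure, the scores and the mitigation text are hypothetical, and the percentage shown is a simple normalized share of the scores, not a calibrated probability.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    component: str
    score: float      # raw suspicion score (hypothetical, unnormalized)
    path: list        # dependency path from the symptom to the component
    mitigation: str   # suggested next step for the engineer


def to_report(findings):
    """Rank findings by score and render one readable line per candidate."""
    total = sum(f.score for f in findings)
    ranked = sorted(findings, key=lambda f: f.score, reverse=True)
    lines = []
    for rank, f in enumerate(ranked, start=1):
        share = f.score / total if total else 0.0
        lines.append(
            f"{rank}. {f.component} ({share:.0%} of total suspicion) "
            f"via {' -> '.join(f.path)}; suggested action: {f.mitigation}"
        )
    return lines


# Made-up findings for the checkout scenario.
FINDINGS = [
    Finding("payments-db", 3.0, ["checkout", "payment", "payments-db"],
            "check connection pool saturation"),
    Finding("payment", 6.0, ["checkout", "payment"],
            "roll back the latest payment release"),
    Finding("cart", 1.0, ["checkout", "cart"],
            "no action, monitor"),
]

for line in to_report(FINDINGS):
    print(line)
```

The point of the structure is that each candidate carries its path and a next step, so an engineer can act on the ranking instead of reverse-engineering a correlation score.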

While core machine learning techniques show promise, purpose-built solutions are necessary to address the complexity of causality analysis at production scale. Combining specialized topology inference, heuristic graph search algorithms and interpretable data science unlocks the power of automated root cause analysis. But it requires advances in data collection, service mapping, ML and the communication of technical insights — all with the goal of remediation.
