The Center for Education and Research in Information Assurance and Security (CERIAS)

Causality-Driven Mitigation of Cascading Failures in Distributed Systems

Principal Investigator: Yongle Zhang

Cascading failures are a major cause of large-scale outages in modern cloud systems. They manifest as runaway positive feedback loops, in which failures amplify and replicate across the entire system. The risk of such loops keeps growing as cloud systems become more complex, a trend driven by architectures such as serverless computing and microservices, which inherently feature high degrees of concurrency and interaction. Despite this critical impact, current practice relies primarily on black-box mitigation techniques such as rate limiting and circuit breaking, which often fail to address positive feedback loops that originate from internal system behaviors.

In this project, we design a white-box approach to understanding, detecting, and mitigating positive feedback loops using causality analysis. Because positive feedback loops are fundamentally causal loops in which failures cause further failures of the same type, we can identify them by tracking causality across internal events. We propose the following research thrusts:

(1) Architecture Analysis: Identifying and mitigating intrinsic positive feedback loops. We will advance the understanding of positive feedback loops by identifying those that stem from intrinsic features of distributed systems. We will apply generic causal loop prevention (tracking causality and breaking causal loops) to intrinsic positive feedback loops and explore its limits.

(2) Advanced Testing: Hunting for accidental positive feedback loops. To expose accidental positive feedback loops, we will design novel testing techniques that detect causal loops by causally stitching failure propagations discovered in different fault injection experiments.

(3) Runtime Defense: Controlling emerging positive feedback loops. We will design runtime causal loop detection techniques based on causal inference to detect and control positive feedback loops that escape testing.

(4) Production Diagnosis: Interventional debugging of runaway positive feedback loops. We will design new diagnosis techniques that enable selective and safe online debugging of runaway positive feedback loops.
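
As a rough illustration of the stitching idea in thrust (2), the sketch below merges cause-and-effect pairs observed in separate fault injection experiments into one graph and searches it for a cycle. It is a minimal sketch under stated assumptions, not the project's technique: the failure-type names, the `stitch` and `find_causal_loop` helpers, and the representation of each experiment as a list of (cause, effect) pairs are all hypothetical, and a plain depth-first search stands in for whatever causal analysis the project develops.

```python
from collections import defaultdict


def stitch(experiments):
    """Union the (cause, effect) pairs from all experiments into one graph."""
    graph = defaultdict(set)
    for edges in experiments:
        for cause, effect in edges:
            graph[cause].add(effect)
    return graph


def find_causal_loop(graph):
    """Return one causal loop as a list of failure types, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = defaultdict(int)
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(graph.get(node, ())):
            if color[nxt] == GRAY:
                # Back edge: nxt is already on the current path, so a loop closes.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in sorted(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Hypothetical observations: each experiment injects one fault and records
# which failure type it was seen to cause.
exp_a = [("node_crash", "rebuild_overload")]
exp_b = [("rebuild_overload", "request_timeout")]
exp_c = [("request_timeout", "node_crash")]

loop = find_causal_loop(stitch([exp_a, exp_b, exp_c]))
print(loop)
```

In this toy input, no single experiment contains a loop; the cycle only appears once the three propagation edges are combined, which is the situation the stitching approach is meant to expose.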