Fault Determination and Recovery in Cycle-sharing Infrastructures

Research Areas: End System Security

Principal Investigators: Suresh Jagannatan, Jan Vitek

This project is expected to make three broad contributions toward developing PROGNOSIS, a runtime infrastructure for failure data collection and online analysis. The first set of contributions concerns collecting and analyzing system events and failure data from an actual BlueGene/L system over an extended period of time. In addition to presenting the raw system events, we will develop filtering techniques to remove unimportant information, identify stationary intervals, and define which attributes to log and how frequently. The second set of contributions consists of models for online analysis and prediction of evolving failure data; these models exploit correlations between system events over time, across nodes, and with respect to external factors such as imposed workload and operating temperature. The third set of contributions demonstrates the uses of PROGNOSIS. Specifically, this work will extend two important runtime techniques, parallel job scheduling and checkpointing, with the information provided by PROGNOSIS. It will investigate how predictability of failures along spatial and/or temporal dimensions can help schedulers strike a better trade-off between system utilization and job loss upon failures, and it will develop techniques to fine-tune the frequency and location of checkpoints. More importantly, we will evaluate the confidence level that a prediction needs in order to support online decision making, as well as the effect of inaccurate predictions.

Keywords: runtime checking, checkpointing, reliability, recovery