CAREER - Aon - An Integrative Approach to Petascale Fault Tolerance
Principal Investigator: Tom Hacker
Advances in computing power over the past two decades have driven successive generations of powerful supercomputers. Petascale systems containing tens of thousands of processors have recently emerged. At this scale, frequent component and software faults cause parallel applications to fail often, forcing users to checkpoint at an unsustainable scale and pace, which wastes resources and triggers additional faults. Fault tolerance relies on hardware and software redundancy: spatial, temporal, information coding, and hybrid methods that combine these techniques. This project aims to improve the reliability and efficiency of high performance computing systems through a comprehensive approach to fault detection, prediction, response, and recovery.
Students: Jason St. John