CERIAS - Center for Education and Research in Information Assurance and Security

Purdue University

CAREER - Aon - An Integrative Approach to Petascale Fault Tolerance

Principal Investigator: Tom Hacker

Advances in computing power over the past two decades have produced successive generations of increasingly powerful supercomputers. Petascale systems containing tens of thousands of processors have recently emerged. At this scale, frequent component and software faults cause parallel applications to fail often, forcing users to checkpoint at an unsustainable scale and pace, which wastes resources and triggers additional faults. Fault tolerance relies on hardware and software redundancy: spatial, temporal, information coding, and hybrid methods that combine these techniques. This project aims to improve the reliability and efficiency of high performance computing systems through a comprehensive approach to fault detection, prediction, response, and recovery.

Personnel

Students: Jason St. John