Improve checkpoint/restart performance. Two directions stand out: improving multi-level checkpoint/restart by minimizing the overhead of copying checkpoint images between storage levels, and leveraging application and data properties (such as memory access patterns and redundancy across the data structures of multiple processes) to enhance checkpointing asynchrony. This work may also show ways to improve protocols in the presence of fail-stop errors.
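
As a rough illustration of the multi-level idea, the sketch below writes each checkpoint to a fast level and copies it to a slower level in a background thread, so the application blocks only for the fast write. The class name `TwoLevelCheckpointer`, the directory layout, and the use of `pickle` are assumptions made for this example; they do not describe any particular checkpointing library.

```python
import os
import pickle
import shutil
import threading


class TwoLevelCheckpointer:
    """Hypothetical two-level checkpointer: fast local level, slower shared level."""

    def __init__(self, fast_dir, slow_dir):
        self.fast_dir = fast_dir
        self.slow_dir = slow_dir
        self._copier = None  # background thread for the most recent flush

    def save(self, step, state):
        name = f"ckpt_{step:06d}.pkl"
        fast_path = os.path.join(self.fast_dir, name)
        tmp_path = fast_path + ".tmp"
        # Blocking part: write to the fast level, then publish atomically.
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, fast_path)
        # Allow at most one copy in flight: wait for the previous flush.
        self.wait()
        # Asynchronous part: copy the image to the slow level in the background.
        self._copier = threading.Thread(target=self._flush, args=(fast_path, name))
        self._copier.start()
        return fast_path

    def _flush(self, fast_path, name):
        slow_tmp = os.path.join(self.slow_dir, name + ".tmp")
        shutil.copyfile(fast_path, slow_tmp)
        os.replace(slow_tmp, os.path.join(self.slow_dir, name))

    def wait(self):
        """Block until the pending copy to the slow level, if any, has finished."""
        if self._copier is not None:
            self._copier.join()
            self._copier = None

    def load_latest(self):
        """Restore from the fast level if possible, else from the slow level."""
        for level in (self.fast_dir, self.slow_dir):
            names = sorted(n for n in os.listdir(level) if n.endswith(".pkl"))
            if names:
                with open(os.path.join(level, names[-1]), "rb") as f:
                    return pickle.load(f)
        return None
```

In this sketch the cost visible to the application is the fast-level write plus any wait for the previous copy; reducing that residual wait is one instance of the overhead the paragraph above refers to.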

Improve fault tolerance protocols to increase system efficiency and recovery performance in the presence of fail-stop errors. This requires understanding how message logging can accelerate recovery from state inconsistency, how to leverage partial restart to improve system efficiency, and how to exploit newer MPI concepts such as neighborhood collectives and RMA (remote memory access). More fundamentally, further exploration is needed to refine the notion of global state consistency in the context of HPC executions and to take advantage of it.

Investigate alternatives to checkpoint/restart. This covers improving fault tolerance approaches based on task-based programming/execution models, developing new concepts for application-level process migration, and improving replication to reduce its resource overhead.

Develop a fault-aware software stack. All software involved in resilience (applications, runtime, OS, etc.) should be fault aware, and a notification/coordination infrastructure should guarantee relevant and consistent notifications, decisions, and actions across these software layers.
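
To make the notification/coordination idea concrete, here is a minimal sketch in which every software layer registers a handler and all layers receive the same sequence-numbered fault event in the same order. The names `FaultEvent` and `FaultNotifier`, and the idea of returning one decision per layer, are assumptions for illustration only; a real infrastructure would also have to handle distribution, failures of the notifier itself, and conflicting decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    seq: int       # global sequence number, so all layers agree on event order
    kind: str      # e.g. "fail-stop" or "data-corruption"
    location: str  # e.g. a node or process identifier


class FaultNotifier:
    """Hypothetical in-process notification hub shared by all software layers."""

    def __init__(self):
        self._subscribers = []  # (layer name, callback) in registration order
        self._next_seq = 0
        self.history = []       # every event published so far

    def subscribe(self, layer, callback):
        self._subscribers.append((layer, callback))

    def publish(self, kind, location):
        event = FaultEvent(self._next_seq, kind, location)
        self._next_seq += 1
        self.history.append(event)
        # Every layer sees the identical event; collect each layer's decision.
        return {layer: callback(event) for layer, callback in self._subscribers}


notifier = FaultNotifier()
notifier.subscribe("runtime", lambda e: f"respawn:{e.location}")
notifier.subscribe("application",
                   lambda e: "rollback" if e.kind == "fail-stop" else "ignore")
decisions = notifier.publish("fail-stop", "node17")
```

Collecting the decisions in one place is what would let a coordinator check that the layers' reactions are mutually consistent before any of them acts.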

Improve failure prediction and proactive actions. There are two main research problems:

  • significantly increase the number of correctly predicted failures, and
  • design failure prediction workflows that cope with extremely large and growing system data sets (more than 1 GB per day).
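
One common way to quantify the first problem is with precision (the share of alerts that were followed by a failure) and recall (the share of failures that were predicted). The sketch below computes both under an assumed matching rule: an alert counts as correct if a failure occurs on the same node within `window` seconds after it. The function name, the event format, and the matching rule are assumptions for this example.

```python
def score_predictions(predicted, actual, window):
    """Return (precision, recall) for failure alerts.

    predicted, actual: lists of (node, time_in_seconds) tuples.
    An alert is correct if a failure occurs on the same node
    between 0 and `window` seconds after the alert.
    """
    caught = set()   # indices of failures covered by at least one alert
    true_alerts = 0  # alerts followed by a matching failure
    for node, t_pred in predicted:
        hits = [i for i, (n, t_fail) in enumerate(actual)
                if n == node and 0 <= t_fail - t_pred <= window]
        if hits:
            true_alerts += 1
            caught.update(hits)
    precision = true_alerts / len(predicted) if predicted else 0.0
    recall = len(caught) / len(actual) if actual else 0.0
    return precision, recall


alerts = [("n1", 100), ("n2", 200), ("n3", 300)]
failures = [("n1", 150), ("n3", 1000), ("n4", 400)]
precision, recall = score_predictions(alerts, failures, window=120)
```

With these made-up inputs only the alert on `n1` is followed by a failure inside the window, so one of three alerts is correct and one of three failures is predicted.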

Improve resilient algorithms for fail-stop errors and data corruption, and their integration into the global resilience design. In particular, the composability of resilience algorithms with other fault tolerance solutions should be explored. A refined notion of global state consistency in the context of HPC could deliver further advantages.

Task-based programming/execution models and application-level process migration could improve on replication and reduce overheads.

Other directions include a fault-aware software stack, in which a notification/coordination infrastructure could guarantee consistent actions across software layers.

Failure prediction should produce fewer incorrect alerts and scale better to extremely large system data sets (more than 1 GB per day).

Resilient algorithms for fail-stop errors and data corruption, and their integration into the global resilience design, need to be improved.