Massively parallel systems will occupy the HPC community for the next 20 years, as it designs new generations of applications and simulation platforms.
The challenge is particularly severe for multi-physics, multi-scale simulation platforms.
Combining massively parallel software components that are developed independently of each other is a major concern. Another roadblock relates to legacy codes, since codes must constantly evolve to remain at the forefront of their disciplines.
Meeting these challenges on petascale or exascale computers will require new numerical methods, code architectures, mesh generation tools and visualization tools. Beyond applications, all software layers between the applications and the hardware need to be revisited.
The 4 recognised scientific and technical challenges
Scalability: no runtime environment allows an application to execute on 1 million nodes. There is no known solution for launching 1 million processes on large-scale machines in less than 5 minutes.
Fault tolerance: a must for running applications on 1 million nodes for hours.
Two significant issues:
- the Mean Time Between Failures (MTBF) of very large computers is rapidly diminishing and will soon equal the time required for fault-tolerance systems to simply restart applications
- the exponential increase in the number of transistors and the exponential reduction in their size over time will significantly increase the number of “masked errors” that may go undetected by any system (silent soft errors).
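As a rough illustration of the first issue, the sketch below estimates how the mean time between failures of a whole machine shrinks as nodes are added. It assumes node failures are independent and exponentially distributed, so the system value is simply the per-node value divided by the node count. The per-node figure (5 years) and the restart time (10 minutes) are hypothetical numbers chosen for illustration, not measurements from any real machine.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures of the whole machine, in hours.

    Assumes independent, exponentially distributed node failures,
    in which case failure rates add up across nodes.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_hours / node_count


def restart_dominates(mtbf_hours: float, restart_hours: float) -> bool:
    """True when the machine is expected to fail again before a restart completes."""
    return mtbf_hours <= restart_hours


# Hypothetical inputs, for illustration only.
NODE_MTBF_HOURS = 5 * 365 * 24   # assume each node fails once every 5 years
RESTART_HOURS = 10 / 60          # assume a restart takes 10 minutes

if __name__ == "__main__":
    for nodes in (1_000, 100_000, 1_000_000):
        mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
        status = "restart dominates" if restart_dominates(mtbf, RESTART_HOURS) else "ok"
        print(f"{nodes:>9} nodes: system MTBF = {mtbf * 60:10.2f} min ({status})")
```

Under these assumed inputs, the system-level value falls below the restart time somewhere between 100,000 and 1 million nodes, which is the regime the first issue describes. Real failure behaviour is correlated and hardware dependent, so the model only indicates the trend.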
Programming approach: the programming environment should handle hierarchy, heterogeneity and flexibility, and help the developer make the program scalable.
Other challenges: overhead, computational precision, energy savings, etc.