Cross-cutting issues recommendations
Big Data for extreme computing, Resilience, and Uncertainties and Validation are key issues for which the working groups concluded that urgent recommendations should be proposed.
This section reports the work done by EESI2 in the first year of activity.
The activity first focused on sharpening the state of the art of the topics addressed, then moved on to understanding the evolution of these domains, identifying gaps and deriving recommendations for approaching the Exascale goal.
As the activity was new to EESI2, the gap analysis will have to be refined in the next reporting period.
Discover the WP5/D5.1 gap analysis and recommendations
End-to-end techniques, transformational algorithms addressing extreme concurrency, asynchrony and resilience, advanced analytics algorithms, and metadata specification are, to name a few, key elements that need to be further investigated:
Set up actions to address end-to-end techniques for efficient disruptive I/O and data analysis. These should describe the full life-cycle of data for a set of applications, in order to produce highly parallel data workflows that are consistent all the way from the production of the data to its analysis, while considering locality, structures, metadata, access rights, quality of service, sharing, etc.
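As a toy illustration of keeping data and metadata consistent from production to analysis, the sketch below uses a hypothetical `DataProduct` container whose metadata and provenance travel with the payload through each workflow stage (all names are invented for this example):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class DataProduct:
    """Hypothetical container: payload plus metadata that travel together."""
    payload: Any
    metadata: Dict[str, Any] = field(default_factory=dict)
    provenance: List[str] = field(default_factory=list)

def produce(values):
    dp = DataProduct(payload=values, metadata={"units": "K", "owner": "sim"})
    dp.provenance.append("produced:simulation")
    return dp

def transform(dp):
    # Each stage records itself in the provenance instead of stripping it.
    return DataProduct(payload=[v - 273.15 for v in dp.payload],
                       metadata={**dp.metadata, "units": "C"},
                       provenance=dp.provenance + ["transformed:to_celsius"])

def analyse(dp):
    return sum(dp.payload) / len(dp.payload), dp.provenance

mean, prov = analyse(transform(produce([300.0, 310.0])))
print(mean, prov)
```

The point of the sketch is that the analysis stage can still see the units and the full production history, because the container, not the workflow stages, owns the metadata.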
Promote research in transformational algorithms to address fundamental challenges in extreme concurrency, asynchronous parallel data movement and access patterns, and new alternative execution models supporting asynchronous irregular applications and resilience, in order to enhance data analytics and computational methods in big-data scientific applications.
Promote research in advanced data analytics algorithms and techniques, adopting new disruptive methodologies, to face the analysis of the big-data deluge advancing across the scientific disciplines.
This research should also promote and support the adoption of efficient metadata specification, management and interoperability in different scientific disciplines, as a key element to govern the scientific discovery process.
Uncertainty analysis methodologies in academic and industrial studies require new competences, different from those needed for simulation code development.
Numerical methods need to integrate model errors beyond the traditional uncertainty parameter studies.
Tools and middleware will facilitate a wider usage of uncertainty methodologies.
Additionally, efficient parallelisation requires multiple levels of parallelism, for which new development tools are needed.
It is important to adopt uncertainty analysis in academic and industrial studies. Using uncertainty analysis methodologies requires competences that are somewhat different from those required to develop a simulation code, so a key issue is training.
Investment is also required in numerical methods. In order to deploy uncertainty analysis on highly CPU-consuming codes, two strategies should be followed: improving adaptive designs of experiments and progressing on surrogate models.
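The surrogate-model strategy can be illustrated with a minimal sketch: the "expensive" code is evaluated only at a small design of experiments, and later uncertainty studies query a cheap interpolant instead of re-running it (the polynomial stand-in and all names are invented for this example):

```python
def expensive_simulation(x):
    # Stand-in for a highly CPU-consuming code; kept cheap for illustration.
    return 3 * x**2 + 2 * x + 1

# Design of experiments: evaluate the expensive code at a few chosen points.
design = [0.0, 1.0, 2.0]
samples = [(x, expensive_simulation(x)) for x in design]

def surrogate(x):
    # Lagrange interpolation through the sampled points; uncertainty studies
    # then query this cheap model instead of re-running the expensive code.
    total = 0.0
    for i, (xi, yi) in enumerate(samples):
        weight = 1.0
        for j, (xj, _) in enumerate(samples):
            if i != j:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

print(surrogate(1.5), expensive_simulation(1.5))  # 10.75 10.75
```

Adaptive designs of experiments refine the choice of sample points where the surrogate is least trustworthy, which is exactly where the two strategies mentioned above meet.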
Furthermore, traditional uncertainty analysis deals mostly with parameter uncertainty; huge progress in the validation of scientific codes would be achieved by also taking model errors into account.
Software tools are very important for disseminating uncertainty analysis in the numerical simulation community. Investment in tools and middleware that take resilience to failures into account will make uncertainty tools more robust and therefore facilitate their wider usage.
Last, modern multiphysics computations involve multiple levels of parallelism (domain decomposition, code coupling, multiscale, etc.). Support the development of tools that allow these levels of parallelism to be combined with the parallelism of the design of experiments, for efficient parallelisation of the ensemble.
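A minimal sketch of the outer, ensemble level of parallelism, assuming each member would in practice be a parallel (e.g. MPI) run in its own right; names and the toy computation are invented for this example:

```python
from concurrent.futures import ThreadPoolExecutor

def run_member(params):
    # In a real study each ensemble member is itself a parallel run
    # (domain decomposition, code coupling, ...); here a cheap stand-in.
    a, b = params
    return a * a + b

design_of_experiments = [(1, 0), (2, 1), (3, 2)]

# Outer level of parallelism: ensemble members run concurrently;
# map() preserves the order of the design points.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_member, design_of_experiments))

print(results)  # [1, 5, 11]
```

In a production setting the pool would be replaced by a batch scheduler or a resilient task runtime, so that a failed member does not bring down the whole ensemble.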
Power & performance
Joint academia and industry efforts are needed to deliver outcomes such as extensions to the Performance Application Programming Interface (PAPI), best practices and benchmarks that will guide improvements in energy efficiency.
Developers should be trained and supported with a manual of tips and tricks for green programming.
By and large, experts and Centres of Excellence in performance analysis will support the wider community of users in making efficient use of Exascale systems.
Support the development of standard interfaces for power monitoring and power management at all levels of the system architecture. This would need to involve industry and academia. This joint effort will have several outcomes, which could include extensions to performance monitoring standards, such as the Performance Application Programming Interface (PAPI), or the creation of a set of best practices on how to operate systems in an energy-efficient manner. This effort should also produce energy efficiency benchmarks to guide and monitor the improvements in energy efficiency.
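As an illustration only, energy measurement of a code region can be sketched as below; the reader function is injected so that a real counter, such as the `energy_uj` files exposed by the Linux powercap interface, could replace the portable stand-in used here (all values are invented for this example):

```python
import time

def measure_energy(read_energy_uj, workload):
    # Wrap a code region with before/after readings of a monotonically
    # increasing microjoule counter (as exposed by powercap's energy_uj).
    before = read_energy_uj()
    t0 = time.perf_counter()
    result = workload()
    elapsed = time.perf_counter() - t0
    return result, read_energy_uj() - before, elapsed

# Portable stand-in for the counter; on Linux one could instead read
# /sys/class/powercap/intel-rapl:0/energy_uj (values are illustrative).
fake_counter = iter([1_000_000, 4_000_000])
result, energy_uj, elapsed = measure_energy(
    lambda: next(fake_counter), lambda: sum(range(1000)))
print(result, energy_uj)  # 499500 3000000
```

A standard interface would pin down exactly this kind of counter semantics (units, wrap-around, sampling granularity) so that tools and schedulers could consume them uniformly across vendors.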
Define a major training and education initiative to prepare developers to face the power wall challenge by applying energy-aware programming techniques. A manual of tips and tricks for green programming would also be an extremely valuable resource for the HPC community.
More experts and professional HPC developers are needed to support the wider community in making more efficient use of the expensive Peta- and Exascale systems.
Centres of Excellence in performance analysis should be created to help users get acquainted with the available tools, with one-to-one hands-on tutorials provided by tools experts. Ideally these would be based on the users’ own codes.
Improve checkpoint/restart performance by improving multi-level checkpoint/restart (minimising the overhead of copying checkpoint images between the different storage levels) and by leveraging application and data properties (such as memory access patterns, or redundancy across the data structures of multiple processes) to enhance checkpointing asynchrony; this may show ways to improve protocols in the presence of fail-stop errors.
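A minimal sketch of the multi-level idea, with in-memory stands-in for both storage levels: the checkpoint is written synchronously to the fast level and drained asynchronously to the safe level, so that computation can resume immediately (storage names and structures are illustrative):

```python
import copy
import threading

level1 = {}        # fast, unreliable storage (e.g. node-local SSD / NVRAM)
level2 = {}        # slow but safe storage (e.g. the parallel file system)
drain_threads = []

def checkpoint(step, state):
    # Cheap synchronous write to the fast level...
    level1[step] = copy.deepcopy(state)
    # ...then drain to the safe level asynchronously so computation resumes.
    t = threading.Thread(
        target=lambda: level2.__setitem__(step, copy.deepcopy(level1[step])))
    t.start()
    drain_threads.append(t)

state = {"iteration": 0, "field": [0.0] * 4}
for step in range(3):
    state["iteration"] = step
    # ...compute on `state` here...
    checkpoint(step, state)

for t in drain_threads:   # at shutdown, wait for the drain to finish
    t.join()
print(sorted(level2))     # every checkpoint reached the safe level
```

Minimising the cost of that inter-level copy, and overlapping it with computation, is precisely the overhead this recommendation targets.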
Improve fault tolerance protocols to increase system efficiency and execution recovery performance in the presence of fail-stop errors. This requires understanding how message logging can accelerate recovery and avoid state inconsistency, how partial restart can improve system efficiency, and how to exploit new MPI concepts such as neighborhood collectives and RMA. More fundamentally, further exploration is needed to refine the notion of global state consistency in the context of HPC executions and to take advantage of it.
Investigate alternatives to checkpoint/restart. This covers improving fault tolerance approaches based on task-based programming/execution models, developing new concepts of application-level process migration, and improving replication to reduce its resource overhead. Develop a fault-aware software stack: all software involved in resilience (applications, runtimes, OS, etc.) should be fault aware, and a notification/coordination infrastructure should guarantee relevant and consistent notifications, decisions and actions between these software layers.
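A toy sketch of the task-based alternative: when tasks are idempotent, only the failed task is re-executed instead of restarting the whole run (the fail-stop error is simulated, and all names are invented for this example):

```python
def run_task(task_id, fail_once):
    # Idempotent task: re-executing it after a failure is safe.
    if task_id in fail_once:
        fail_once.discard(task_id)
        raise RuntimeError("fail-stop error in task %d" % task_id)
    return task_id * task_id

def run_with_recovery(task_ids, fail_once):
    results, pending = {}, list(task_ids)
    while pending:
        tid = pending.pop(0)
        try:
            results[tid] = run_task(tid, fail_once)
        except RuntimeError:
            pending.append(tid)  # only the failed task is re-executed
    return results

results = run_with_recovery([1, 2, 3], fail_once={2})
print(results)
```

The design choice is that recovery state lives in the task graph rather than in a global checkpoint, which is what makes the restart partial instead of global.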
Improve failure prediction and proactive actions. There are essentially two main research problems:
- significantly increasing the number of correctly predicted failures, and
- designing failure prediction workflows that work with extremely large and growing system data sets (>1 GB per day).
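A minimal sketch of the prediction idea, assuming a predictor that raises an alert when correctable-error events cluster on one node within a sliding time window (the event format and thresholds are invented for this example):

```python
from collections import deque

def predict_failures(events, window=5, threshold=3):
    # Alert when `threshold` events hit the same node within `window`
    # time units of the (timestamp, node) event stream.
    recent = deque()
    alerts = []
    for t, node in events:
        recent.append((t, node))
        while recent and recent[0][0] <= t - window:
            recent.popleft()          # expire events outside the window
        if sum(1 for _, n in recent if n == node) >= threshold:
            alerts.append((t, node))
    return alerts

events = [(1, "n1"), (2, "n2"), (3, "n1"), (4, "n1"), (9, "n1")]
print(predict_failures(events))  # [(4, 'n1')]
```

The streaming, bounded-memory structure is what lets such a predictor keep up with system logs that grow by more than a gigabyte per day.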
Improve resilient algorithms for fail-stop errors and data corruption, and their integration in the global resilience design. In particular, the composability of resilience algorithms with other fault tolerance solutions should be explored. Global state consistency in the context of HPC could deliver further advantages.
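One classic family of resilient algorithms is algorithm-based fault tolerance (ABFT), where checksum rows and columns added to a matrix allow a single silent data corruption to be located and corrected; a toy sketch:

```python
def add_checksums(m):
    # Append a column of row sums, then a row of column sums.
    rows = [row + [sum(row)] for row in m]
    rows.append([sum(col) for col in zip(*rows)])
    return rows

def detect_and_correct(a):
    n, p = len(a) - 1, len(a[0]) - 1
    # A single corrupted entry fails exactly one row and one column checksum.
    bad_row = next((i for i in range(n) if sum(a[i][:p]) != a[i][p]), None)
    bad_col = next((j for j in range(p)
                    if sum(a[i][j] for i in range(n)) != a[n][j]), None)
    if bad_row is not None and bad_col is not None:
        a[bad_row][bad_col] = a[bad_row][p] - sum(
            a[bad_row][j] for j in range(p) if j != bad_col)
    return a

a = add_checksums([[1, 2], [3, 4]])
a[0][1] = 9          # inject a single silent data corruption
detect_and_correct(a)
print(a[0][1])       # restored to 2
```

The appeal for composability is that the checksums are maintained by the numerical algorithm itself, so ABFT can run underneath checkpoint/restart or replication without coordinating with them.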
Task-based programming/execution models and application-level process migration could improve on replication and reduce its overheads.
Other directions include a fault-aware software stack, where a notification/coordination infrastructure could guarantee consistent actions between software layers.
Failure prediction should produce fewer false alerts and work better with extremely large system data sets, above 1 GB per day.
Resilient algorithms for fail-stop errors and data corruption, and their integration in the global resilience design, are to be improved.
The following recommendations are suggested to ease the take-up of disruptive technologies whenever they become available:
- I/O and memory disruption: analyse alternatives to parallel file systems, provide application I/O functionalities at a higher level (data containers) or with better data locality, and promote tiered memory and I/O systems
- Cooling technologies and facility management: develop energy-aware monitoring systems, schedulers and applications; join efforts with energy companies, or re-use waste heat
- Network infrastructure: adaptive topologies to enhance routing capability, active network chips to process data on the fly, and direct end-to-end data exchange technology could reduce overhead and congestion
- Data transfer: the adoption of photonic technology and the synergy between HPC and Big Data may accelerate the roadmap
- Semiconductor technology: fine-grained resource management could help master extreme parallelism, while tiny "codelets" such as dataflow tasks, or intelligent scheduling close to the data, could avoid costly data movements.