MODA: Monitoring and Operational Data Analytics for HPC Systems


Funding: University of Basel

Duration: 01.08.2022 – present

Project Summary

The goal of this project is to improve HPC operations and research with regard to system performance, resilience, and efficiency. The performance aspect targets optimal resource allocation and job scheduling. The resilience aspect strives to ensure orderly operations in the face of anomalies or misuse, which includes security mechanisms against malicious applications. The efficiency aspect concerns the resource management and energy efficiency of HPC systems.

To this end, appropriate techniques are employed to (a) monitor the system and collect data, such as sensor data, system logs, and job resource usage, (b) analyze system data through statistical and machine learning methods, and (c) make control and tuning decisions to optimize the system and avoid waste and misuse of computing power.
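As an illustration only, the monitor, analyze, and decide steps above could be sketched as follows. This is a minimal sketch under stated assumptions and not the project's actual tooling: the node names, the power readings, the z-score threshold of 1.5, and the `flag_for_review` action are all hypothetical, and a simple z-score test stands in for the statistical and machine learning methods mentioned above.

```python
from statistics import mean, stdev


def find_anomalous_nodes(power_watts, threshold=1.5):
    """Return the names of nodes whose reading deviates from the fleet
    mean by more than `threshold` sample standard deviations.

    `power_watts` maps a node name to one sensor reading (hypothetical data).
    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    values = list(power_watts.values())
    if len(values) < 2:
        return []  # a standard deviation needs at least two samples
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all readings are identical, so nothing stands out
    return sorted(
        node for node, value in power_watts.items()
        if abs(value - mu) / sigma > threshold
    )


def decide_actions(anomalous_nodes):
    """Map each anomalous node to a placeholder control decision."""
    return {node: "flag_for_review" for node in anomalous_nodes}


# (a) Monitor: one made-up power sample per node, in watts.
samples = {
    "node01": 310.0,
    "node02": 305.0,
    "node03": 298.0,
    "node04": 302.0,
    "node05": 307.0,
    "node06": 640.0,
}

# (b) Analyze: find readings far from the fleet average.
flagged = find_anomalous_nodes(samples)

# (c) Decide: attach an action to each flagged node.
actions = decide_actions(flagged)
print(actions)
```

With only a handful of nodes, a single outlier inflates the standard deviation, so a z-score test is weak on small samples; a median-based statistic would be a more robust choice in practice.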

The operational ideals that this project follows are (a) to gain a data-driven understanding of the system instead of operating it like a black box, (b) to continuously monitor all system states and application behavior, (c) to holistically consider the interaction between system states and application behavior, and (d) to develop solutions that can detect and resolve performance issues autonomously.

Publications

T. Jakobsche, N. Lachiche, and F. M. Ciorba. “Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers”. Second International Symposium on Quantitative Codesign of Supercomputers of the International Conference for High Performance Computing, Networking, Storage and Analysis 2022 (SC22). To appear, November 13, 2022. https://arxiv.org/abs/2209.07164. [C67.bib]

T. Jakobsche, N. Lachiche, A. Cavelan, and F. M. Ciorba. “An Execution Fingerprint Dictionary for HPC Application Recognition”. In Proceedings of the Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA 2021) of the IEEE International Conference on Cluster Computing (Cluster 2021), Portland, OR, USA, virtual, September 2021. https://arxiv.org/abs/2109.04766. [C63.bib]

Workshops

MODA22: Third International Workshop on “Monitoring and Operational Data Analytics”.
Jointly held with ISC HPC 2022. June 2, 2022, Hamburg (Germany). https://moda.dmi.unibas.ch.

MODA21: Second International Workshop on “Monitoring and Operational Data Analytics”.
Jointly held with ISC HPC 2021. July 2, 2021, virtual. https://moda21.sciencesconf.org.

MODA20: First International Workshop on “Monitoring and Operational Data Analytics”.
Jointly held with ISC HPC 2020. June 25, 2020, Frankfurt (Germany). https://moda20.sciencesconf.org.