MODA: Monitoring and Operational Data Analytics for HPC Systems


Funding: University of Basel

Duration: 01.08.2020 – present

Project Summary

The goal of this project is to improve HPC operations and research regarding system performance, resilience, and efficiency. The performance optimization aspect targets optimal resource allocation and job scheduling. The resilience aspect strives to ensure orderly operations in the face of anomalies or misuse, and includes security mechanisms against malicious applications. The efficiency aspect addresses resource management and the energy efficiency of HPC systems.

To this end, appropriate techniques are employed to (a) monitor the system and collect data, such as sensor data, system logs, and job resource usage, (b) analyze system data through statistical and machine learning methods, and (c) make control and tuning decisions to optimize the system and avoid waste and misuse of computing power.
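As an illustration only, steps (a) to (c) can be sketched on a toy example. The sketch below is not the project's method: it assumes a simple z-score test stands in for the statistical analysis, and the function name `flag_anomalies`, the threshold, and the power readings are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(readings, threshold=3.0):
    """Return the indices of readings whose z-score exceeds the threshold.

    Hypothetical helper: a z-score test is only one of many possible
    statistical methods and is used here purely for illustration.
    """
    if len(readings) < 2:
        return []  # a sample standard deviation needs at least two values
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # all readings are identical, nothing stands out
    return [i for i, x in enumerate(readings) if abs(x - mu) / sigma > threshold]


# (a) Monitor: made-up node power readings in watts (not real data).
power_watts = [210.0, 212.5, 209.8, 211.1, 210.4, 208.9, 211.7, 350.0, 210.2, 209.5]

# (b) Analyze: flag readings that deviate strongly from the mean.
anomalous = flag_anomalies(power_watts, threshold=2.0)

# (c) Control: a real system would act here (e.g. notify an operator);
# this sketch only reports the flagged sample indices.
print(anomalous)  # prints [7]
```

In practice, a single global mean is sensitive to the very outliers it tries to detect, so a sliding window or a robust statistic such as the median would likely be preferable.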

The operational ideals that this project follows are (a) to gain a data-driven understanding of the system instead of operating it like a black box, (b) to continuously monitor all system states and application behavior, (c) to holistically consider the interaction between system states and application behavior, and (d) to develop solutions that can detect and resolve performance issues autonomously.

Conference Publications

T. Jakobsche, N. Lachiche, and F. M. Ciorba. “Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers”. Second International Symposium on Quantitative Codesign of Supercomputers of the International Conference for High Performance Computing, Networking, Storage and Analysis 2022 (SC22). To appear November 13, 2022. https://arxiv.org/abs/2209.07164. [C67.bib]

T. Jakobsche, N. Lachiche, A. Cavelan, and F. M. Ciorba. “An Execution Fingerprint Dictionary for HPC Application Recognition”. In Proceedings of the Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA 2021) of the IEEE International Conference on Cluster Computing (Cluster 2021), Portland, OR, USA, virtual, September 2021. https://arxiv.org/abs/2109.04766. [C63.bib]

Contributed Conference Presentations

“Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers”
Speaker: T. Jakobsche
Scientific paper presented at the Second International Symposium on Quantitative Codesign of Supercomputers of the International Conference for High Performance Computing, Networking, Storage and Analysis 2022 (SC22). Dallas, TX, USA, November 2022.

“An Execution Fingerprint Dictionary for HPC Application Recognition”
Speaker: T. Jakobsche
Scientific paper presented at the Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA 2021) of the IEEE International Conference on Cluster Computing (Cluster 2021), Portland, OR, USA, virtual, September 2021.

Organization of Workshops

MODA23: Fourth International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2023. May 25, 2023, Hamburg (Germany). https://moda.dmi.unibas.ch.

MODA22: Third International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2022. June 2, 2022, Hamburg (Germany). https://moda.dmi.unibas.ch.

MODA21: Second International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2021. July 2, 2021, virtual. https://moda21.sciencesconf.org.

MODA20: First International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2020. June 25, 2020, Frankfurt (Germany). https://moda20.sciencesconf.org.