MODA: Monitoring and Operational Data Analytics for HPC Systems
Funding: University of Basel
Duration: 01.08.2020 – present
Project Summary
The goal of this project is to improve HPC operations and research regarding system performance, resilience, and efficiency. The performance optimization aspect targets optimal resource allocation and job scheduling. The resilience aspect strives to ensure orderly operations when facing anomalies or misuse; this includes security mechanisms against malicious applications. The efficiency aspect concerns resource management and the energy efficiency of HPC systems.
To this end, appropriate techniques are employed to (a) monitor the system and collect data, such as sensor data, system logs, and job resource usage, (b) analyze system data through statistical and machine learning methods, and (c) make control and tuning decisions to optimize the system and avoid waste and misuse of computing power.
The operational ideals that this project follows are (a) to gain a data-driven understanding of the system instead of operating it like a black box, (b) to continuously monitor all system states and application behavior, (c) to holistically consider the interaction between system states and application behavior, and (d) to develop solutions that can detect and resolve performance issues autonomously.
Conference Publications
T. Jakobsche and F. M. Ciorba, “Using Malware Detection Techniques for HPC Application Classification,” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC24) Workshops, Third Annual Workshop on Cyber Security in High Performance Computing (S-HPC’24), Atlanta, GA, USA, 2024. [C82.bib] (online)
F. Boito, J. Brandt, V. Cardellini, P. Carns, F. M. Ciorba, H. Egan, A. Eleliemy, A. Gentile, T. Gruber, J. Hanson, U.-U. Haus, K. Huck, T. Ilsche, T. Jakobsche, T. Jones, S. Karlsson, A. Mueen, M. Ott, T. Patki, I. Peng, K. Raghavan, S. Simms, K. Shoga, M. Showerman, D. Tiwari, T. Wilde, K. Yamamoto. “Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations.” In Proceedings of the 10th Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Cluster, Santa Fe, New Mexico, USA, October 2023. [C75.bib] (online)
T. Vasilas, T. Jakobsche, and F. M. Ciorba. “Hot-n-Cold: Mapping the Syscall Attack Surface Using Thermal Side Channels”. In Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Computing (ISPDC), Bucharest, Romania, July 10-12, 2023. [C72.bib] (online)
T. Jakobsche, N. Lachiche, and F. M. Ciorba. “Investigating HPC Job Resource Requests and Job Efficiency Reporting”. In Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Computing (ISPDC), Bucharest, Romania, July 10-12, 2023. [C71.bib] (online)
T. Jakobsche, N. Lachiche, and F. M. Ciorba. “Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers”. Second International Symposium on Quantitative Codesign of Supercomputers of the International Conference for High Performance Computing, Networking, Storage and Analysis 2022 (SC22), Dallas, TX, USA, November 13, 2022. [C67.bib]
T. Jakobsche, N. Lachiche, A. Cavelan, and F. M. Ciorba. “An Execution Fingerprint Dictionary for HPC Application Recognition”. In Proceedings of the Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA 2021) of the IEEE International Conference on Cluster Computing (Cluster 2021), Portland, OR, USA, virtual, September 2021. [C63.bib]
Contributed Conference Presentations
“Using Malware Detection Techniques for HPC Application Classification”
Speaker: T. Jakobsche
Scientific paper presented at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC24) Workshops, Third Annual Workshop on Cyber Security in High Performance Computing (S-HPC’24), Atlanta, GA, USA, 2024.
“Investigating HPC Job Resource Requests and Job Efficiency Reporting”
Speaker: F. M. Ciorba
Scientific paper presented at the 22nd IEEE International Symposium on Parallel and Distributed Computing (ISPDC), Bucharest, Romania, July 10-12, 2023.
“Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers”
Speaker: T. Jakobsche
Scientific paper presented at the Second International Symposium on Quantitative Codesign of Supercomputers of the International Conference for High Performance Computing, Networking, Storage and Analysis 2022 (SC22). Dallas, TX, USA, November 2022.
“An Execution Fingerprint Dictionary for HPC Application Recognition”
Speaker: T. Jakobsche
Scientific paper presented at the Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA 2021) of the IEEE International Conference on Cluster Computing (Cluster 2021), Portland, OR, USA, virtual, September 2021.
Organization of Workshops
MODA25: Sixth International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Thomas Jakobsche, Torsten Wilde, Ann Gentile.
Steering Board: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2025. June 13, 2025, Hamburg (Germany).
MODA24: Fifth International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2024. May 16, 2024, Hamburg (Germany).
MODA23: Fourth International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2023. May 25, 2023, Hamburg (Germany).
MODA22: Third International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2022. June 2, 2022, Hamburg (Germany).
MODA21: Second International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2021. July 2, 2021, virtual.
MODA20: First International Workshop on “Monitoring and Operational Data Analytics”
General Co-Chairs: Florina Ciorba, Utz-Uwe Haus, Nicolas Lachiche, Martin Schulz.
Jointly held with ISC HPC 2020. June 25, 2020, Frankfurt (Germany).