SELECTION OF BACHELOR, MASTER THESES AND MASTER PROJECTS
The HPC group offers several hot and interesting topics for bachelor and master theses, as well as individual projects at the bachelor and master levels in the area of parallel and distributed computing. Come and join our team!
The topics listed below for theses and projects are only examples of what you could work on in our team.
This list is not actively maintained. Interested students should therefore contact us for further details on an existing topic, for updates on new topics, or to discuss a topic of their own interest.
LB4MPI, a Modern MPI Load Balancing Library in C++
Load imbalance across distributed-memory compute nodes is a critical performance degradation factor. The goal of this work is to modernize the code of the DLS4LB library into a C++ MPI load balancing library. The library should be able to handle the distribution of computations as well as the distribution of data. Application data can be centralized, replicated, or distributed. The LB4MPI library should be able to learn the data distribution from the user and adjust it dynamically during execution.
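The core idea behind dynamic load balancing can be sketched compactly. The following is a minimal illustration in Python (not the LB4MPI/DLS4LB API, which is C++/MPI): workers repeatedly pull the next chunk of loop iterations from a shared counter, so faster workers automatically take on more work than slower ones. All names and the chosen speeds are assumptions for illustration.

```python
# Illustrative sketch of dynamic self-scheduling (not the LB4MPI API).
# Workers with different speeds pull fixed-size chunks from a shared counter;
# the resulting makespan is compared against a static block distribution.

def simulate(worker_speeds, n_iters, chunk):
    """Return per-worker finish times under dynamic self-scheduling."""
    next_iter = 0
    busy_until = [0.0] * len(worker_speeds)
    while next_iter < n_iters:
        w = min(range(len(worker_speeds)), key=lambda i: busy_until[i])
        size = min(chunk, n_iters - next_iter)
        next_iter += size
        busy_until[w] += size / worker_speeds[w]  # time to process this chunk
    return busy_until

def static_makespan(worker_speeds, n_iters):
    """Makespan when iterations are split into equal blocks up front."""
    block = n_iters / len(worker_speeds)
    return max(block / s for s in worker_speeds)

speeds = [1.0, 1.0, 0.25]          # one worker is four times slower
dyn = max(simulate(speeds, 1200, 10))
sta = static_makespan(speeds, 1200)
print(f"static makespan: {sta:.0f}, dynamic makespan: {dyn:.0f}")
```

With a static block distribution, the slow worker dominates the makespan; with self-scheduling, the load evens out because chunk assignment follows worker availability.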
Fault Tolerance of Silent Data Corruption (SDC) in Scientific Applications
Silent data corruptions are common in modern HPC systems. SDCs can occur due to bit flips in memory or system buses that do not directly cause a failure of the system but may alter the final result of the application. Replication is an established fault tolerance method. Robust dynamic load balancing (rDLB) is a robust scheduling method for parallel loops that employs replication to tolerate failures and severe perturbations in computing systems. Selective particle replication (SPR) is a method for the detection of silent data corruptions in smoothed particle hydrodynamics (SPH) simulations. The goal of this work is to combine the SPR approach with rDLB, i.e., particles (loop iterations) selected by SPR for replication will be scheduled and load-balanced using rDLB, to achieve an SDC-tolerant, load-balanced, high-performance SPH simulation.
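The detection principle behind combining selective replication with scheduling can be sketched as follows. This is a toy illustration (not the SPR or rDLB code): a subset of tasks is executed twice, and a mismatch between the two replica results flags a silent corruption. The kernel, task IDs, and injected error are assumptions for illustration.

```python
# Illustrative sketch (not the SPR/rDLB implementation): selected "particles"
# are executed twice and the duplicate results compared; a mismatch signals a
# silent data corruption (SDC) in one of the replicas.

def detect_sdc(tasks, selected, compute, corrupt_ids=frozenset()):
    """Run each task; run selected tasks twice and flag mismatching replicas."""
    flagged = []
    results = {}
    for t in tasks:
        r1 = compute(t)
        if t in corrupt_ids:       # injected bit-flip in the primary replica
            r1 += 1e-3
        results[t] = r1
        if t in selected:
            r2 = compute(t)        # in rDLB, the replica runs on another resource
            if abs(r1 - r2) > 1e-9:
                flagged.append(t)
    return results, flagged

work = lambda t: float(t * t)       # stand-in for an SPH interaction kernel
_, flagged = detect_sdc(range(8), selected={2, 5}, compute=work, corrupt_ids={5})
print("SDC detected in tasks:", flagged)   # task 5 was selected and corrupted
```

In the combined approach, the replicas themselves become additional loop iterations that rDLB schedules and load-balances across resources.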
Web Portal Extension for HPC Job Accounting Data Visualization and Analysis
The HPC group of the University of Basel operates a high-performance computing system for research purposes (miniHPC) and a web portal for application performance analysis (PAP: Performance Analysis Portal). The miniHPC stores job accounting data (e.g., resource usage) for all jobs executed on the system; the data is stored in a database on the system itself. This project aims to export or transfer the job accounting data from the miniHPC to the web portal for further analysis and visualization. Having the data accessible in a web-based format will help HPC researchers visualize and analyze HPC job accounting data more efficiently and can expose new insights into HPC job behavior. The objectives of this project are: (a) Export or transfer job accounting data to the web portal. (b) Extend the web portal with the appropriate database technology. (c) Implement functionality to browse, sort, and display HPC job information. (d) Implement functionality to visualize and analyze job accounting data.
Symbol-Based HPC Application Executable Classification
HPC jobs typically contain applications that are to be executed on the HPC system. The applications inside a job are executables (compiled source code) and often have non-descriptive names. HPC system operators and HPC researchers therefore rarely know which application is inside a job, although this knowledge could be beneficial for job scheduling and HPC research. A step towards identifying the application inside a job is symbol-based executable classification. The Linux “nm” command allows us to extract symbols (e.g., function and variable names) from executables. We can use these symbols in combination with text classification to classify executables into application classes. Ideally, we want to return the application name of a given (unknown) executable. The objectives of this project are: (a) Explore the Linux “nm” command that extracts symbols from object files. (b) Use the “nm” command to extract symbols from application executables. (c) Determine which symbols are important or needed to classify executables. (d) Implement text classification to classify executables into application classes.
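Objectives (b)–(d) can be sketched in a few lines. The following is a minimal, hedged illustration: the nm-style output is embedded as a string (the real project would invoke the Linux “nm” command on actual executables), and the "classifier" is a simple Jaccard-similarity match against hypothetical symbol tables; the application names and symbols are assumptions.

```python
# Illustrative sketch: classify an executable by the overlap of its symbol
# set with known applications' symbol sets. The real project would obtain
# the symbols via the Linux "nm" command run on the executable.

def parse_nm(output):
    """Extract symbol names from nm-style output lines: 'address type name'."""
    syms = set()
    for line in output.strip().splitlines():
        parts = line.split()
        if parts:
            syms.add(parts[-1])    # the symbol name is the last column
    return syms

def classify(unknown_syms, known):
    """Return the application class with the highest Jaccard similarity."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(known, key=lambda app: jaccard(unknown_syms, known[app]))

# Hypothetical symbol tables for two application classes (illustration only).
known = {
    "lulesh": {"CalcForceForNodes", "CalcVolumeForceForElems", "main"},
    "minimd": {"compute_force", "neighbor_build", "main"},
}
unknown = parse_nm("""
0000000000401000 T main
0000000000402000 T CalcForceForNodes
0000000000403000 t CalcVolumeForceForElems
""")
print(classify(unknown, known))    # matches the "lulesh" symbol table
```

A real solution would replace the Jaccard match with a trained text classifier and would have to handle stripped binaries, mangled C++ names, and common library symbols that carry no class information.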
HPC Job Visualization with Prometheus and Grafana
The HPC group of the University of Basel operates a high-performance computing system for research purposes (miniHPC) and a web portal for application performance analysis (PAP: Performance Analysis Portal). SLURM provides information about the jobs currently running on the system. This project aims to collect, store, and process SLURM job data through Prometheus (monitoring and time-series database) and subsequently visualize the job data through Grafana. Creating job-level dashboards can help us understand the current state of the HPC system by giving us a better overview of which jobs are being executed and how they use the system. The objectives are: (a) Install and configure Prometheus and Grafana on the HPC group’s miniHPC, (b) collect, process, and store SLURM job data with the Prometheus monitoring tool, (c) visualize SLURM job data through Grafana by creating new dashboards or using existing ones, and (d) assess the benefits of the Prometheus-Grafana setup and the resulting visualization.
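To give an idea of objective (b): Prometheus scrapes metrics in a simple text exposition format, so a SLURM exporter ultimately boils down to rendering job data as labeled samples. The sketch below shows only this formatting step; the metric name, label set, and job fields are assumptions for illustration, not an existing exporter's schema.

```python
# Illustrative sketch: render SLURM-like job records in the Prometheus text
# exposition format, which a Prometheus server could scrape from an exporter.
# Metric and label names here are assumptions, not a standard SLURM exporter.

def to_prometheus(jobs):
    """Format job records as Prometheus gauge samples with labels."""
    lines = ["# TYPE slurm_job_cpus gauge"]
    for job in jobs:
        labels = f'jobid="{job["id"]}",user="{job["user"]}",state="{job["state"]}"'
        lines.append(f'slurm_job_cpus{{{labels}}} {job["cpus"]}')
    return "\n".join(lines)

jobs = [
    {"id": 1042, "user": "alice", "state": "RUNNING", "cpus": 32},
    {"id": 1043, "user": "bob",   "state": "PENDING", "cpus": 0},
]
print(to_prometheus(jobs))
```

In practice one would use an existing SLURM exporter or the official Prometheus client library rather than hand-formatting, but the scraped output has exactly this shape, and Grafana dashboards then query it by label.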
Dynamic loop scheduling at scale
Load imbalance in scientific applications is one of the most significant performance degradation factors. Dynamic loop scheduling (DLS) is essential for improving the performance of applications, especially when scaling to a large number of processing elements. The goal of this work is to examine the performance of various applications with different DLS techniques under strong and weak scaling, and to assess the usefulness and effectiveness of DLS techniques at large scale. Experiments may combine native execution, up to the limits of the available HPC resources, with simulations using our in-house loop scheduling simulator, LoopSim.
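As a small illustration of what a DLS technique looks like (this is not the LoopSim interface): guided self-scheduling (GSS), one of the classic DLS rules, assigns each requesting processor a chunk of ceil(R/P) of the R remaining iterations, producing large chunks early (low overhead) and small chunks late (good balance).

```python
# Illustrative sketch of the guided self-scheduling (GSS) chunk-size rule:
# each chunk covers ceil(R/P) of the R remaining iterations for P processors.

from math import ceil

def gss_chunks(n_iters, n_procs):
    """Return the sequence of chunk sizes produced by GSS."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        c = ceil(remaining / n_procs)
        chunks.append(c)
        remaining -= c
    return chunks

chunks = gss_chunks(100, 4)
print(chunks)           # decreasing chunk sizes: [25, 19, 14, ...]
print(sum(chunks))      # always equals the total iteration count: 100
```

Other DLS techniques (factoring, weighted factoring, adaptive methods) follow the same pattern with different chunk-size rules, which is what makes their comparative evaluation at scale interesting.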
Multi-level robust scheduling
High-performance computing (HPC) systems offer multiple levels of hardware parallelism, e.g., cores, sockets, and nodes. Correspondingly, the HPC software stack usually supports multiple levels of parallelism that map onto these hardware levels, e.g., the thread and process levels. Various scheduling methods are employed at every level of hardware and software parallelism (more information is on the MLS project page). The goal of this project is to use scheduling information from the various levels of parallelism and employ it for fault tolerance.
Algorithms and Experiments for Quantum Computing (Master Thesis)
Quantum computing (QC) is radically different from the conventional computing approach. Based on quantum bits that can be zero and one at the same time, a quantum computer acts as a massively parallel device with an exponentially large number of computations taking place simultaneously. This promises to make problems tractable that are intractable even for the most powerful classical supercomputers. While the physics behind QC was explored a hundred years ago, implementations are still at an early stage of development; however, major companies as well as research funding agencies are currently investing massively in this direction. In this master thesis, you will explore this fascinating field and get hands-on experience with QC simulators and early systems.
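The "zero and one at the same time" idea is easy to see in a tiny state-vector simulation, which is exactly what QC simulators do at scale. The sketch below (a toy illustration, not any particular simulator's API) applies a Hadamard gate to a single qubit starting in |0⟩:

```python
# Illustrative sketch: a minimal state-vector simulation of one qubit.
# The Hadamard gate puts |0> into an equal superposition of |0> and |1>.

from math import sqrt

def hadamard(state):
    """Apply the Hadamard gate to a 1-qubit state [a, b] = a|0> + b|1>."""
    a, b = state
    s = 1 / sqrt(2)
    return [s * (a + b), s * (a - b)]

state = hadamard([1.0, 0.0])            # start in |0>
probs = [amp * amp for amp in state]    # measurement probabilities
print(probs)                            # ~0.5 for each outcome: zero and one at once
```

An n-qubit simulator generalizes this to a state vector of 2^n amplitudes, which is precisely why classical simulation becomes intractable and real quantum hardware interesting.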
What is your name, benchmark scheduler?
There are numerous benchmarks and parallel workloads available in the HPC community, and they are believed to employ very good schedulers. However, the documentation accompanying these workloads does not provide details about the scheduling techniques and algorithms involved. In this thesis, the scheduling algorithms used in HPC workloads will be identified and the findings assessed comparatively.
Identification and analysis of the communication behavior of parallel applications
The execution of applications on parallel computing systems requires that application processes communicate during their execution. Understanding the communication behavior of parallel applications is important for optimizing their parallel execution. The communication patterns can be represented as process graphs (or networks) and/or task graphs. This work involves (1) the identification and classification of communication behavior types from various synthetic and real parallel applications and (2) the investigation of the similarities and differences between the process graphs and the task graphs of single parallel applications. To realize this work, synthetic communication patterns may be developed, and the communication behavior of real applications will be extracted and classified based on their execution traces.
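Step (1) can be sketched as follows: aggregate point-to-point messages from a trace into a weighted process graph and test the graph against a known pattern. The event tuples and the ring check below are illustrative assumptions, not an existing trace format.

```python
# Illustrative sketch: derive a process communication graph from a list of
# synthetic send events (sender, receiver, bytes), as one might extract from
# an execution trace, and check it against a known communication pattern.

from collections import defaultdict

def build_comm_graph(events):
    """Aggregate point-to-point sends into a weighted process graph."""
    graph = defaultdict(int)
    for src, dst, nbytes in events:
        graph[(src, dst)] += nbytes
    return dict(graph)

def is_ring(graph, n_procs):
    """Check whether the edges form a ring over n_procs processes."""
    expected = {(i, (i + 1) % n_procs) for i in range(n_procs)}
    return set(graph) == expected

# Synthetic trace of a 4-process ring exchange, repeated for three iterations.
events = [(i, (i + 1) % 4, 1024) for i in range(4)] * 3
graph = build_comm_graph(events)
print(graph)                 # each ring edge carries 3 * 1024 bytes
print(is_ring(graph, 4))     # the pattern is a nearest-neighbor ring
```

Classifying real applications would compare such graphs against a library of patterns (ring, 2D stencil, all-to-all, master-worker) using graph similarity rather than exact matching.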
From OTF2 traces to the SimGrid toolkit
OTF2 refers to the Open Trace Format (version 2), a format used to store the execution traces of applications as a sequence of events. Understanding the traces helps in analyzing the behavior of applications during execution. The goal is to develop a tool that reads OTF2 trace files as input, extracts the structure of the application and its execution times, and uses this information to build a simulator that simulates the application through the SimGrid simulation framework's programming interfaces. The developed tool will be used to automatically create inputs for simulating the execution of parallel applications by reading their execution traces.
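The structure-extraction step can be illustrated on a simplified textual trace (this is not the OTF2 binary format; the real tool would use the OTF2 reading library, and the replay would target SimGrid's interfaces): ENTER/LEAVE event pairs are matched per rank to recover region durations that a simulator could replay.

```python
# Illustrative sketch of the structure-extraction step: parse a simplified
# textual trace (not real OTF2) into per-rank compute intervals. Each line
# has the assumed form: "<timestamp> <rank> <ENTER|LEAVE> <region>".

def extract_intervals(trace_lines):
    """Pair ENTER/LEAVE events into (rank, region, duration) records."""
    open_events, intervals = {}, []
    for line in trace_lines:
        timestamp, rank, kind, region = line.split()
        if kind == "ENTER":
            open_events[(rank, region)] = float(timestamp)
        elif kind == "LEAVE":
            start = open_events.pop((rank, region))
            intervals.append((rank, region, float(timestamp) - start))
    return intervals

trace = [
    "0.0 rank0 ENTER compute",
    "2.5 rank0 LEAVE compute",
    "0.0 rank1 ENTER compute",
    "3.0 rank1 LEAVE compute",
]
print(extract_intervals(trace))   # per-rank durations for the compute region
```

A real OTF2 trace additionally carries communication events (sends, receives, collectives), which must be extracted alongside compute intervals so the simulator can reproduce both computation and communication.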
Efficient Task Scheduling on Heterogeneous Devices
Scheduling task graphs on heterogeneous devices (CPUs and GPUs) is necessary in modern computing platforms. Moreover, with the variety of workloads that HPC systems need to manage (traditional HPC, ML, and Big Data), we would like to implement efficient task scheduling on CPUs and GPUs. List scheduling is one of the algorithms used to schedule tasks with data dependencies. Tasks with dependencies are scheduled on the available computing resources according to their priority and target platform. We use list scheduling over multiple computing platforms and evaluate how the schedule is affected, altered, and migrated with respect to the tasks' computation platform requirements, the optimal makespan, and the synchronization costs incurred to balance the task-to-resource allocation. The objectives are: (1) Analyze existing task scheduling algorithms with data dependencies on heterogeneous devices, e.g., HeteroPrioDep; (2) Evaluate task scheduling on heterogeneous devices with respect to computation, communication, and synchronization costs.
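The list scheduling idea can be sketched in miniature (this is a simplified greedy variant for illustration, not HeteroPrioDep): tasks are taken in priority order and each is placed on whichever device finishes it earliest, honoring its data dependencies. The task graph, costs, and device names below are assumptions.

```python
# Illustrative sketch of list scheduling on heterogeneous devices: tasks are
# visited in priority (here: topological) order, and each task is placed on
# the device with the earliest finish time, honoring data dependencies.

def list_schedule(tasks, deps, cost, devices):
    """Greedy list scheduling; returns (placement, finish_times, makespan)."""
    ready_at = {d: 0.0 for d in devices}       # when each device becomes free
    finish, place = {}, {}
    for t in tasks:                            # tasks listed dependency-first
        dep_done = max((finish[d] for d in deps.get(t, [])), default=0.0)
        best = min(devices,
                   key=lambda dev: max(ready_at[dev], dep_done) + cost[t][dev])
        start = max(ready_at[best], dep_done)
        finish[t] = start + cost[t][best]
        ready_at[best] = finish[t]
        place[t] = best
    return place, finish, max(finish.values())

# Hypothetical diamond task graph A -> {B, C} -> D; the GPU favors B, the CPU C.
cost = {"A": {"cpu": 2, "gpu": 2},
        "B": {"cpu": 8, "gpu": 2},
        "C": {"cpu": 3, "gpu": 6},
        "D": {"cpu": 2, "gpu": 2}}
deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
place, finish, makespan = list_schedule(["A", "B", "C", "D"], deps, cost,
                                        ["cpu", "gpu"])
print(place, makespan)   # B lands on the GPU, C on the CPU
```

Algorithms such as HEFT or HeteroPrioDep refine this skeleton with better priority ranks and with communication costs on the edges, which is where the evaluation in objective (2) begins.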