The HPC group offers several hot and interesting topics for bachelor and master theses, as well as individual projects at the bachelor and master levels, in the area of parallel and distributed computing. Come and join our team!

Check out the Completed Theses and Student Projects for earlier work.

The list of thesis and project topics below is only a sample of what you could work on in our team.
This list is not actively maintained. Interested students should therefore contact us for further details on an existing topic, for updates on new hot topics, or to discuss a topic of their own interest.


Self-scheduling for Tasks in OpenMP

Scientific applications are the cornerstone of computational sciences. OpenMP is commonly used to express and exploit parallelism in scientific applications on shared-memory compute nodes. In OpenMP, parallelism can be expressed with tasks or parallel loops. While loop scheduling is well studied and understood, several aspects of OpenMP task scheduling remain to be explored. Specifically, there is no tool for visualizing the execution of OpenMP tasks (with task IDs) to examine their scheduling. Also, comparing the performance of applications using loops versus tasking, i.e., converting loops into tasks with decreasing chunk sizes according to various loop scheduling techniques, has not been attempted before. Lastly, changing the OpenMP runtime library to improve its task scheduling strategies would be of great value for modern scientific applications. Preliminary work and results regarding task versus loop self-scheduling can be found here.


Support of Centralized Data Distribution in DLS4LB

DLS4LB is an MPI-based load balancing library. It is implemented in the C and Fortran (F90) programming languages to support scientific applications executed on HPC systems. DLS4LB improves the performance of applications by employing dynamic loop scheduling (DLS) techniques to balance the load of loops across distributed-memory compute nodes. The DLS4LB library supports fourteen scheduling techniques. DLS4LB currently requires that application data be replicated on all compute nodes, as it only handles the distribution of the computations. The goal of this work is to extend the DLS4LB library to support the distribution of data along with the computations from a centralized queue at the master node.


LB4MPI, a Modern MPI Load Balancing Library in C++

Load imbalance across distributed-memory compute nodes is a critical performance degradation factor. The goal of this work is to modernize the code of the DLS4LB library into a C++ MPI load balancing library. The library should be able to handle the distribution of computations as well as the distribution of data. Application data can be centralized, replicated, or distributed. The LB4MPI library should be able to learn the data distribution from the user and to adjust this distribution dynamically during execution.


Fault Tolerance of Silent Data Corruption (SDC) in Scientific Applications

Silent data corruptions are very common in modern HPC systems. SDCs can occur due to bit flips in memory or system buses that do not directly cause a failure of the system but may alter the final result of the application. Replication is an established fault tolerance method. Robust dynamic load balancing (rDLB) is a robust scheduling method for parallel loops that employs replication to tolerate failures and severe perturbations in computing systems. Selective particle replication (SPR) is a method for detecting silent data corruptions in smoothed particle hydrodynamics (SPH) simulations. The goal of this work is to combine the SPR approach with rDLB, i.e., particles (loop iterations) selected by SPR for replication will be scheduled and load-balanced using rDLB, to achieve an SDC-tolerant, load-balanced, high-performance SPH simulation.


Dynamic Loop Scheduling at Scale

Load imbalance in scientific applications is one of the most significant performance degradation factors. Dynamic loop scheduling (DLS) is essential for improving the performance of applications, especially when scaling to a large number of processing elements. The goal of this work is to examine the performance of various applications with different DLS techniques during strong and weak scaling, and to assess the usefulness and effectiveness of DLS techniques at large scale. Experiments can be performed natively, up to the limit of the available HPC resources, and via simulation using our in-house loop scheduling simulator, LoopSim.


Multi-level Robust Scheduling 

High performance computing (HPC) systems offer multiple levels of parallelism, e.g., cores, sockets, and nodes. In turn, the HPC software stack usually supports multiple levels of parallelism corresponding to the hardware levels, e.g., the thread and process levels. Various scheduling methods are employed at every level of hardware and software parallelism (more information is on the MLS project page). The goal of this project is to collect scheduling information from the various levels of parallelism and employ it for fault tolerance.


Improving the Performance of an appMRI Hippocampus Volume Analyzer (HVA) (Master Thesis / Project) – Co-supervision with MIAC

Currently, the processing time for creating an appMRI HVA report is approximately 3 hours, excluding the human quality control. The main goal of this project is to study and understand how the algorithm (FreeSurfer 5.3 or FreeSurfer 6.0 – latest release) could be improved so that the computation time to calculate the volume of the hippocampus is significantly decreased. The appMRI HVA algorithm (FreeSurfer 5.3) already relies on OpenMP parallelization to speed up some operations. Ideally, we would like to identify and modify new routines that could benefit from this approach. Additionally, we would like to identify the optimal configuration of an appMRI cluster node (number of threads/cores per job, CPU, memory, etc.) to obtain the best performance of the appMRI HVA infrastructure.


Algorithms and Experiments for Quantum Computing (Master Thesis)

Quantum computing (QC) is radically different from the conventional computing approach. Based on quantum bits that can be zero and one at the same time, a quantum computer acts as a massively parallel device with an exponentially large number of computations taking place simultaneously. This will make problems tractable that are intractable even for the most powerful classical supercomputers. While the physics behind QC was explored a hundred years ago, implementations are still at an early stage of development, yet major companies as well as research funding agencies are currently investing massively in this direction. In this master thesis you will explore this fascinating field and get hands-on experience with QC simulators and early systems.


FPGA-Based Accelerators for High-Performance Computing (Bachelor / Master Thesis)

Field-programmable devices such as Field-Programmable Gate Array (FPGA) technology are a hybrid of hardware and software. These integrated circuits consist of thousands of basic computing blocks that offer both hardware acceleration and application-specific programmability. Thus, FPGA devices can act as accelerators: compute-intensive program parts are executed on the FPGA co-processor, while run-time organization and other program parts run on a standard CPU. In this thesis you will study the potential of using FPGAs in high performance computing by comparing their performance against standard CPUs for specific applications (e.g., machine learning).


A parallel debugger for MPI applications

An open-source MPI debugger is a step on the road to educational parallel debuggers, customized debuggers, and license-free debuggers. Serial debuggers such as gdb and lldb provide a machine interface (GDB/MI and LLDB/MI) that is used by many debuggers in different IDEs. The task is to integrate these interfaces into a user-friendly GUI application that behaves as shown in the demo.


A visualizer for job logs in batch systems for high performance computing clusters

Converting batch system logs from flat files to a database format helps organizations share their data for commercial or scientific purposes with less effort and with more privacy options. This work includes implementing a tool that collects batch system logs and visualizes the utilization statistics, converting batch system logs from flat files to a database format, and implementing a user-friendly web interface for viewing usage statistics.


What is your name, benchmark scheduler?

There are numerous benchmarks and parallel workloads available in the HPC community, and they are believed to employ very good schedulers. However, the documentation accompanying these workloads does not provide details about the scheduling techniques/algorithms involved. During this thesis, the scheduling algorithms used in HPC workloads will be identified and the findings assessed comparatively.


A visualization tool for job schedulers in HPC

Build a tool that visualizes the status of the jobs and the scheduler queues on an HPC system, using information from qstat, qhost, and qquota, as well as information available from the batch logs.


Performance comparison of parallel programming paradigms on miniHPC

Study the performance of different parallel programming models on miniHPC: explore which programming model performs best, explain the performance obtained from the benchmarks and how it relates to the architecture/software stack, optimize the compilation of the benchmarks for the miniHPC architecture, and possibly tune the benchmarks to achieve the best possible performance in every programming model.


Performance engineering with stencil kernels and codes

Stencil computation is an important and widely used motif; for this reason, researchers try to optimize it, and several stencil compilers have been implemented. The focus of your thesis is to study two of them: PLUTO (v0.11) and Girih. The former exploits the polyhedral model to optimize loops with affine transformations, whereas the latter is mainly used to develop and analyze the performance of Multi-core Wavefront Diamond (MWD) tiling techniques, which perform temporal blocking. Once a test case is implemented and the experiments are executed, the results should be compared against the Roofline model and the ECM model in order to understand how each approach exploits the available hardware.


Development of a stencil kernel and application benchmark

The stencil computational pattern is representative of several numerical codes, where it usually accounts for an important part of the execution time. In such codes, the stencil part spans from 2-dimensional to 3-dimensional grids and from high-order to low-order schemes, with varying arithmetic intensity. Several tools and implementations are available. The question you are going to answer is: "Given a stencil belonging to a certain category, what is the best choice for its compilation?" You will implement a test case from each category in OpenMP, PLUTO, and PATUS (the latter two being stencil compilers) and benchmark the produced outputs.


Managing of shared experiment workspaces among different HPC systems

Conducting research experiments in computational science is not only a matter of writing code but also of configuring the software used for running it on complex high performance computing (HPC) systems. Manually configuring the software often leads to non-reproducible experiments, in terms of either pure execution or final results. Furthermore, a key aspect for a scientist carrying out an experiment is the possibility to collaborate simply and effectively with other scientists; this can be more difficult when using HPC systems: an HPC system is usually a closed environment, accessible, barring special configuration, only via local accounts. Such a local account cannot be used to access a different system and, most likely, will not grant full control of the machine (e.g., installing new software). The HPC Group (formerly HPWC) is currently developing a framework called "PROVA!" with the aim of managing and sharing HPC experiments to foster collaborative research.
The scope of the thesis is to analyze the pros and cons of different approaches to shared workspaces, in order to propose a solution suitable for the HPC field and integrate it into "PROVA!".


Identification and analysis of the communication behavior of parallel applications

The execution of applications on parallel computing systems requires that application processes communicate during their execution. Understanding the communication behavior of parallel applications is important for optimizing their parallel execution. The communication patterns can be represented as process graphs (or networks) and/or task graphs. This work involves (1) the identification and classification of communication behavior types from various synthetic and real parallel applications and (2) the investigation of the similarities and differences between the process graphs and the task graphs of individual parallel applications. To realize this work, synthetic communication patterns may be developed, and the communication behavior of real applications will be extracted and classified based on their execution traces.


From OTF2 traces to the SimGrid toolkit

OTF2 refers to the Open Trace Format (version 2), a format used to store the execution traces of applications as a sequence of events. Understanding the traces helps in analyzing the behavior of applications during execution. The goal is to develop a tool that reads OTF2 trace files as input, extracts the structure of the application and its execution times, and uses this information to build a simulator that simulates the application using the SimGrid simulation framework's programming interfaces. The developed tool will be used to automatically create inputs for simulating the execution of parallel applications from their execution traces.