BACHELOR, MASTER THESES AND INDIVIDUAL MASTER PROJECTS
The HPC group offers several hot and interesting topics for bachelor and master theses, as well as individual projects at the bachelor and master levels in the area of parallel and distributed computing. Come and join our team!
The topics listed below are only examples of what you could work on in our team.
This list is not actively maintained. Interested students should therefore contact us for further details on an existing topic, for updates on new topics, or to discuss a topic of their own interest.
Self-scheduling for tasks in OpenMP
Scientific applications are the cornerstone of computational sciences. OpenMP is usually used to express and exploit parallelism in scientific applications on shared-memory compute nodes. Parallelism can be expressed in OpenMP as tasks or as parallel loops. While loop scheduling is well studied and understood, there are still some points to explore in OpenMP task scheduling. Specifically, there is no tool for visualizing the execution of OpenMP tasks (with task IDs) to examine their scheduling. Also, comparing the performance of applications using loops versus tasking, i.e., converting loops into tasks with decreasing chunk sizes according to various loop scheduling techniques, has not been attempted before. Lastly, changing the OpenMP runtime library to improve its task scheduling strategies would be of great value for modern scientific applications.
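The idea of converting a loop into tasks with decreasing chunk sizes can be illustrated with a small sketch. It computes the chunk sizes that two well-known loop scheduling techniques would produce: guided self-scheduling (GSS) and factoring with a factor of two (FAC2). Both techniques are chosen here for illustration only; the function names are hypothetical. In the scenario described above, each chunk would become one OpenMP task.

```python
import math


def gss_chunks(n_iters, n_workers):
    """Guided self-scheduling: each chunk is ceil(remaining / workers)."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        size = math.ceil(remaining / n_workers)
        chunks.append(size)
        remaining -= size
    return chunks


def fac2_chunks(n_iters, n_workers):
    """Factoring (FAC2): batches of n_workers equal chunks.

    Each batch covers roughly half of the iterations that remain.
    """
    chunks, remaining = [], n_iters
    while remaining > 0:
        size = max(1, math.ceil(remaining / (2 * n_workers)))
        for _ in range(n_workers):
            if remaining == 0:
                break
            take = min(size, remaining)
            chunks.append(take)
            remaining -= take
    return chunks


if __name__ == "__main__":
    # Illustrative problem size: 100 loop iterations, 4 workers.
    print("GSS :", gss_chunks(100, 4))
    print("FAC2:", fac2_chunks(100, 4))
```

Both functions return chunk sizes that never increase and that sum to the total number of iterations, which is the property the comparison between loops and tasks would rely on.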
HPC and BigData collocation: An evaluation study of the MLS prototype
HPC and BigData workloads have different characteristics. On the one hand, HPC workloads mostly contain rigid jobs. Rigid jobs have a very restricted set of requirements, and they cannot start until the required resources are available. Moreover, the allocation of a rigid job cannot be changed during its execution. On the other hand, BigData workloads mostly contain malleable jobs. The resource allocation of malleable jobs can be changed during their execution. Convergence between HPC and BigData is inevitable and is already happening at the hardware and software layers. In this master thesis, we aim to study the feasibility and performance impact of collocating HPC and BigData workloads on an HPC cluster called miniHPC. This master thesis will closely follow the earlier research effort that proposed and implemented a framework to connect the traditional resource and job management systems (RJMSs) of both HPC and BigData systems.
Simulating Fluids in SPH Without SPH: a Machine Learning Take on Fluid Dynamics
Machine learning is a flourishing field due to its recent successes in beating human accuracy for classification tasks. With the right design, these algorithms can outperform other methods in many classification, prediction, segmentation, and even content generation tasks. These models operate in two phases: training and inference. To achieve their peak performance, these algorithms require long training times. Using hardware accelerators and High Performance Computing (HPC) is a necessary step to achieve efficient training times. However, once trained, their inference times are relatively short. We will use machine learning algorithms to predict the result of simulated physics phenomena. Simulating these phenomena is computationally intensive. Therefore, we would like to train a model that will rapidly estimate the evolution of three physics experiments that are supported by SPH-EXA. SPH-EXA is a simulation framework developed at the University of Basel (the HPC group at DMI in collaboration with sciCORE) to perform hydrodynamics simulations for cosmology and astrophysics. It will be used to generate the training data and the ground truth for validation. The model will be trained to simulate the experiments supported by SPH-EXA, namely: a Sedov explosion, a wind blob, and a square patch.
Simplifying Coroutines for C++20
Coroutines are essential for “clean” concurrent programming. C++20 added coroutine support. The library to be developed in this thesis should extend the coroutine support of the C++ standard library to provide: (1) simple generators, (2) awaitable tasks, where a task can be run as a separate thread or sequentially using the library, and (3) concurrency management. It should also provide write primitives for concurrent I/O using the library. The objectives of this thesis are to: (1) design and implement a standard library extension for coroutines, and (2) evaluate the performance of the library against cppcoro. The objectives can be achieved by writing a test case that uses, for instance, the generators of the coroutines library, and by comparing the standard C++ implementation, the cppcoro implementation, and our implementation.
Support for Static Data in Performance Analysis Portal
Current HPC system designs strive for more performance by relying on larger numbers of processing units that work in parallel, instead of further developing the speed of an individual processing unit. To exploit the parallelism provided by these HPC systems, scientific applications increasingly use parallel programming paradigms. The performance analysis of applications programmed with such models can help to solve large-scale scientific problems more efficiently. To facilitate performance analysis of parallel applications, we implemented the Performance Analysis Portal (PAP). PAP is a Web-based tool that allows the storage and analysis of data collected from parallel application execution. PAP provides different interfaces that allow the user to insert data into a NoSQL database (MongoDB) and that allow the visualization and analysis of such data. PAP was initially designed to store and provide analysis for performance data collected from the execution of parallel applications. However, important information can also be automatically collected just by examining the application code without execution (static analysis). Static analysis can provide information such as the usage of MPI routines or calls, usage of OpenMP, OpenACC, or CUDA, among others. Therefore, PAP requires an extension to allow the storage and analysis of static data collected from parallel applications.
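As a rough illustration of the kind of static data meant here, the sketch below scans source code text, without executing it, and counts MPI calls and OpenMP directives. The regular expressions and the shape of the resulting record are assumptions made for this example; they are not part of PAP. A dictionary is used as the result because it maps naturally onto a document in a NoSQL database.

```python
import re
from collections import Counter

# Assumption: an MPI call looks like "MPI_Name(" in C source code.
MPI_CALL = re.compile(r"\bMPI_[A-Z][A-Za-z_]*(?=\s*\()")
# Assumption: an OpenMP directive is a line starting with "#pragma omp <word>".
OMP_PRAGMA = re.compile(r"^\s*#\s*pragma\s+omp\s+(\w+)", re.MULTILINE)


def static_summary(source):
    """Return counts of MPI calls and OpenMP directives found in the source text."""
    return {
        "mpi_calls": dict(Counter(MPI_CALL.findall(source))),
        "omp_directives": dict(Counter(OMP_PRAGMA.findall(source))),
    }


# A tiny made-up C program used only as input for the demonstration.
SAMPLE = """
#include <mpi.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {}
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
}
"""

if __name__ == "__main__":
    print(static_summary(SAMPLE))
```

A real static analysis would need a proper parser to handle comments, macros, and Fortran sources; the regular expressions here only convey the idea.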
The Sound of Computing
Auditory perception has advantages in temporal, spatial, amplitude, and frequency resolution that open possibilities as an alternative or complement to visualization techniques. The CPU usage graph is one example of a visualization technique to monitor and represent an application or a whole computer system. Sonification of the same CPU usage data can complement the understanding and representation of data and open new paths for data analysis. This analysis can include but is not limited to the identification of performance drops, idle time, scheduling, and load balancing issues (processing elements playing “out of tune”). Researchers are constantly looking for the best way to represent their data and extract useful insights. Sonification puts HPC data into a non-traditional format and provides a relatively unexplored approach to convey information and perceptualize data. Data and different metrics can be collected from an in-house HPC system. Sonification can be achieved with an open-source solution or a custom program. TwoTone is a good starting point for experimenting with the concept of sonification.
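A custom sonification program could start from something like the following minimal sketch, which maps CPU usage samples to pitch. The linear mapping onto MIDI note numbers, the note range, and the function names are all assumptions made for illustration; other mappings (for example to loudness or tempo) are equally possible.

```python
import math


def usage_to_frequency(usage_percent, low_note=57, high_note=81):
    """Map CPU usage in [0, 100] to a frequency in Hz.

    Usage is mapped linearly onto MIDI note numbers (57 is A3, 81 is A5),
    and the note is converted to Hz with the equal-temperament formula.
    """
    usage = min(max(usage_percent, 0.0), 100.0)
    note = low_note + (high_note - low_note) * usage / 100.0
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def sonify(usages, seconds_per_sample=0.25, rate=8000):
    """Return audio samples in [-1, 1]: one sine tone per usage value."""
    samples = []
    n = int(seconds_per_sample * rate)
    for usage in usages:
        freq = usage_to_frequency(usage)
        samples.extend(math.sin(2 * math.pi * freq * t / rate) for t in range(n))
    return samples


if __name__ == "__main__":
    # Made-up usage trace: an idle phase followed by a busy phase.
    trace = [5, 10, 95, 100]
    print([round(usage_to_frequency(u), 1) for u in trace])
    print(len(sonify(trace)), "audio samples")
```

The resulting samples could be written to a WAV file with Python's standard `wave` module. With one tone sequence per processing element, an idle or overloaded element would stand out by its pitch.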
Scheduling of Applications in the Cloud
Many cloud service providers, such as Amazon, Microsoft, and Google, offer compute hours on their servers with very good performance. Recently, Google Cloud announced support for HPC workloads and efficient communication on cloud clusters with the Message Passing Interface (MPI). Additionally, many cloud service providers offer free computing hours, especially for university accounts. This makes cloud services a very attractive option for running small workloads and benchmarks to test their performance and scalability on cloud resources. However, cloud computing systems are usually built from commodity (consumer-level) hardware rather than high-performance components. Also, cloud resources are shared among the users of the cloud, and providers do not guarantee exclusive access to resources as HPC centers do. Therefore, the goal of this thesis is to evaluate the performance of HPC workloads on the cloud.
Dynamic loop self-scheduling on distributed-memory systems with Python
Loops are considered the primary source of parallelism in various scientific applications. Scheduling loop iterations across multiple computing resources is a challenging task, i.e., the execution must be balanced across all computing resources. Several factors can hinder such a balanced execution and, consequently, degrade application performance. Specifically, problem characteristics, non-uniform input data sets, as well as algorithmic and systemic variations lead to different execution times of each loop iteration. Dynamic loop self-scheduling (DLS) techniques mitigate such factors. DLS techniques were originally devised for shared-memory systems. A recently developed MPI library, called LB4MPI, enables the use of various DLS techniques on distributed-memory systems. LB4MPI has two versions: one for C and one for Fortran programs. C and Fortran are often used to write scientific applications such as weather forecasting and N-body simulations. At the same time, Python has emerged over the last couple of decades as a first-class data science tool. This project aims to design and implement a Python version of, or interface to, the existing LB4MPI library.
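Why self-scheduling helps when iteration times differ can be seen in a small simulation. The sketch below does not use MPI or the LB4MPI interface; it is a plain Python model, under the simplifying assumption of zero scheduling overhead, in which idle workers repeatedly take the next chunk from a central queue. All names are hypothetical.

```python
import heapq


def static_makespan(costs, n_workers):
    """Static block scheduling: each worker gets one contiguous block."""
    block = -(-len(costs) // n_workers)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def self_scheduling_makespan(costs, n_workers, chunk=1):
    """Self-scheduling: the earliest idle worker takes the next chunk."""
    # Heap of (time at which the worker becomes idle, worker id).
    ready = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(ready)
    finish = 0.0
    for start in range(0, len(costs), chunk):
        idle_at, worker = heapq.heappop(ready)
        idle_at += sum(costs[start:start + chunk])
        finish = max(finish, idle_at)
        heapq.heappush(ready, (idle_at, worker))
    return finish


if __name__ == "__main__":
    # Made-up workload: iteration i takes i time units, so late blocks are heavy.
    costs = [float(i) for i in range(1, 17)]
    print("static         :", static_makespan(costs, 4))
    print("self-scheduling:", self_scheduling_makespan(costs, 4))
```

With this workload the static schedule is limited by the worker that owns the most expensive block, while self-scheduling ends closer to the ideal of total work divided by the number of workers. Real DLS techniques additionally trade chunk size against scheduling overhead, which this model ignores.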
Support of Centralized Data in DLS4LB
DLS4LB is an MPI-based load balancing library. It is implemented in the C and Fortran (F90) programming languages to support scientific applications executed on HPC systems. DLS4LB improves the performance of applications by employing DLS techniques for load balancing of loops across distributed-memory compute nodes. The DLS4LB library supports fourteen scheduling techniques. DLS4LB currently requires that application data be replicated on all compute nodes, as it only handles the distribution of the computations. The goal of this work is to extend the DLS4LB library to support the distribution of data together with computations from a centralized queue at the master node.
LB4MPI, a Modern MPI Load Balancing Library in C++
Load imbalance across distributed-memory compute nodes is a critical performance degradation factor. The goal of this work is to modernize the code of the DLS4LB library into a C++ MPI load balancing library. The library should be able to handle the distribution of computations as well as the distribution of data. Application data can be centralized, replicated, or distributed. The LB4MPI library should be able to learn the data distribution from the user and to adjust this distribution dynamically during execution.
Fault Tolerance of Silent Data Corruption (SDC) in Scientific Applications
Silent data corruptions (SDCs) are very common in modern HPC systems. SDCs can occur due to bit flips in memory or system buses that do not directly cause a failure of the system but could alter the final result of the application. Replication is an established fault tolerance method. Robust dynamic load balancing (rDLB) is a robust scheduling method for parallel loops that employs replication to tolerate failures and severe perturbations in computing systems. Selective particle replication (SPR) is a method for the detection of silent data corruptions in smoothed particle hydrodynamics (SPH) simulations. The goal of this work is to combine the SPR approach with rDLB, i.e., particles (loop iterations) selected by SPR for replication will be scheduled and load balanced using rDLB, to achieve an SDC-tolerant, load-balanced, high-performance SPH simulation.
Dynamic loop scheduling at scale
Load imbalance is one of the most significant performance degradation factors in scientific applications. Dynamic loop scheduling (DLS) is essential to improve the performance of applications, especially when scaling to a large number of processing elements. The goal of this work is to examine the performance of various applications with different DLS techniques under strong and weak scaling, and to assess the usefulness and effectiveness of DLS techniques at large scale. Experiments could combine native experimentation, up to the limit of the available HPC resources, with simulations using our in-house loop scheduling simulator LoopSim.
Multi-level robust scheduling
High performance computing (HPC) systems offer multiple levels of parallelism, e.g., cores, sockets, and nodes. In turn, the HPC software stack usually supports multiple levels of parallelism corresponding to the hardware levels of parallelism, e.g., thread and process levels. Various scheduling methods are employed at every level of hardware and software parallelism (more information is on the MLS project page). The goal of this project is to use scheduling information from various levels of parallelism and employ it for fault tolerance.
Improving the Performance of an appMRI Hippocampus Volume Analyzer (HVA) (Master Thesis / Project) – Co-supervision with MIAC
Currently, the processing time for creating an appMRI HVA report is approximately 3 hours, excluding the human quality control. The main goal of this project is to study and understand how the algorithm (FreeSurfer 5.3 or FreeSurfer 6.0 – latest release) could be improved so that the computation time to calculate the volume of the hippocampus is significantly decreased. The appMRI HVA algorithm (FreeSurfer 5.3) already relies on OpenMP parallelization to speed up some operations. Ideally, we would like to identify and modify additional routines that could benefit from this approach. Additionally, we would like to identify the optimal configuration of an appMRI cluster node (number of threads / cores per job, CPU, memory, etc.) to obtain the best performance of the appMRI HVA infrastructure.
Algorithms and Experiments for Quantum Computing (Master Thesis)
Quantum computing (QC) is radically different from the conventional computing approach. Based on quantum bits that can be zero and one at the same time, a quantum computer acts as a massively parallel device with an exponentially large number of computations taking place at the same time. This could make tractable some problems that are intractable even for the most powerful classical supercomputers. While the physics behind QC was explored a hundred years ago, implementations are still in an early stage of development. However, major companies as well as research funding agencies are currently investing heavily in this direction. In the master thesis, you will explore this fascinating field and get hands-on experience with QC simulators and early systems.
FPGA-Based Accelerators for High-Performance Computing (Bachelor / Master Thesis)
Field-programmable devices such as Field-Programmable Gate Arrays (FPGAs) are a hybrid of hardware and software. These integrated circuits consist of thousands of basic computing blocks that offer both hardware acceleration and application-specific programmability. Thus, FPGA devices can act as accelerators: compute-intensive program parts are executed on the FPGA co-processor, while run-time organization and other program parts run on a standard CPU. In this thesis, you will study the potential of using FPGAs in High Performance Computing by comparing their performance against standard CPUs for specific applications (e.g., machine learning).
A parallel debugger for MPI applications
An open-source MPI debugger is a step on the road to educational parallel debuggers, customized debuggers, and license-free debuggers. Serial debuggers like gdb or lldb have a machine interface (GDB/MI or LLDB/MI) that is used by many debuggers in different IDEs. In this work, you will integrate them so that they behave as shown in the demo, using your own user-friendly GUI application.
A visualizer for job logs in batch systems for high performance computing clusters
The goal is to convert batch system logs from file to database style, in order to help organizations share their data for commercial or scientific purposes with less effort and with more privacy options. This work includes implementing a tool that collects batch system logs and visualizes the utilization statistics, converting batch system logs from file to database style, and implementing a user-friendly web interface for viewing usage statistics.
What is your name, benchmark scheduler?
There are numerous benchmarks and parallel workloads available in the HPC community. They are believed to employ very good schedulers. However, the documentation accompanying these workloads does not provide details about the scheduling techniques / algorithms involved. In this thesis, the scheduling algorithms used in HPC workloads will be identified, and the findings will be assessed comparatively.
A visualization tool for job schedulers in HPC
Build a tool that visualizes the status of the jobs and of the scheduler queues on an HPC system. The tool uses information from qstat, qhost, and qquota, together with information available from batch logs, to build the information to be displayed.
Performance comparison of parallel programming paradigms on miniHPC
Study the performance of different parallel programming models on miniHPC: explore which programming model performs best, explain the performance obtained from the benchmarks and how it relates to the architecture / software stack, optimize the compilation of the benchmarks for the miniHPC architecture, and possibly tune the benchmarks to achieve the best possible performance in every programming model.
Performance engineering with stencil kernels and codes
The stencil is an important and widely used computational motif. For this reason, researchers try to optimize its computation, and several stencil compilers have been implemented. The focus of your thesis is to study two of them: PLUTO (v 0.11) and Girih. The former exploits the polyhedral model to optimize loops with affine transformations, whereas the latter is mainly used to develop and analyze the performance of Multi-core Wavefront Diamond (MWD) tiling techniques, which are used to perform temporal blocking. Once a test case has been implemented and an experiment executed, the results should be compared against the Roofline model and the ECM model in order to understand how the approaches exploit the available hardware.
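For orientation, the basic Roofline bound that measured results would be compared against can be written in a few lines. In its simplest form, the model caps attainable performance at the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The machine numbers and kernel intensities below are made up for illustration and do not describe any real system; the ECM model is more detailed and is not covered by this sketch.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, intensity_flop_per_byte):
    """Basic Roofline bound: limited either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_per_s * intensity_flop_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_per_s):
    """Arithmetic intensity (flop/byte) at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gb_per_s


# Hypothetical machine parameters, chosen only to make the arithmetic easy.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_PER_S = 100.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
    # Hypothetical arithmetic intensities for two kinds of kernels.
    for name, intensity in [("low-intensity stencil", 0.25),
                            ("compute-heavy kernel", 20.0)]:
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_PER_S, intensity)
        kind = "memory-bound" if intensity < ridge else "compute-bound"
        print(f"{name}: at most {bound:.1f} GFLOP/s ({kind})")
```

Temporal blocking, as performed by MWD tiling, aims to raise the effective arithmetic intensity of a stencil and thereby move it to the right on the Roofline plot.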
Development of a stencil kernel and application benchmark
The stencil computational pattern is representative of several numerical codes, where it usually accounts for an important part of the execution time. In these codes, the stencil part spans from 2-dimensional to 3-dimensional grids and from low order to high order, with varying arithmetic intensity. Several tools and implementations are available. The question you are going to answer is: “Given a stencil belonging to a certain category, what is the best choice for its compilation?” You will implement a test case from each category in OpenMP, PLUTO and PATUS (the latter two being stencil compilers) and benchmark the produced outputs.
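For readers new to the pattern, the following sketch shows one sweep of a 2-dimensional, low-order (5-point) Jacobi stencil. It is written in plain Python purely to show the access pattern; the actual test cases would be written in C with OpenMP or generated by the stencil compilers. The function name is hypothetical.

```python
def jacobi_sweep(grid):
    """One sweep of a 2D 5-point Jacobi stencil.

    Each interior point becomes the average of its four neighbours.
    Boundary points are left unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # separate output grid (Jacobi, not Gauss-Seidel)
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


if __name__ == "__main__":
    # Made-up input: hot boundary (1.0) around a cold interior (0.0).
    grid = [[1.0] * 5] + [[1.0, 0.0, 0.0, 0.0, 1.0] for _ in range(3)] + [[1.0] * 5]
    for row in jacobi_sweep(grid):
        print(row)
```

Each update reads four neighbours and writes one value, which is why low-order stencils have low arithmetic intensity. Higher-order and 3-dimensional stencils read more neighbours per point.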
Managing of shared experiment workspaces among different HPC systems
Conducting research experiments in Computational Science is not only a matter of writing code but also of configuring the software used for running it on complex high performance computing (HPC) systems. Manually configuring the software often leads to experiments that are non-reproducible in terms of either pure execution or final results. Furthermore, a key aspect for a scientist who carries out an experiment is the possibility to collaborate in a simple and effective way with another scientist. This can be more difficult when using HPC systems: an HPC system is usually a closed environment that, unless specially configured, is accessible only through a local account. Such a local account cannot be used for accessing a different system and, most likely, will not give full control of the machine (e.g., installing new software). The HPC Group (formerly HPWC) is currently developing a framework called “PROVA!” with the aim of managing and sharing HPC experiments to foster collaborative research.
The scope of the thesis is to analyze the pros and cons of different approaches to shared workspaces in order to propose a solution suitable for the HPC field and to integrate it into “PROVA!”.
Identification and analysis of the communication behavior of parallel applications
The execution of applications on parallel computing systems requires that application processes communicate during their execution. Understanding the communication behavior of parallel applications is important for optimizing their parallel execution. The communication patterns can be represented as process graphs (or networks) and/or task graphs. This work involves (1) the identification and classification of communication behavior types from various synthetic and real parallel applications and (2) the investigation of the similarities and differences between the process graphs and the task graphs of single parallel applications. To realize this work, synthetic communication patterns may be developed, and the communication behavior of real applications will be extracted and classified based on their execution traces.
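As a minimal sketch of how a process graph could be built and classified, the code below aggregates point-to-point messages into weighted edges and then checks the edge set against two simple reference patterns. The event format (sender rank, receiver rank, bytes) and the two pattern classes are assumptions made for this example; real traces contain far more detail, and real classification would need to tolerate noise.

```python
from collections import defaultdict


def process_graph(events):
    """Aggregate (src, dst, nbytes) message events into weighted directed edges."""
    graph = defaultdict(int)
    for src, dst, nbytes in events:
        graph[(src, dst)] += nbytes
    return dict(graph)


def classify(graph, n_procs):
    """Very rough classification by exact comparison with reference edge sets."""
    edges = set(graph)
    ring = {(r, (r + 1) % n_procs) for r in range(n_procs)}
    all_to_all = {(a, b) for a in range(n_procs)
                  for b in range(n_procs) if a != b}
    if edges == ring:
        return "ring"
    if edges == all_to_all:
        return "all-to-all"
    return "other"


if __name__ == "__main__":
    # Synthetic pattern: four ranks passing 8-byte messages around a ring, twice.
    events = [(r, (r + 1) % 4, 8) for r in range(4)] * 2
    graph = process_graph(events)
    print(graph)
    print(classify(graph, 4))
```

A synthetic pattern such as the ring above gives a known ground truth against which a classifier for real application traces could be checked.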
From OTF2 traces to the SimGrid toolkit
OTF2 refers to the Open Trace Format (version 2), a format used to store the execution traces of applications as a sequence of events. Understanding the traces helps in analyzing the behavior of the applications during execution. The goal is to develop a tool that reads OTF2 trace files as input, extracts the structure of the application and its execution times, and uses this information to develop a simulator that simulates the application using the programming interfaces of the SimGrid simulation framework. The developed tool will be used to automatically create inputs for simulating the execution of parallel applications by reading their execution traces.