Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning


Coordinator: Know-Center GmbH, Austria

Partners: 13 partners from 7 European countries

Know-Center GmbH (Austria), AVL List GmbH (Austria), Deutsches Zentrum für Luft- und Raumfahrt e.V. (Germany), ETHZ (Switzerland), Hasso Plattner Institute (Germany), ICCS (Greece), Infineon Technologies (Austria), Intel (Poland), IT University Copenhagen (Denmark), KAI GmbH (Austria), TU Dresden (Germany), University of Maribor (Slovenia), University of Basel (Switzerland).

Funding agency: European Union via the Horizon 2020 programme

Duration: 01.12.2020-30.11.2024

Central project webpage: daphne-eu.github.io

Project Summary
The DAPHNE project aims to define and build an open and extensible system infrastructure for integrated data analysis pipelines, including data management and processing, high-performance computing (HPC), and machine learning (ML) training and scoring. Key observations are that:
(1) systems of these areas share many compilation and runtime techniques,
(2) there is a trend towards complex data analysis pipelines that combine these systems, and
(3) the underlying, increasingly heterogeneous hardware infrastructure is converging as well.

Yet the programming paradigms, cluster resource management, and data formats and representations still differ substantially. Therefore, this project, with a joint consortium of experts from the data management, ML systems, and HPC communities, aims to systematically investigate the necessary system infrastructure, language abstractions, and compilation and runtime techniques, as well as the systems and tools needed to increase productivity when building such data analysis pipelines and to eliminate unnecessary performance bottlenecks.

At the University of Basel
Throughout the hierarchy, from integrated pipelines in distributed environments down to specialized storage and accelerator devices, the scheduling of tasks (which may represent bundles of operations and data) is crucial for achieving high system utilization and throughput as well as low latency.

The HPC group at the University of Basel will develop scheduling mechanisms for the different system hierarchy levels, including compilation and runtime techniques.
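
To make the idea of scheduling tasks as bundles of data more concrete, the following is a minimal sketch of one well-known self-scheduling scheme, guided self-scheduling, in which each task covers roughly the remaining work divided by the number of workers. It is an illustration only and is not taken from the DAPHNE code base; the function name `guided_chunks` and its parameters are hypothetical.

```python
import math


def guided_chunks(num_rows, num_workers, min_chunk=1):
    """Split the row range [0, num_rows) into (start, end) tasks.

    Hypothetical helper illustrating guided self-scheduling: each task
    takes ceil(remaining / num_workers) rows, so tasks start large
    (low scheduling overhead) and shrink towards the end (better load
    balance). This is a sketch, not the project's actual scheduler.
    """
    if num_workers < 1 or min_chunk < 1:
        raise ValueError("num_workers and min_chunk must be >= 1")

    tasks = []
    start = 0
    while start < num_rows:
        remaining = num_rows - start
        # Guided rule, bounded below by min_chunk and above by what is left.
        size = max(min_chunk, math.ceil(remaining / num_workers))
        size = min(size, remaining)
        tasks.append((start, start + size))
        start += size
    return tasks


if __name__ == "__main__":
    # Example: 100 rows of a matrix shared among 4 workers.
    for start, end in guided_chunks(100, 4):
        print(f"rows [{start}, {end}) -> {end - start} rows")
```

In such a scheme, idle workers would repeatedly fetch the next task from a shared queue; the decreasing task sizes trade scheduling overhead early on against load balance near the end of the computation.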