Resource-Efficient and Reliable Systems for Data-Intensive Applications

Aiming to make it easier to run data-intensive applications efficiently on computing infrastructures ranging from small devices to large-scale clusters, I work towards more adaptive resource management, together with research students and collaborators, following an iterative systems research approach.

Data-Intensive Applications on Increasingly Diverse Distributed Infrastructures

Today, many organizations have to deal with very large volumes of collected data, be it to search through billions of websites, to recommend songs or TV shows to millions of users, to accurately identify genetic disorders by comparing terabytes of human genomic data, to monitor current environmental conditions in urban areas using large sensor networks, or to detect fraudulent behavior in millions of business transactions. For this, businesses, the sciences, municipalities, and other large and small organizations deploy data-intensive applications on scalable, fault-tolerant distributed systems and large-scale distributed computing infrastructures. Prominent examples of such scalable systems include distributed dataflow and workflow systems, scalable storage and database systems, and parallel systems for machine learning and graph processing.

The computing infrastructures used for data-intensive applications are becoming increasingly diverse, distributed, and dynamic. Beyond the data center, there are more and more edge and fog resources as well as IoT devices. This makes it possible to run applications closer to data sources and users, allowing for lower latencies, improved security and privacy, and reduced energy consumption for wide-area networking, but it also creates distinctly heterogeneous new computing environments. At the same time, data centers themselves are becoming more diverse. In public clouds, users can choose among hundreds of different virtual machine types, including instances optimized for compute, memory, storage, or accelerated computing. Similarly, dedicated cluster infrastructures at larger organizations are becoming more heterogeneous. Scientists at universities, for instance, often have access to several clusters, each potentially with multiple different types of machines, such as machines equipped with large amounts of memory or with graphics processors.

Difficulty of Running Data-Intensive Applications Efficiently

The data-intensive applications that run on these distributed systems and infrastructures are created by many different users. These include software engineers, system operators, data analysts, machine learning specialists, scientists, and domain experts, and thus typically also users without a strong background in parallel computer architectures and networking, distributed systems theory and implementation, or efficient data management. At the same time, it is still very difficult to run data-intensive applications on today's diverse and dynamic distributed computing environments so that the applications provide the required performance and reliability, yet also run efficiently. Users are largely left alone with the question of how many and which resources they should use for their applications. Meanwhile, configuring scalable fault-tolerant distributed systems so that they run as required on particular computing infrastructures is frequently not straightforward even for expert users. Anticipating the runtime behavior of these systems on a given infrastructure is difficult: it depends on many factors, there are usually numerous configuration options and large parameter spaces, and environments and workloads often change dynamically over time.

As a result, running data-intensive applications efficiently is hard, especially when applications need to meet given performance and reliability requirements. In fact, users often resort to overprovisioning to ensure that these requirements are met at all. I further argue that, while high-level programming abstractions and comprehensive processing frameworks have made it easier to develop data-intensive applications, efficiently operating the systems and infrastructures for data-intensive applications has become more difficult over the last decade. This claim is backed by abundant evidence of low resource utilization, limited energy efficiency, and severe failures in applications deployed in practice, while computing's environmental footprint already rivals that of aviation and is projected to increase further over the next few years. Therefore, as an increasing number of data-intensive applications are developed and deployed in businesses, in the sciences, and by municipalities and governments, it is critical from both an economic and an ecological point of view that computing infrastructures are used efficiently.

Methods for a More Adaptive Resource Management

The main objective of my work is to support organizations and users in making efficient use of computing infrastructures for their data-intensive applications. Towards this goal, I work with research students and collaborators to develop methods, systems, and tools that make the implementation, testing, and operation of resource-efficient and sustainable data-intensive applications easier. Ultimately, we aim to realize systems that adapt fully automatically to dynamic workloads, changing computing infrastructures, and given performance and reliability requirements.

More specifically, we work towards adaptive resource management for data-intensive applications that run in distributed computing environments, from small IoT devices to large clusters of virtual resources, in three ways:

  • Adaptive Resource Allocation: We envision systems that automatically select adequate resources (e.g., types of resources, scale-outs, usage of accelerators, communication channels) and adapt virtual infrastructures (e.g., virtual machines, containers, networks) as needed to meet specific performance objectives and constraints.
  • Dynamic Scaling & Scheduling: We envision systems that automatically adjust to changes in computing environments at runtime (e.g., re-scheduling, task migration, failure handling) based on continuous monitoring, predicted loads, detected failures, and historic behaviors in similar situations; a minimal sketch of such a scaling decision follows this list.
  • Automatic System Tuning: We envision systems that automatically tune themselves (e.g., cache and buffer sizes, internal memory allocations, failure tolerance strategies, snapshotting frequency) on the basis of previous executions of similar applications, dedicated profiling runs, and general performance models.
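
To make the dynamic scaling direction more concrete, below is a minimal Python sketch of one such scaling decision: it derives a target worker count from a predicted load and a per-worker processing rate. The function name, the numbers, and the headroom heuristic are illustrative assumptions, not the interface or policy of any particular system.

    import math

    def target_parallelism(predicted_records_per_sec: float,
                           records_per_sec_per_worker: float,
                           headroom: float = 0.2,
                           min_workers: int = 1,
                           max_workers: int = 64) -> int:
        """Workers needed to sustain the predicted load with some headroom,
        clamped to the allowed range (illustrative heuristic)."""
        required = predicted_records_per_sec * (1.0 + headroom) / records_per_sec_per_worker
        return max(min_workers, min(max_workers, math.ceil(required)))

    # Example: 120,000 records/s predicted, 10,000 records/s per worker -> 15 workers.
    print(target_parallelism(120_000, 10_000))

In an actual system, the predicted load would come from monitoring and forecasting components, and the resulting worker count would be applied through the respective system's scaling interface rather than printed.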

Central to realizing adaptive, data-driven resource management according to high-level, user-defined objectives and constraints are techniques for effectively modeling the performance, reliability, and efficiency of data-intensive applications, since such models form the basis for optimization. In addition, we investigate methods for resource-efficient monitoring, profiling, and experimentation.
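
As a small illustration of such modeling, the following sketch fits a simple parametric scale-out model to a handful of previous executions and then picks the smallest scale-out that is predicted to meet a runtime target. The model form, the data points, and the target are assumptions made purely for illustration and do not describe a specific result of our work.

    import math
    import numpy as np

    # Illustrative previous executions: (scale-out, observed runtime in seconds).
    runs = [(2, 620.0), (4, 350.0), (8, 215.0), (16, 155.0)]

    # Fit runtime(n) ~ a + b/n + c*log(n) with least squares (assumed model form).
    n = np.array([float(s) for s, _ in runs])
    y = np.array([t for _, t in runs])
    X = np.column_stack([np.ones_like(n), 1.0 / n, np.log(n)])
    (a, b, c), *_ = np.linalg.lstsq(X, y, rcond=None)

    def predicted_runtime(scale_out: int) -> float:
        return a + b / scale_out + c * math.log(scale_out)

    # Smallest scale-out predicted to meet a 150-second runtime target, if any.
    feasible = [s for s in range(1, 65) if predicted_runtime(s) <= 150.0]
    print(feasible[0] if feasible else "no feasible scale-out up to 64")

Reliability and efficiency can be modeled along similar lines, and such models can then drive the resource allocation, scaling, and tuning decisions outlined above.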

Empirical Systems Research Methodology

I mostly do empirical systems research: together with research students and collaborators, I evaluate new ideas by implementing them prototypically in the context of relevant open-source systems (such as Flink, Kubernetes, and FreeRTOS) and by conducting experiments on actual hardware, with exemplary applications and real-world data. For this, we use state-of-the-art infrastructures, including diverse cluster infrastructures, private and public clouds, as well as IoT devices and sensors.

I believe iterative processes and short feedback cycles are vital for research. I therefore follow a multi-stage approach: first presenting new ideas at focused workshops and in work-in-progress tracks of conferences, then submitting rigorously researched results to the main tracks of renowned international conferences, before compiling extensive findings into comprehensive journal articles. At the same time, I am convinced that it is essential to also be involved in applied and interdisciplinary research projects, to directly experience relevant problems and to uncover opportunities for well-motivated, impactful research.

Believing in the value of scientific discourse and feedback, I interact and collaborate frequently with other research groups and actively participate in the international scientific community, taking on academic service roles. Moreover, I make results available to the public as widely and as early as possible through openly accessible publications, open-source software prototypes, and research-based university teaching.