Resource-Efficient Distributed Computer Systems for Data-Intensive Applications

To make it easier to run data-intensive applications efficiently on computing infrastructures ranging from small devices to large-scale clusters, I work with collaborators and partners towards more adaptive management of applications and resources, following an iterative systems research approach.

Distributed Processing Systems Run On Increasingly Diverse Computing Infrastructures

Today, many organizations have to deal with very large volumes of collected data, be it to search through billions of websites, to recommend songs or TV shows to millions of users, to accurately identify genetic disorders by comparing terabytes of human genomic data, to monitor current environmental conditions in urban areas using large sensor networks, or to detect fraudulent behavior in millions of business transactions. For this, businesses, the sciences, municipalities, and other organizations large and small deploy data-intensive applications onto scalable, fault-tolerant distributed systems and large-scale distributed computing infrastructures. Prominent examples of such scalable systems include distributed dataflow and workflow systems, scalable storage and database systems, and parallel systems for machine learning and graph processing.

Meanwhile, the computing infrastructures used for data-intensive applications are becoming increasingly diverse, distributed, and dynamic. Beyond the data center, there are more edge and fog resources as well as IoT devices. This makes it possible to run applications closer to data sources and users, allowing for lower latencies, improved security and privacy, and reduced energy consumption for wide-area networking, but it also creates distinctly heterogeneous new computing environments. At the same time, data centers are also becoming more diverse. In public clouds, users can choose among hundreds of different virtual machines, including instances optimized for compute, memory, and storage or for accelerated computing. Similarly, dedicated cluster infrastructures at larger organizations are becoming more heterogeneous. Scientists at universities, for instance, often have access to several clusters, each of which may in turn contain multiple types of machines, such as machines equipped with large amounts of memory or graphics processors.

Running Data-Intensive Applications Efficiently Is Becoming a Key Challenge

The data-intensive applications that run on these distributed systems and infrastructures are created by many different users. These include software engineers, system operators, data analysts, machine learning specialists, scientists, and domain experts, but typically also users without a strong background in, or deep understanding of, parallel computer architectures and networking, distributed systems theory and implementation, or efficient data management and processing. However, it is still very difficult to run scalable systems on today’s diverse and dynamic distributed computing environments in such a way that applications provide the required performance and dependability, yet also run efficiently. Users are largely left alone with the question of how many and which resources they should allocate for their applications. Meanwhile, configuring distributed systems so that they run as required on a particular computing infrastructure is frequently not straightforward, even for expert users. Anticipating the behavior of systems on particular infrastructures is inherently difficult: it depends on many factors, there are usually numerous configuration options and large parameter spaces, and environments and workloads often change dynamically over time.

As a result, running data-intensive applications efficiently is hard, especially when given requirements for an application's performance and dependability. In fact, users often resort to overprovisioning to ensure that their requirements are met. I would even argue that, while high-level programming abstractions and distributed runtimes have made it easier to develop data-intensive applications, managing systems and infrastructures efficiently has become more difficult over the last decade. There is abundant evidence of low resource utilization, limited energy efficiency, and severe failures in applications deployed in practice that backs up this claim. Meanwhile, computing's environmental footprint already rivals aviation's and is projected to rise sharply over the next few decades. Therefore, as an increasing number of data-intensive applications are developed and deployed in businesses, in the sciences, and by municipalities and governments, it is absolutely critical, from both an economic and an ecological point of view, that computing infrastructures are used efficiently.

We Need New Methods for a More Adaptive Management of Applications and Resources

The main objective of my work is supporting organizations and users in making efficient use of computing infrastructures for their applications. Towards this goal, I work with collaborators and partners to develop methods, systems, and tools that make the implementation, testing, and operation of resource-efficient and resilient distributed systems easier. Ultimately, we aim to realize systems that adapt to diverse computing infrastructures, dynamic workloads, and high-level performance and reliability requirements fully automatically.

In line with this, we work towards adaptive resource management for data-intensive applications that run on distributed computing infrastructures, from small IoT devices to large clusters of virtual resources, in three ways:

  • Adaptive Resource Allocation: We envision systems that automatically select adequate resources (e.g., types of resources, scale-outs, usage of accelerators, communication channels) and adapt virtual infrastructures (e.g., virtual machines, containers, networks) as needed to meet specific performance objectives and constraints.
  • Dynamic Scaling & Scheduling: We envision systems that automatically adjust to changes in computing environments at runtime (e.g., re-scheduling, task migration, failure handling) based on continuous monitoring, predicted loads, detected failures, and historic behaviors in similar situations.
  • Automatic System Tuning: We envision systems that automatically tune themselves (e.g., cache and buffer sizes, internal memory allocations, failure tolerance strategies, snapshotting frequency) on the basis of previous executions of similar applications, dedicated profiling runs, and general performance models.

Central to realizing adaptive, data-driven resource management according to high-level objectives and constraints are techniques for effectively modeling the performance, dependability, and efficiency of applications, which in turn enables optimization. Additionally, we investigate methods for resource-efficient monitoring, profiling, and experimentation.
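As a purely illustrative sketch of how a performance model can drive resource allocation, the following Python snippet fits a very simple scale-out model to runtimes of previous executions and then picks the smallest scale-out predicted to meet a runtime target. The model form runtime(n) = a + b / n, the function names, and the example numbers are all assumptions made for this illustration; they are not taken from any particular system or from the methods developed in this work, which may rely on considerably richer models.

```python
from typing import Optional, Sequence, Tuple


def fit_runtime_model(scale_outs: Sequence[int],
                      runtimes: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of the assumed model runtime(n) = a + b / n.

    `a` stands for the non-parallelizable part of the runtime and `b` for
    the work that is divided among n workers (a hypothetical simplification).
    """
    if len(set(scale_outs)) < 2:
        raise ValueError("need runs at two or more distinct scale-outs")
    xs = [1.0 / n for n in scale_outs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(runtimes) / len(runtimes)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def smallest_scale_out(a: float, b: float, target_runtime: float,
                       candidates: Sequence[int]) -> Optional[int]:
    """Return the smallest candidate scale-out whose predicted runtime
    meets the target, or None if no candidate is predicted to meet it."""
    for n in sorted(candidates):
        if a + b / n <= target_runtime:
            return n
    return None


# Hypothetical runtimes (in seconds) observed in earlier runs on 2, 4, and 8 workers.
a, b = fit_runtime_model([2, 4, 8], [660.0, 360.0, 210.0])
choice = smallest_scale_out(a, b, target_runtime=200.0,
                            candidates=[2, 4, 8, 12, 16])
print(f"fitted a={a:.1f}, b={b:.1f}, chosen scale-out: {choice}")
```

Choosing the smallest scale-out that satisfies the constraint is one simple way to avoid overprovisioning under this assumed model. In practice, resource types, prices, and prediction uncertainty would also need to be taken into account.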

Empirical Systems Research Methodology

I mostly do empirical systems research. Therefore, together with collaborators and partners, I evaluate new ideas by implementing prototypes in the context of relevant open-source systems (such as Flink, Kubernetes, and FreeRTOS) and by conducting experiments on actual hardware, with exemplary applications and real-world data. For this, we use state-of-the-art infrastructures, including diverse cluster infrastructures, private and public clouds, as well as IoT devices and sensors.

I believe iterative processes and short feedback cycles are vital for research. I therefore follow a multi-stage approach to research: first presenting new ideas in focused workshops and work-in-progress tracks of conferences, then submitting rigorous results to the main tracks of renowned international conferences, and finally compiling more extensive findings into comprehensive journal articles. At the same time, I am convinced that it is essential to also be involved in applied and interdisciplinary research projects, in order to experience relevant problems directly and to uncover new opportunities for well-motivated, impactful research.

Believing in the value of scientific discourse and feedback, I frequently collaborate with researchers from other institutions and actively participate in the international scientific community, including by taking on academic service roles. Moreover, I make results available to the public as fully and as early as possible through openly accessible publications, open-source software prototypes, and research-based university teaching.