IEEE Transactions on Parallel and Distributed Systems (TPDS)
Amelie Chi Zhou¹, Weilin Xue¹, Yao Xiao¹, Bingsheng He², Shadi Ibrahim³, Reynold Cheng⁴
¹Shenzhen University, ²National University of Singapore, ³Inria, ⁴University of Hong Kong
Abstract
In many data-intensive applications, workflows are widely used to organize data processing tasks, and resource provisioning is an important and challenging problem for improving workflow performance. Recently, system variations in the cloud and large-scale clusters, such as those in I/O and network performance and failure events, have been observed to greatly affect the performance of workflows. Traditional resource provisioning methods, which overlook these variations, can lead to suboptimal resource provisioning results. In this article, we provide a general solution for workflow performance optimization that considers system variations. Specifically, we model system dynamics as time-dependent random variables and take their probability distributions as optimization input. Despite its effectiveness, this solution involves heavy computation overhead. Thus, we propose three pruning techniques to simplify workflow structure and reduce the probability evaluation overhead. We implement our techniques in a runtime library, which allows users to incorporate efficient probabilistic optimization into existing resource provisioning methods. Experiments show that probabilistic solutions can improve performance by up to 65 percent compared to state-of-the-art static solutions, and that our pruning techniques can greatly reduce the overhead of our probabilistic approach.
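To make the idea of taking probability distributions as optimization input more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small diamond-shaped workflow (A, then B and C in parallel, then D) and uses plain Monte Carlo sampling, chosen here only for simplicity, to estimate the probability that the workflow finishes within a deadline. All names (`sample_makespan`, `prob_within_deadline`, `noisy`, the task names, and the numbers) are hypothetical, and time dependence of the distributions is not modeled.

```python
import random


def sample_makespan(samplers, preds, rng):
    """Draw one makespan sample for a workflow DAG.

    samplers: dict mapping task name -> function(rng) returning one sampled
              execution time; keys must be listed in topological order.
    preds:    dict mapping task name -> list of predecessor task names.
    """
    finish = {}
    for task, draw in samplers.items():
        # A task starts once all of its predecessors have finished.
        start = max((finish[p] for p in preds.get(task, [])), default=0.0)
        finish[task] = start + draw(rng)
    return max(finish.values())


def prob_within_deadline(samplers, preds, deadline, n_samples=5000, seed=0):
    """Estimate P(makespan <= deadline) by Monte Carlo sampling."""
    rng = random.Random(seed)
    hits = sum(
        sample_makespan(samplers, preds, rng) <= deadline
        for _ in range(n_samples)
    )
    return hits / n_samples


def noisy(mean, spread):
    """Hypothetical sampler: time uniform in [mean - spread, mean + spread]."""
    return lambda rng: rng.uniform(mean - spread, mean + spread)


# Assumed example workflow: A -> {B, C} -> D.
PREDS = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
# "Static" view: every task takes exactly its mean time.
STATIC = {"A": noisy(1, 0), "B": noisy(2, 0), "C": noisy(5, 0), "D": noisy(1, 0)}
# "Dynamic" view: the same means, but with performance variation.
DYNAMIC = {"A": noisy(1, 0.5), "B": noisy(2, 1), "C": noisy(5, 2), "D": noisy(1, 0.5)}

if __name__ == "__main__":
    for name, model in (("static", STATIC), ("dynamic", DYNAMIC)):
        print(name, prob_within_deadline(model, PREDS, deadline=7.5))
```

In this sketch the static model always reports that the deadline is met, while the model with variation reports only a probability of meeting it, which is the kind of gap a probabilistic optimizer is meant to account for. Repeated sampling of this sort also hints at why probability evaluation is costly on large workflows.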
Fig. 1. (a) Spatial and (b) temporal features of the I/O and network performance distributions of Windows Azure instances.
Fig. 2. Failure dynamics in Google trace: (a) failure interval distributions of four types of tasks; (b) relationship between task execution time and MTBF.
Fig. 4. An example of pre-processing for Montage workflow.
Fig. 8. Normalized results of budget-constrained scheduling under tight and loose budget on Amazon EC2.
Fig. 10. Normalized results of budget-constrained scheduling under tight and loose budget on Windows Azure.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61802260, in part by the Shenzhen Science and Technology Foundation under Grant JCYJ20180305125737520, and in part by the Natural Science Foundation of SZU under Grant 000370. The work of Bingsheng He was supported in part by a collaborative grant from Microsoft Research Asia. The work of Shadi Ibrahim was supported by the ANR KerStream Project under Grant ANR-16-CE25-0014-01. The work of Reynold Cheng was supported in part by the Research Grants Council of HK through RGC Projects HKU under Grants 17229116, 106150091, and 17205115, in part by the University of HK under Grants 104004572, 102009508, and 104004129, and in part by the Innovation and Technology Commission of HK through ITF Project MRP/029/18.
Bibtex
@ARTICLE{9462122,
author={Zhou, Amelie Chi and Xue, Weilin and Xiao, Yao and He, Bingsheng and Ibrahim, Shadi and Cheng, Reynold},
journal={IEEE Transactions on Parallel and Distributed Systems},
title={Taming System Dynamics on Resource Optimization for Data Processing Workflows: A Probabilistic Approach},
year={2022},
volume={33},
number={1},
pages={231-248},
}