Abstract:
Inappropriate parallelism parameter may result in the performance degradation on in-memory computing framework. For this issue, we analyze the execution mechanism of Spark jobs, establish job scheduling model, and give the definition of the computing cost, resource idle rate and spill probability. Based on the analysis of the relationship between parallelism parameter and job execution efficiency, the optimization objective of algorithm is given. To solve the problem of optimizing, a parallelism deduction algorithm (PDA) for in-memory computing framework is proposed. Firstly, PDA calculates the best parallelism of job execution by size of input data, worker computing resource and additional overhead of spill, and thus enhances the resource utilization of cluster and speeds up the state synchronization of job execution. The algorithm optimizes the task scheduling for each Stage, accelerates the job execution and improves the calculation efficiency. Experiment results demonstrate that the proposed algorithm can improve the computational efficiency of in-memory computing framework and accelerate data-intensive and compute-intensive applications.