Modeling the Performance of MapReduce Applications for the Cloud

Iván Carrera; Claudio Geyer

Iván Carrera, National Polytechnic School
Claudio Geyer, Federal University of Rio Grande do Sul

Keywords: MapReduce, Cloud, Hadoop, FLOPs, MRBS, Performance

Abstract

In recent years, Cloud Computing has become a key technology that makes it possible to run applications without deploying a physical infrastructure. The challenge in deploying distributed applications in Cloud Computing environments is that the virtual machine infrastructure must be planned in a time- and cost-effective way.
This paper summarizes work previously presented by the authors as a Master's thesis. Its goal is to show that the execution time of a distributed MapReduce application running in a Cloud Computing environment can be predicted using a mathematical model based on theoretical specifications. The prediction is meant to help users of the Cloud Computing environment plan their deployments, i.e., decide on the number of virtual machines and their characteristics. The application execution time was measured while varying the parameters stated in the mathematical model, and linear regression was then used to fit a model of the execution time, which was applied to predict the execution time of MapReduce applications. Experiments conducted in several configurations showed a clear relation with the theoretical model, indicating that the model is able to predict the execution time of MapReduce applications. The developed model is generic, meaning that it uses theoretical abstractions for the computing capacity of the environment and the computing cost of the MapReduce application.
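The measure-then-regress procedure described above can be illustrated with a minimal sketch. It is not the authors' model: it assumes, purely for illustration, that execution time is linear in the ratio of a job's computing cost to the environment's computing capacity, T = a + b * (cost / capacity). The function names, the units, and every number in `runs` are hypothetical placeholders, not measurements from the paper.

```python
# Minimal sketch: fit T = a + b * (cost / capacity) by ordinary least squares.
# ASSUMPTION: the linear form and all data below are illustrative only;
# they are not taken from the paper or the thesis.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_time(intercept, slope, cost, capacity):
    """Predict execution time (seconds) for a job cost and a cluster capacity."""
    return intercept + slope * (cost / capacity)


# Hypothetical runs: (job cost in GFLOP, cluster capacity in GFLOPS, seconds).
runs = [
    (1000.0, 10.0, 130.0),
    (1000.0, 20.0, 80.0),
    (1000.0, 40.0, 55.0),
    (2000.0, 20.0, 130.0),
]

# The single regressor is the cost/capacity ratio of each run.
xs = [cost / capacity for cost, capacity, _ in runs]
ys = [seconds for _, _, seconds in runs]

intercept, slope = fit_line(xs, ys)

# Use the fitted line to estimate a configuration that was not measured.
estimate = predict_time(intercept, slope, cost=4000.0, capacity=40.0)
print(f"intercept={intercept:.2f} s, slope={slope:.2f}, estimate={estimate:.2f} s")
```

In a sketch like this, the intercept stands in for fixed overhead such as job startup, and the slope for how far the measured environment departs from its nominal capacity. A user planning a deployment could evaluate `predict_time` for several candidate virtual machine counts and compare the estimates against cost.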


