Large-Scale Data Processing / Cloud Computing (大规模数据处理云计算.ppt)
Large-Scale Data Processing / Cloud Computing
Lecture 3: MapReduce Basics
Yan Hongfei (闫宏飞)
School of Information Science and Technology (信息科学技术学院), Peking University
7/12/2011
http:/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Jimmy Lin, University of Maryland
Course development: SEWM Group

How do we scale up?
Source: Wikipedia (IBM Roadrunner)

Divide and Conquer
[Figure: the “Work” is partitioned into units w1, w2, w3; each unit goes to a “worker”, which produces a partial result r1, r2, r3; the partial results are combined into the “Result”.]

Parallelization Challenges
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?
What is the common theme of all of these problems?

Common Theme?
- Parallelization problems arise from:
  - Communication between workers (e.g., to exchange state)
  - Access to shared resources (e.g., data)
- Thus, we need a synchronization mechanism
Source: Ricardo Guimarães Herrmann

Managing Multiple Workers
- Difficult because
  - We don't know the order in which workers run
  - We don't know when workers interrupt each other
  - We don't know the order in which workers access shared data
- Thus, we need:
  - Semaphores (lock, unlock)
  - Condition variables (wait, notify, broadcast)
  - Barriers
- Still, lots of problems:
  - Deadlock, livelock, race conditions...
  - Dining philosophers, sleepy barbers, cigarette smokers...
- Moral of the story: be careful!

Current Tools
- Programming models
  - Shared memory (pthreads)
  - Message passing (MPI)
- Design patterns
  - Master-slaves
  - Producer-consumer flows
  - Shared work queues
[Figure: message passing among processes P1 to P5; shared memory, with P1 to P5 attached to a common Memory; the master-slaves, producer-consumer, and work queue patterns.]

Where the rubber meets the road
- Concurrency is difficult to reason about
- Concurrency is even more difficult to reason about
  - At the scale of datacenters (even across datacenters)
  - In the presence of failures
  - In terms of multiple interacting services
- Not to mention debugging...
- The reality:
  - Lots of one-off solutions, custom code
  - Write your own dedicated library, then program with it
  - Burden on the programmer to explicitly manage everything
Source: Wikipedia (Flat Tire)
Source: MIT Open Courseware
Source: MIT Open Courseware
Source: Harper's (Feb, 2008)

What's the point?
- It's all about the right level of abstraction
  - The von Neumann architecture has served us well, but is no longer appropriate for the multi-core/cluster environment
- Hide system-level details from the developers
  - No more race conditions, lock contention, etc.
- Separating the what from the how
  - Developer specifies the computation that needs to be performed
  - Execution framework (“runtime”) handles actual execution
The datacenter is the computer!

“Big Ideas”
- Scale “out”, not “up”
  - Limits of SMP and large shared-memory machines
- Move processing to the data
  - Clusters have limited bandwidth
- Process data sequentially, avoid random access
  - Seeks are expensive, disk throughput is reasonable
- Seamless scalability
  - From the mythical man-month to the tradable machine-hour

MapReduce

Roots in Functional Programming
[Figure: Map applies a function f to every element of a list; Fold combines the elements with a function g.]

Typical Large-Data Problem
- Iterate over a large number of records
- Extract something of interest from each
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output
Key idea: provide a functional abstraction for these two operations (Map and Reduce)
(Dean and Ghemawat, OSDI 2004)

MapReduce
- Programmers specify two functions:
  map (k, v) → <k', v'>*
  reduce (k', v') → <k', v'>*
  - All values with the same key are sent to the same reducer
- The execution framework handles everything else...
[Figure: four map tasks read input pairs (k1, v1) through (k6, v6) and emit intermediate pairs keyed a, b, and c. “Shuffle and Sort: aggregate values by keys” groups them as a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]. Three reduce tasks emit (r1, s1), (r2, s2), (r3, s3).]
What's “everything else”?

MapReduce “Runtime”
- Handles scheduling
  - Assigns workers to map and reduce tasks
- Handles “data distribution”
  - Moves processes to data
- Handles synchronization
  - Gathers, sorts, and shuffles intermediate data
- Handles errors and faults
  - Detects worker failures and restarts
- Everything happens on top of a distributed FS (later)

MapReduce
- Programmers specify two functions:
  map (k, v) → <k', v'>*
  reduce (k', v') → <k', v'>*
  - All values with the same key are reduced together
- The execution framework handles everything else...
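
The map, shuffle-and-sort, and reduce stages described in these slides can be sketched in a few lines of Python. This is a minimal single-process illustration, not how a real MapReduce runtime works: the names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the word-count task itself, are assumptions chosen for the example. One deliberate simplification is that the reducer here receives the whole list of values grouped under a key.

```python
from collections import defaultdict
from itertools import chain


def map_fn(doc_id, text):
    """map (k, v) -> list of (k', v'): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """reduce (k', [v']) -> list of (k', v'): sum the counts for one word."""
    return [(word, sum(counts))]


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the execution framework."""
    # Map phase: apply the mapper to every input record.
    intermediate = chain.from_iterable(mapper(k, v) for k, v in records)

    # Shuffle and sort: group values by key so that all values
    # with the same key reach the same reducer call.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: call the reducer once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


docs = [("d1", "a b"), ("d2", "c c"), ("d3", "a c"), ("d4", "b c")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)  # [('a', 2), ('b', 2), ('c', 4)]
```

The programmer writes only `map_fn` and `reduce_fn`. Everything inside `run_mapreduce` stands in for what the slides call the runtime, which on a real cluster would also handle scheduling, data distribution, and worker failures.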