Module5-HadoopTechnicalReview.ppt
Google Cluster Computing Faculty Training Workshop
Module V: Hadoop Technical Review
Spinnaker Labs, Inc.

Overview
- Hadoop technical walkthrough
- HDFS
- Databases
- Using Hadoop in an academic environment
- Performance tips and other tools

You Say, "Tomato"
Google calls it:   Hadoop equivalent:
MapReduce          Hadoop
GFS                HDFS
Bigtable           HBase
Chubby             (nothing yet, but planned)

Some MapReduce Terminology
- Job: a "full program", an execution of a Mapper and Reducer across a data set
- Task: an execution of a Mapper or a Reducer on a slice of data, a.k.a. Task-In-Progress (TIP)
- Task Attempt: a particular instance of an attempt to execute a task on a machine

Terminology Example
- Running "Word Count" across 20 files is one job
- 20 files to be mapped imply 20 map tasks, plus some number of reduce tasks
- At least 20 map task attempts will be performed; more if a machine crashes, etc.

Task Attempts
- A particular task will be attempted at least once, possibly more times if it crashes
- If the same input causes crashes over and over, that input will eventually be abandoned
- Multiple attempts at one task may occur in parallel with speculative execution turned on
- The task ID from TaskInProgress is not a unique identifier; don't use it that way

MapReduce: High Level
(architecture diagram slide)

Node-to-Node Communication
- Hadoop uses its own RPC protocol
- All communication begins in slave nodes; this prevents circular-wait deadlock
- Slaves periodically poll for a "status" message
- Classes must provide explicit serialization
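The slave-initiated polling pattern above can be sketched as a toy model. This is plain Java, not Hadoop's actual RPC classes; all class and method names here are illustrative. The point is the direction of communication: the master never calls into a slave, so no cycle of mutual waiting can form.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of slave-initiated communication: the master holds a work queue
// and only ever answers polls; slaves pull tasks until none remain.
public class PollingModel {
    static class Master {
        private final Queue<String> pending = new ArrayDeque<>();
        Master(List<String> tasks) { pending.addAll(tasks); }
        // Called only by slaves; returns null when no work remains.
        synchronized String pollForTask() { return pending.poll(); }
    }

    static class Slave {
        final List<String> completed = new ArrayList<>();
        void run(Master master) {
            String task;
            while ((task = master.pollForTask()) != null) {
                completed.add(task);  // stand-in for executing the task
            }
        }
    }

    public static void main(String[] args) {
        Master master = new Master(List.of("map-0", "map-1", "map-2"));
        Slave slave = new Slave();
        slave.run(master);
        System.out.println(slave.completed);  // [map-0, map-1, map-2]
    }
}
```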
Nodes, Trackers, Tasks
- The master node runs a JobTracker instance, which accepts job requests from clients
- TaskTracker instances run on slave nodes
- A TaskTracker forks a separate Java process for each task instance

Job Distribution
- MapReduce programs are contained in a Java jar file plus an XML file containing serialized program configuration options
- Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
- Where's the data distribution?

Data Distribution
- Implicit in the design of MapReduce! All mappers are equivalent, so each node maps whatever data is local to it in HDFS
- If lots of data does happen to pile up on one node, nearby nodes will map instead
- Data transfer is handled implicitly by HDFS

Configuring With JobConf
- MapReduce programs have many configurable options
- JobConf objects hold (key, value) components mapping String keys to values, e.g. "mapred.map.tasks" → 20
- JobConf is serialized and distributed before running the job
- Objects implementing JobConfigurable can retrieve elements from a JobConf

What Happens in MapReduce? Depth First

Job Launch Process: Client
- The client program creates a JobConf
- Identifies classes implementing the Mapper and Reducer interfaces: JobConf.setMapperClass(), setReducerClass()
- Specifies inputs and outputs: JobConf.setInputPath(), setOutputPath()
- Optionally, other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat()

Job Launch Process: JobClient
- Pass the JobConf to JobClient.runJob() or submitJob(); runJob() blocks, submitJob() does not
- JobClient determines the proper division of input into InputSplits
- Sends job data to the master JobTracker server

Job Launch Process: JobTracker
- JobTracker inserts the jar and JobConf (serialized to XML) in a shared location
- Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker
- TaskTrackers running on slave nodes periodically query the JobTracker for work
- Retrieve the job-specific jar and config
- Launch the task in a separate instance of Java; main() is provided by Hadoop
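The configuration round trip above (String key/value pairs, serialized to XML, restored on the other side) can be sketched with java.util.Properties as a stand-in for JobConf; the real class lives in Hadoop's mapred package. The key name mapred.map.tasks comes from the slides; everything else here is illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// JobConf holds String (key, value) pairs and is serialized to XML before the
// job is distributed; Properties supports the same round trip in the stdlib.
public class ConfRoundTrip {
    static byte[] serialize(Properties conf) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        conf.storeToXML(out, "job configuration");
        return out.toByteArray();
    }

    static Properties deserialize(byte[] xml) throws IOException {
        Properties conf = new Properties();
        conf.loadFromXML(new ByteArrayInputStream(xml));
        return conf;
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.setProperty("mapred.map.tasks", "20");  // key name from the slides
        Properties restored = deserialize(serialize(conf));
        System.out.println(restored.getProperty("mapred.map.tasks"));  // 20
    }
}
```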
Job Launch Process: Task
- TaskTracker.Child.main():
  - Sets up the child TaskInProgress attempt
  - Reads the XML configuration
  - Connects back to the necessary MapReduce components via RPC
  - Uses TaskRunner to launch the user process

Job Launch Process: TaskRunner
- TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper
- The task knows ahead of time which InputSplits it should be mapping
- Calls the Mapper once for each record retrieved from the InputSplit
- Running the Reducer is much the same

Creating the Mapper
- You provide the instance of Mapper; it should extend MapReduceBase
- One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
- It exists in a separate process from all other instances of Mapper: no data sharing!

Mapper
void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
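The map() contract above can be sketched without Hadoop's types by using a plain BiConsumer as a stand-in for the OutputCollector. This is a word-count-style mapper; the names and types here are illustrative, not the actual Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Sketch of the Mapper contract: called once per input record, emitting
// (key, value) pairs through a collector it does not own.
public class WordCountMap {
    // Stand-in for map(WritableComparable key, Writable value,
    //                  OutputCollector output, Reporter reporter):
    // the key (a byte offset) is ignored; each word is emitted with count 1.
    static void map(long offset, String line, BiConsumer<String, Integer> output) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                output.accept(word, 1);
            }
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        map(0L, "to be or not to be", (k, v) -> emitted.add(k + ":" + v));
        System.out.println(emitted);  // [to:1, be:1, or:1, not:1, to:1, be:1]
    }
}
```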
What Is Writable?
- Hadoop defines its own "box" classes for strings (Text), integers (IntWritable), etc.
- All values are instances of Writable
- All keys are instances of WritableComparable

Writing for Cache Coherency
Allocating a fresh object per record:

    while (more input exists) {
        myIntermediate = new intermediate(input);
        myIntermediate.process();
        export outputs;
    }

Reusing one object across records:

    myIntermediate = new intermediate(junk);
    while (more input exists) {
        myIntermediate.setupState(input);
        myIntermediate.process();
        export outputs;
    }

- Running the GC takes time; reusing locations allows better cache usage
- The speedup can be as much as two-fold
- All serializable types must be Writable anyway, so make use of the interface

Getting Data to the Mapper: Reading Data
- Data sets are specified by InputFormats
  - Defines the input data (e.g., a directory)
  - Identifies the partitions of the data that form an InputSplit
  - Factory for RecordReader objects to extract (k, v) records from the input source

FileInputFormat and Friends
- TextInputFormat: treats each \n-terminated line of a file as a value
- KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v"
- SequenceFileInputFormat: binary file of (k, v) pairs with some additional metadata
- SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())

Filtering File Inputs
- FileInputFormat will read all files out of a specified directory and send them to the mapper
- It delegates filtering of this file list to a method subclasses may override
  - e.g., create your own "xyzFileInputFormat" to read *.xyz from the directory list

Record Readers
- Each InputFormat provides its own RecordReader implementation
- Provides (unused?) capability multiplexing
- LineRecordReader: reads a line from a text file
- KeyValueRecordReader: used by KeyValueTextInputFormat

Input Split Size
- FileInputFormat will divide large files into chunks; the exact size is controlled by mapred.min.split.size
- RecordReaders receive the file, offset, and length of the chunk
- Custom InputFormat implementations may override the split size, e.g. a "NeverChunkFile" format
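The Writable "box" idea above boils down to a pair of methods, write(DataOutput) and readFields(DataInput), for explicit serialization. A self-contained sketch follows: the interface and IntBox class are my own minimal re-creations, not Hadoop's io package, though the two method names match Hadoop's Writable contract.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal re-creation of the Writable idea: a box class that knows how to
// serialize and deserialize itself explicitly.
public class WritableSketch {
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static class IntBox implements Writable {
        int value;
        IntBox(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    static byte[] toBytes(Writable w) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        w.write(new DataOutputStream(buf));
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        IntBox box = new IntBox(42);
        byte[] bytes = toBytes(box);  // 4 bytes, big-endian
        IntBox restored = new IntBox(0);
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(restored.value);  // 42
    }
}
```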
Sending Data to Reducers
- The map function receives an OutputCollector object
- OutputCollector.collect() takes (k, v) elements
- Any (WritableComparable, Writable) pair can be used

WritableComparator
- Compares WritableComparable data; by default calls WritableComparable.compareTo()
- Can provide a fast path for serialized data
- Registered via JobConf.setOutputValueGroupingComparator()

Sending Data to the Client
- The Reporter object sent to the Mapper allows simple asynchronous feedback:
  - incrCounter(Enum key, long amount)
  - setStatus(String msg)
- Allows self-identification of input: InputSplit getInputSplit()

Partition A
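The "fast path for serialized data" above means comparing records byte-for-byte without deserializing them at all. For non-negative integers written big-endian (as DataOutput.writeInt produces), unsigned lexicographic byte order agrees with numeric order, so the raw comparison is safe. A sketch of that invariant, using the stdlib rather than Hadoop's comparator classes:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Fast-path comparison: compare serialized forms directly, skipping
// deserialization. Valid here because big-endian encodings of non-negative
// ints sort the same way under unsigned byte order as under numeric order.
public class RawComparator {
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();  // big-endian
    }

    // Unsigned lexicographic comparison of two serialized values.
    static int compareRaw(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        int x = 7, y = 300;
        int raw = compareRaw(encode(x), encode(y));
        int boxed = Integer.compare(x, y);
        System.out.println(Integer.signum(raw) == Integer.signum(boxed));  // true
    }
}
```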