Google Cluster Computing Faculty Training Workshop
Module III: Nutch

Meta-details
- Built to encourage public search work
- Open-source, w/ pluggable modules
- Cheap to run, both machines & admins
- Goal: search more pages, with better quality, than any other engine
- Pretty good ranking
- Has done 200M pages, more possible
- Hadoop is a spinoff

Outline
- Nutch design
  - Link database, fetcher, indexer, etc.
- Hadoop support
  - Distributed filesystem, job control

Moving Parts
- Acquisition cycle
  - WebDB
  - Fetcher
- Index generation
  - Indexing
  - Link analysis (maybe)
- Serving results

WebDB
- Contains info on all pages, links
  - URL, last download, failures, link score, content hash, ref counting
  - Source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19 ms seek time x 200M new pages/mo = 44 days of disk seeks
- Single-disk WebDB was a huge headache

Fetcher
- Fetcher is very stupid. Not a "crawler"
- Pre-MapRed: divide "to-fetch list" into k pieces, one for each fetcher machine
- URLs for one domain go to the same list, otherwise random
  - "Politeness" w/o inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS, robots caching
  - Easy parallelism
- Two outputs: pages, WebDB edits

WebDB/Fetcher Updates
1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read streams in parallel, emitting new database
4. Repeat for other tables

Indexing
- Iterate through all k page sets in parallel, constructing inverted index
- Creates a "searchable document" of:
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end-user will want
- Uses Lucene text indexer

Link analysis
- A page's relevance depends on both intrinsic and extrinsic factors
  - Intrinsic: page title, URL, text
  - Extrinsic: anchor text, link graph
- PageRank is most famous of many
- Others include:
  - HITS
  - OPIC
  - Simple incoming link count
- Link analysis is sexy, but importance generally overstated

Link analysis (2)
- Nutch performs analysis in WebDB
  - Emit a score for each known page
  - At index time, incorporate score into inverted index
- Extremely time-consuming
  - In our case, disk-consuming too (because we want to use low-memory machines)
- Fast and easy: 0.5 * log(# incoming links)

Query Processing
[Figure: the query "britney" is sent to five index partitions (Docs 0-1M, Docs 1-2M, Docs 2-3M, Docs 3-4M, Docs 4-5M); each partition returns its matching document IDs.]
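
As a small illustration of the "fast and easy" score on the Link analysis (2) slide, the sketch below computes 0.5 * log(# incoming links) for a page. The slide does not say which logarithm base is used or what happens for a page with no incoming links, so the natural log and a score of 0 for such pages are assumptions here. The function name `fast_link_score` is hypothetical and is not part of Nutch.

```python
import math


def fast_link_score(incoming_links: int) -> float:
    """Return 0.5 * log(# incoming links), using the natural log (an assumption)."""
    if incoming_links < 1:
        # Assumption: log is undefined for 0 links, so such pages score 0.
        return 0.0
    return 0.5 * math.log(incoming_links)


# Scores grow slowly: 10,000 inlinks score only twice what 100 inlinks do.
scores = {n: fast_link_score(n) for n in (0, 1, 100, 10_000)}
```

The logarithm damps the effect of very popular pages, which fits the slide's point that a simple incoming-link count can stand in for heavier link analysis.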

Administering Nutch
- Admin costs are critical
  - It's a hassle when you have 25 machines
  - Google has 100k, probably more
- Files
  - WebDB content, working files
  - Fetchlists, fetched pages
  - Link analysis outputs, working files
  - Inverted indices
- Jobs
  - Emit fetchlists, fetch, update WebDB
  - Run link analysis
  - Build inverted indices

Administering Nutch (2)
- Admin sounds boring, but it's not! Really, I swear
- Large-file maintenance
  - Google File System (Ghemawat, Gobioff, Leung)
  - Nutch Distributed File System
- Job Control
  - Map/Reduce (Dean and Ghemawat)
  - Pig (Yahoo Research)
- Data Storage (BigTable)

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange
  - Extremely large files
  - Most files read once, from start to end
  - Low admin costs per GB
- Equally strange design
  - Write-once, with delete
  - Single file can exist across many machines
  - Wholly automatic failure recovery

NDFS (2)
- Data divided into blocks
- Blocks can be copied, replicated
- Datanodes hold and serve blocks
- Namenode holds metainfo
  - Filename -> block list
  - Block -> datanode location
- Datanodes report in to namenode every few seconds

NDFS File Read
[Figure: Namenode and Datanodes 0-5. crawl.txt consists of (block-33 / datanodes 1, 4), (block-95 / datanodes 0, 2), (block-65 / datanodes 1, 4, 5).]
1. Client asks namenode for filename info
2. Namenode responds with block list, and location(s) for each block
3. Client fetches each block, in sequence, from a datanode

NDFS Replication
[Figure: Namenode and six datanodes holding blocks. Datanode 0: (33, 95). Datanode 1: (46, 95). Datanode 2: (33, 104). Datanode 3: (21, 33, 46). Datanode 4: (90). Datanode 5: (21, 90, 104).]
- Always keep at least k copies of each blk
- Imagine datanode 4 dies; blk 90 lost
- Namenode loses heartbeat, decrements blk 90's reference count
- Asks datanode 5 to replicate blk 90 to datanode 0
- Choosing replication target is tricky

Map/Reduce
- Map/Reduce is a programming model from Lisp (and other places)
  - Easy to distribute across nodes
  - Nice retry/failure semantics
- map(key, val) is run on each item in set
  - emits key/val pairs
- reduce(key, vals) is run for each unique key emitted by map()
  - emits final output
- Many problems can be phrased this way

Map/Reduce (2)
- Task: count words in docs
  - Input consists of (url, contents) pairs
  - map(key=url, val=contents):
    - For each word w in contents, emit (w, "1")
  - reduce(key=word, values=uniq_counts):
    - Sum all "1"s in values list
    - Emit result "(word, sum)"

Map/Reduce (3)
- Task: grep
  - Input consists of (url+offset, single line)
  - map(key=url+offset, val=line):
    - If contents matches regexp, emit (line, "1")
  - reduce(key=line, values=uniq_counts):
    - Don't do anything; just emit line
- We can also do graph inversion, link analysis, WebDB updates, etc.

Map/Reduce (4)
- How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition space of output map keys, and run reduce() in parallel
- If map() or reduce() fails, reexecute!

Map/Reduce Job Processing
[Figure: JobTracker and TaskTrackers 0-5 processing a "grep" job.]
1. Client submits "grep" job, indicating code and input files
2. JobTracker breaks input file into k chunks (in this case 6). Assigns work to tasktrackers.
3. After map(), tasktrackers exchange map-output to build reduce() keyspace
4. JobTracker breaks reduce() keyspace into m chunks (in this case
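
The word-count and grep tasks from the Map/Reduce (2) and (3) slides can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, not Hadoop's API and not distributed: the runner `run_mapreduce`, the helper names, and the sample URLs and text are all hypothetical, chosen to mirror the map(key, val) and reduce(key, vals) signatures on the slides.

```python
import re
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (key, val), group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for key, val in records:
        for out_key, out_val in map_fn(key, val):
            groups[out_key].append(out_val)  # the "consolidate by key" step
    results = []
    for out_key in sorted(groups):  # sorted only to make the output deterministic
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results


# Task: count words in docs. Input is (url, contents) pairs.
def wc_map(url, contents):
    for w in contents.split():
        yield (w, "1")


def wc_reduce(word, counts):
    yield (word, sum(int(c) for c in counts))


# Task: grep. Input is (url+offset, single line) pairs.
def make_grep_map(pattern):
    regexp = re.compile(pattern)

    def grep_map(url_offset, line):
        if regexp.search(line):
            yield (line, "1")

    return grep_map


def grep_reduce(line, counts):
    yield line  # nothing to compute; just emit the matching line


docs = [
    ("http://a.example/", "the cat sat"),
    ("http://b.example/", "the dog sat down"),
]
word_counts = dict(run_mapreduce(docs, wc_map, wc_reduce))

lines = [
    (("http://a.example/", 0), "the cat sat"),
    (("http://a.example/", 12), "a dog barked"),
    (("http://b.example/", 0), "the cat sat"),
]
matches = run_mapreduce(lines, make_grep_map(r"cat"), grep_reduce)
```

Because reduce() runs once per unique key, the grep sketch emits each distinct matching line once even when it occurs in several inputs. In a distributed setting the map loop and the per-key reduce loop are the two stages that would be partitioned and run in parallel.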