Google Cloud Computing Solutions 3: Metadata Applications

Uploaded by: 我*** | Document ID: 134417832 | Uploaded: 2020-06-05 | Format: PPT | Pages: 25 | Size: 274 KB

Google Cluster Computing Faculty Training Workshop
Module III: Nutch

Meta-details
- Built to encourage public search work
  - Open source, with pluggable modules
  - Cheap to run, in both machines and admins
- Goal: search more pages, with better quality, than any other engine
  - Pretty good ranking
  - Has done 200M pages, more possible
- Hadoop is a spinoff

Outline
- Nutch design
  - Link database, fetcher, indexer, etc.
- Hadoop support
  - Distributed filesystem, job control

Moving Parts
- Acquisition cycle
  - WebDB
  - Fetcher
- Index generation
  - Indexing
  - Link analysis (maybe)
- Serving results

WebDB
- Contains info on all pages and links
  - Pages: URL, last download, failures, link score, content hash, ref counting
  - Links: source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19 ms seek time x 200M new pages/mo = 44 days of disk seeks
- Single-disk WebDB was a huge headache

Fetcher
- Fetcher is very stupid. Not a "crawler"
- Pre-MapRed: divide the "to-fetch list" into k pieces, one for each fetcher machine
- URLs for one domain go to the same list, otherwise random
  - "Politeness" w/o inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS, robots caching
  - Easy parallelism
- Two outputs: pages, WebDB edits

WebDB/Fetcher Updates
[Diagram labels: WebDB, Fetcher edits]
1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read streams in parallel, emitting new database
4. Repeat for other tables

Indexing
- Iterate through all k page sets in parallel, constructing inverted index
- Creates a "searchable document" of:
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end user will want
- Uses the Lucene text indexer

Link analysis
- A page's relevance depends on both intrinsic and extrinsic factors
  - Intrinsic: page title, URL, text
  - Extrinsic: anchor text, link graph
- PageRank is the most famous of many
- Others include:
  - HITS
  - OPIC
  - Simple incoming link count
- Link analysis is sexy, but its importance is generally overstated

Link analysis (2)
- Nutch performs analysis in WebDB
  - Emit a score for each known page
  - At index time, incorporate score into inverted index
- Extremely time-consuming
  - In our case, disk-consuming too (because we want to use low-memory machines)
- Fast and easy: 0.5 * log(incoming links)

Query Processing
[Diagram: the query "britney" is sent to five index partitions (Docs 0-1M, Docs 1-2M, Docs 2-3M, Docs 3-4M, Docs 4-5M). Each partition returns its matching documents (Ds 1, 29; Ds 1.2M, 1.7M; Ds 2.3M, 2.9M; Ds 3.1M, 3.2M; Ds 4.4M, 4.5M), which are combined into the final result list: 1.2M, 4.4M, 29.]

Administering Nutch
- Admin costs are critical
  - It's a hassle when you have 25 machines
  - Google has 100k, probably more
- Files
  - WebDB content, working files
  - Fetchlists, fetched pages
  - Link analysis outputs, working files
  - Inverted indices
- Jobs
  - Emit fetchlists, fetch, update WebDB
  - Run link analysis
  - Build inverted indices

Administering Nutch (2)
- Admin sounds boring, but it's not
  - Really
  - I swear
- Large-file maintenance
  - Google File System (Ghemawat, Gobioff, Leung)
  - Nutch Distributed File System
- Job control
  - Map/Reduce (Dean and Ghemawat)
  - Pig (Yahoo Research)
- Data storage (BigTable)

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange
  - Extremely large files
  - Most files read once, from start to end
  - Low admin costs per GB
- Equally strange design
  - Write-once, with delete
  - Single file can exist across many machines
  - Wholly automatic failure recovery

NDFS (2)
- Data divided into blocks
- Blocks can be copied, replicated
- Datanodes hold and serve blocks
- Namenode holds metainfo
  - Filename -> block list
  - Block -> datanode location
- Datanodes report in to namenode every few seconds

NDFS File Read
[Diagram: Namenode; Datanode 0 through Datanode 5]
1. Client asks namenode for filename info
2. Namenode responds with block list, and location(s) for each block
3. Client fetches each block, in sequence, from a datanode
- Example: "crawl.txt" maps to (block 33 / datanodes 1, 4), (block 95 / datanodes 0, 2), (block 65 / datanodes 1, 4, 5)

NDFS Replication
[Diagram: Namenode; Datanode 0 (33, 95); Datanode 1 (46, 95); Datanode 2 (33, 104); Datanode 3 (21, 33, 46); Datanode 4 (90); Datanode 5 (21, 90, 104)]
- Always keep at least k copies of each blk
- Imagine datanode 4 dies; its copy of blk 90 is lost
- Namenode loses heartbeat, decrements blk 90's reference count

- Namenode asks datanode 5 to replicate blk 90 to datanode 0 [diagram label: "Blk 90 to dn 0"]
- Choosing replication target is tricky

Map/Reduce
- Map/Reduce is a programming model from Lisp (and other places)
  - Easy to distribute across nodes
  - Nice retry/failure semantics
- map(key, val) is run on each item in set
  - emits key/val pairs
- reduce(key, vals) is run for each unique key emitted by map()
  - emits final output
- Many problems can be phrased this way

Map/Reduce (2)
- Task: count words in docs
  - Input consists of (url, contents) pairs
  - map(key=url, val=contents):
    - For each word w in contents, emit (w, "1")
  - reduce(key=word, values=uniq_counts):
    - Sum all "1"s in values list
    - Emit result (word, sum)

Map/Reduce (3)
- Task: grep
  - Input consists of (url+offset, single line)
  - map(key=url+offset, val=line):
    - If contents matches regexp, emit (line, "1")
  - reduce(key=line, values=uniq_counts):
    - Don't do anything; just emit line
- We can also do graph inversion, link analysis, WebDB updates, etc.

Map/Reduce (4)
- How is this distributed?
  - Partition input key/value pairs into chunks, run map() tasks in parallel
  - After all map()s are complete, consolidate all emitted values for each unique emitted key
  - Now partition space of output map keys, and run reduce() in parallel
- If map() or reduce() fails, reexecute

Map/Reduce Job Processing
[Diagram: JobTracker; TaskTracker 0 through TaskTracker 5]
1. Client submits "grep" job, indicating code and input files
2. JobTracker breaks input file into k chunks (in this case 6). Assigns work to ttrackers.
3. After map(), tasktrackers exchange map-output to build reduce() keyspace
4. JobTracker breaks reduce() keyspace into m chunks (in this case
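
The seek-time figure on the WebDB slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes exactly one disk seek per new page, which the slide implies but does not state, and the function name `seek_days` is made up for illustration.

```python
# Rough check of the WebDB slide: 19 ms seek time x 200M new pages per month.
SEEK_TIME_S = 0.019                 # 19 ms per disk seek (from the slide)
NEW_PAGES_PER_MONTH = 200_000_000   # 200M new pages per month (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60


def seek_days(pages: int, seek_time_s: float = SEEK_TIME_S,
              seeks_per_page: int = 1) -> float:
    """Days spent seeking, assuming `seeks_per_page` random seeks per page."""
    return pages * seeks_per_page * seek_time_s / SECONDS_PER_DAY


days = seek_days(NEW_PAGES_PER_MONTH)
print(f"{days:.1f} days of disk seeks per month")
```

At one seek per page, a single disk would spend more than a full month seeking for each month of new pages, which is why the slide stresses a design that minimizes seeks.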
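
The Fetcher slide's rule (k lists, all URLs for one domain in the same list, otherwise random) can be sketched by hashing the host name. This is an illustration only, not Nutch's actual code: using the URL host as the "domain", using SHA-256 as the hash, and the names `fetcher_for` and `partition_fetchlist` are all assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def fetcher_for(url: str, k: int) -> int:
    """Pick one of k fetch lists from the URL's host (assumed 'domain')."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # A stable hash keeps every URL of a host on the same list;
    # across hosts the assignment is effectively random.
    return int.from_bytes(digest[:8], "big") % k


def partition_fetchlist(urls, k):
    """Split the to-fetch list into k pieces, one per fetcher machine."""
    lists = [[] for _ in range(k)]
    for url in urls:
        lists[fetcher_for(url, k)].append(url)
    return lists


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/x",
    "http://c.example/y",
    "http://a.example/3",
]
fetchlists = partition_fetchlist(urls, 3)
```

Because only one fetcher ever sees a given host, that fetcher can rate-limit requests and cache DNS and robots.txt locally, with no coordination between fetcher machines.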
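
The four steps on the WebDB/Fetcher Updates slide amount to a sort-merge: sort the edits, then stream the old database and the edits side by side to emit a new database, with no random seeks. The sketch below is a guess at the shape of that loop. The record layout, the convention that `None` means "delete", and "last edit wins" are assumptions, and the in-memory `sorted` stands in for an external sort.

```python
def merge_edits(db_rows, edits):
    """Yield a new sorted database from sorted `db_rows` plus unsorted `edits`.

    db_rows: iterable of (url, record), sorted by url.
    edits:   list of (url, record); record None means delete (assumed).
    """
    # Step 2: sort edits by URL. sorted() is stable, so for one URL the
    # later edit stays later and wins below.
    edits = sorted(edits, key=lambda e: e[0])
    db_iter = iter(db_rows)
    row = next(db_iter, None)
    i = 0
    # Step 3: read both streams in parallel, emitting the new database.
    while row is not None or i < len(edits):
        if i >= len(edits) or (row is not None and row[0] < edits[i][0]):
            yield row                      # untouched row is copied through
            row = next(db_iter, None)
            continue
        url = edits[i][0]
        while i < len(edits) and edits[i][0] == url:
            new_record = edits[i][1]       # keep the last edit for this URL
            i += 1
        if row is not None and row[0] == url:
            row = next(db_iter, None)      # old version is superseded
        if new_record is not None:
            yield (url, new_record)


old_db = [("a", 1), ("c", 3), ("d", 4)]
fetcher_edits = [("d", None), ("b", 2), ("c", 30), ("b", 20)]
new_db = list(merge_edits(old_db, fetcher_edits))
```

Both inputs are read once from start to end, which matches the sequential access pattern the WebDB slide asks for.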
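
The NDFS Replication scenario (datanode 4 dies, blk 90 drops below k copies) can be walked through in a small model. The block placement is taken from the slide's diagram; everything else is assumed. In particular, the slide says choosing a replication target is tricky, and the "least-loaded live node, lowest id on ties" policy here is just a placeholder, as is the name `handle_datanode_failure`.

```python
def handle_datanode_failure(placement, dead, k):
    """Return (block, source, target) copy orders after `dead` stops
    sending heartbeats. `placement` maps datanode id -> set of block ids
    and is updated in place."""
    lost_blocks = placement.pop(dead, set())
    orders = []
    for block in sorted(lost_blocks):
        holders = sorted(n for n, blocks in placement.items() if block in blocks)
        # Re-replicate while a live copy exists and fewer than k remain.
        while holders and len(holders) < k:
            candidates = [n for n in placement if block not in placement[n]]
            if not candidates:
                break
            # Placeholder policy (assumption): fewest blocks, then lowest id.
            target = min(candidates, key=lambda n: (len(placement[n]), n))
            orders.append((block, holders[0], target))
            placement[target].add(block)
            holders.append(target)
    return orders


# Block placement as drawn on the slide.
placement = {
    0: {33, 95},
    1: {46, 95},
    2: {33, 104},
    3: {21, 33, 46},
    4: {90},
    5: {21, 90, 104},
}
orders = handle_datanode_failure(placement, dead=4, k=2)
print(orders)
```

With k = 2 this toy policy happens to reproduce the slide's outcome, a copy of blk 90 from datanode 5 to datanode 0. A real namenode would also need to weigh factors such as free space and network placement.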
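
The word-count and grep tasks from the Map/Reduce (2) and (3) slides can be run on a single machine with a tiny in-memory driver. This is a teaching sketch, not Hadoop's API: the `map_reduce` driver, the function names, and the sample inputs are made up, and the distribution, partitioning and retry described on the Map/Reduce (4) slide are left out.

```python
import re
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn on every (key, val), group emitted values by key,
    then run reduce_fn once per unique emitted key."""
    grouped = defaultdict(list)
    for key, val in inputs:
        for out_key, out_val in map_fn(key, val):
            grouped[out_key].append(out_val)
    results = []
    for out_key in sorted(grouped):
        results.extend(reduce_fn(out_key, grouped[out_key]))
    return results


def wc_map(url, contents):
    for w in contents.split():
        yield (w, "1")                    # emit (w, "1") for each word


def wc_reduce(word, counts):
    yield (word, sum(int(c) for c in counts))   # sum all "1"s


def make_grep_map(pattern):
    regex = re.compile(pattern)

    def grep_map(key, line):
        if regex.search(line):
            yield (line, "1")             # emit matching lines only
    return grep_map


def grep_reduce(line, counts):
    yield line                            # nothing to do; just emit the line


docs = [("http://a.example/", "to be or not to be"),
        ("http://b.example/", "to do")]
word_counts = map_reduce(docs, wc_map, wc_reduce)

lines = [(("http://a.example/", 0), "hadoop is a spinoff"),
         (("http://a.example/", 20), "nutch uses lucene"),
         (("http://b.example/", 0), "hadoop is a spinoff")]
matches = map_reduce(lines, make_grep_map(r"hadoop"), grep_reduce)
```

Because grep uses the matching line itself as the key, identical lines from different URLs collapse into a single output line, which follows from the slide's choice of emitting (line, "1").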
