Hadoop: Introduction, Installation and Configuration
Data Mining Group, Xiamen University

Hadoop: A Distributed, Data-Intensive Programming Framework
HDFS: distributed storage
MapReduce: parallel computing

Introduction to HDFS
Hadoop Distributed File System (HDFS)
An open-source implementation of GFS. It has many similarities with other distributed file systems, but it also differs from them:
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.

How does it work?
Features
An important feature of the design: data is never moved through the NameNode. Instead, all data transfer occurs directly between clients and DataNodes.

MapReduce?
Let's talk about it next time.

"Running Hadoop"? What does that mean?
"Running Hadoop" means running a set of daemons:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker

Who works for whom?
[Figure: diagram with the labels Hadoop, HDFS, MapReduce, NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker]

NameNode
Hadoop employs a master/slave architecture for both distributed storage and distributed computation.
The NameNode is the master of HDFS; it directs the slave DataNode daemons to perform the low-level I/O tasks.
The NameNode is the bookkeeper of HDFS:
  it keeps track of how your files are broken down into file blocks
  it keeps track of the overall health of the distributed filesystem

DataNode
Reads and writes HDFS blocks for clients.
Communicates with other DataNodes to replicate its data blocks for redundancy.
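
Since "running Hadoop" means running the five daemons named above, one rough way to check a node is to compare the Java processes it is running against that list. The sketch below assumes the JDK's jps tool is on the PATH and that it prints each daemon under its short class name (NameNode, DataNode, and so on). The helper name missing_daemons is made up for illustration and is not part of Hadoop.

```shell
#!/bin/sh
# Daemons a single pseudo-distributed node is expected to run
# (names taken from the daemon list in the slides).
EXPECTED="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"

# missing_daemons (hypothetical helper): reads jps-style output
# ("<pid> <ClassName>" per line) on stdin and prints every expected
# daemon that does not appear in it.
missing_daemons() {
    running=$(awk '{print $2}')
    for d in $EXPECTED; do
        # -x matches whole lines, so NameNode is not satisfied
        # by a SecondaryNameNode entry.
        printf '%s\n' "$running" | grep -qx "$d" || echo "$d"
    done
}

# Usage on a live node (assumes jps is installed):
#   jps | missing_daemons
```

An empty result suggests all five daemons are up. On a fully distributed cluster the expected list would differ per machine, so EXPECTED would need adjusting for masters and slaves.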

NameNode and DataNode

Secondary NameNode
The SNN is an assistant daemon for monitoring the state of the cluster's HDFS.
It differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS.
It communicates with the NameNode to take snapshots of the HDFS metadata.
Recovery: what if the NameNode fails? We reconfigure the cluster to use the SNN as the primary NameNode.

JobTracker
The liaison between your application and Hadoop.
Once you submit your code to your cluster, the JobTracker determines the execution plan:
  it determines which files to process
  it assigns nodes to different tasks
  it monitors all tasks as they're running
What if a task fails? The JobTracker will relaunch the task on a different node.

TaskTracker
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.

JobTracker and TaskTracker

Installation and Configuration
Pseudo-distributed mode: all daemons run on one machine.
Fully distributed mode
What's different?

Installation for Pseudo-distributed Mode
Prerequisites:
Ubuntu Linux
Hadoop 0.20.2
Sun Java 6
$ sudo add-apt-repository "deb http:/ lucid partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk

Configuring SSH
Hadoop requires SSH access to manage its nodes, remote machines plus your local machine if yo