Mahout Download and Installation

1: Download
The download address is http:/ ; the archive is mahout-0.3.tar.gz (17-Mar-2010 02:12, 47M).

2: Extract the archive
tar -xvf mahout-0.3.tar.gz

3: Configure the environment
export HADOOP_HOME=/home/hadoopuser/hadoop-0.19.2
export HADOOP_CONF_DIR=/home/hadoopuser/hadoop-0.19.2/conf

4: Give it a try first
bin/mahout -help
This lists the many algorithms that are available.

5: Try k-means clustering
bin/mahout kmeans -input /user/hive/warehouse/tmp_data/complex.seq -clusters 5 -output /home/hadoopuser/1.txt
The parameters that kmeans needs can be listed with:
bin/mahout kmeans -help

The files Mahout processes must be in SequenceFile format, so plain text files have to be converted to SequenceFiles first. SequenceFile is a class in Hadoop that lets us write binary key/value pairs to a file; for a detailed introduction see the article written by eyjian at http:/ . (You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)
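
As a rough illustration of what a SequenceFile holds, here is a minimal Java sketch that writes Text key/value pairs, the kind of layout a directory-to-SequenceFile conversion produces (document id as key, document text as value). The output path and the sample strings are invented for illustration, and the sketch assumes the Hadoop client classes used by your cluster are on the classpath:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();      // picks up the HADOOP_CONF_DIR settings
        FileSystem fs = FileSystem.get(conf);          // HDFS if configured, otherwise the local FS
        Path path = new Path("/mahout/seq/demo.seq");  // hypothetical output location

        // Open a writer for Text key -> Text value records.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        try {
            // Each record is a binary key/value pair; here: document id -> document body.
            writer.append(new Text("doc-1"), new Text("first sample document"));
            writer.append(new Text("doc-2"), new Text("second sample document"));
        } finally {
            writer.close();
        }
    }
}

A file written this way can then be vectorized (see seq2sparse later in this document) and handed to the clustering jobs.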

The usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory -input <directory where the docs are located> -output <output directory> -c <charset: UTF-8|cp1252|ascii...> -chunk <chunk size, e.g. 64> -prefix <prefix to add to the document ids>
For example:
bin/mahout seqdirectory -input /hive/hadoopuser/ -output /mahout/seq/ -charset UTF-8

A simple example of running kmeans:

1: Put the sample data set into the designated location in HDFS; it should go under the testdata directory:
$HADOOP_HOME/bin/hadoop fs -put <path to the data> testdata
For example:
bin/hadoop fs -put /home/hadoopuser/mahout-0.3/test/synthetic_control.data /user/hadoopuser/testdata/

2: Run the kmeans algorithm (a conceptual sketch of the iteration it performs appears after these steps):
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

3: Run the canopy algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

4: Run the dirichlet algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

5: Run the meanshift algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

6: Have a look at the results (a programmatic way to read the output files is sketched after these steps):
bin/mahout vectordump -seqFile /user/hadoopuser/output/data/part-00000
This prints the results directly to the console.

Get the data out of HDFS and have a look. All the example jobs use testdata as input and write to the directory output. Use bin/hadoop fs -lsr output to view all outputs. KMeans output is placed into output/points; Canopy and MeanShift results are placed into output/clustered-points.
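
The kmeans job above runs as a series of Hadoop passes over the vectors, but the idea underneath is the ordinary k-means iteration: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat. The following self-contained Java sketch shows that idea on a few in-memory 2-D points; the points, k, and the iteration count are invented, and Mahout's distributed implementation differs in many details (convergence thresholds, pluggable distance measures, and so on):

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.2, 0.8}, {0.9, 1.1}, {8, 8}, {8.2, 7.9}, {7.8, 8.1} };
        int k = 2;
        int iterations = 10;

        // Initialize the centroids with the first k points (Mahout can use canopy centers instead).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[p], centroids[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each centroid to the mean of the points assigned to it.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < sums[c].length; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        System.out.println("centroids:   " + Arrays.deepToString(centroids));
        System.out.println("assignments: " + Arrays.toString(assignment));
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}

Because the output directories written by these jobs are SequenceFiles themselves, they can also be read directly instead of going through vectordump. The sketch below dumps any SequenceFile record by record using the key/value classes recorded in the file header; the part file name under output/points is an assumption (use whatever bin/hadoop fs -lsr output actually lists), and it assumes the key/value classes have the usual no-argument Writable constructors:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDumpSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("output/points/part-00000");  // assumed part file name

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Instantiate whatever key/value classes the file declares in its header.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}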

English reference link: http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html

TriJUG: Intro to Mahout Slides and Demo examples

First off, a big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night. Also a big thank you to Red Hat for providing a most excellent meeting space. Finally, thanks to Manning Publications for providing vouchers for Taming Text and Mahout in Action for the end-of-the-night raffle. Overall, I think it went well, but that's not for me to judge. There were a lot of good questions and a good-sized audience. The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)). For the "ugly demos", below is a history of the commands I ran for setup, etc.

Keep in mind that you can almost always run bin/mahout <command> -help to get syntax help for any given command.

Here's the preliminary setup stuff I did (the weighting and normalization used in steps 3 and 4 are sketched just after this list):

1. Get and preprocess the Reuters content per http:/ .
2. Create the sequence files:
bin/mahout seqdirectory --input /content/reuters/reuters-out --output /content/reuters/seqfiles --charset UTF-8
3. Convert the sequence files to sparse vectors, using the Euclidean norm and the TF weight (for LDA):
bin/mahout seq2sparse --input /content/reuters/seqfiles --output /content/reuters/seqfiles-TF --norm 2 --weight TF
4. Convert the sequence files to sparse vectors, using the Euclidean norm and the TF-IDF weight (for clustering):
bin/mahout seq2sparse --input /content/reuters/seqfiles --output /content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
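
To make those two flags a little more concrete: --weight chooses how a term's raw count in a document is turned into a weight (TF keeps the count; TFIDF also discounts terms that appear in many documents), and --norm 2 rescales each document vector to unit Euclidean length. The Java sketch below shows the standard textbook form of both operations on a toy term-count map; the corpus statistics are invented, and Mahout's exact IDF smoothing may differ in detail, so treat this as an illustration rather than the library's formula:

import java.util.HashMap;
import java.util.Map;

public class WeightAndNormSketch {
    public static void main(String[] args) {
        // Toy term counts for one document, plus invented corpus statistics.
        Map<String, Integer> termCounts = new HashMap<>();
        termCounts.put("mahout", 3);
        termCounts.put("hadoop", 1);
        Map<String, Integer> docFreq = new HashMap<>();  // number of documents containing each term
        docFreq.put("mahout", 5);
        docFreq.put("hadoop", 50);
        int numDocs = 100;

        // TF-IDF weighting, textbook form: weight = tf * log(N / df).
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            double tf = e.getValue();
            double idf = Math.log((double) numDocs / docFreq.get(e.getKey()));
            weights.put(e.getKey(), tf * idf);
        }

        // "norm 2": divide every weight by the vector's Euclidean (L2) length.
        double sumOfSquares = 0;
        for (double w : weights.values()) {
            sumOfSquares += w * w;
        }
        double length = Math.sqrt(sumOfSquares);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            e.setValue(e.getValue() / length);
        }

        System.out.println(weights);
    }
}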

For Latent Dirichlet Allocation I then ran:

1. ./mahout lda --input /content/reuters/seqfiles-TF/vectors/ --output /content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20
2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state-19 --dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile

For K-Means Clustering I ran:

1. ./mahout kmeans --input /content/
