Mahout Download and Installation

1: Download
The download address is http:/ ; the archive is mahout-0.3.tar.gz (17-Mar-2010 02:12, 47M).

2: Extract the archive
tar -xvf mahout-0.3.tar.gz

3: Configure the environment
export HADOOP_HOME=/home/hadoopuser/hadoop-0.19.2
export HADOOP_CONF_DIR=/home/hadoopuser/hadoop-0.19.2/conf

4: Give it a try first
bin/mahout -help
This lists the many algorithms that are available.

5: Try k-means clustering
bin/mahout kmeans -input /user/hive/warehouse/tmp_data/complex.seq -clusters 5 -output /home/hadoopuser/1.txt
The parameters that kmeans needs can be listed with:
bin/mahout kmeans -help

The files Mahout processes must be in SequenceFile format, so plain text files have to be converted to SequenceFiles first. SequenceFile is a class in Hadoop that lets us write binary key/value pairs to a file; for a detailed introduction see the article written by eyjian at http:/ . (You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)
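
As a rough illustration of what a SequenceFile holds, here is a minimal Java sketch that writes Text key/value pairs, the kind of layout a directory-to-SequenceFile conversion produces (document id as key, document text as value). The output path and the sample strings are invented for illustration, and the sketch assumes the Hadoop client classes used by your cluster are on the classpath:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();      // picks up the HADOOP_CONF_DIR settings
        FileSystem fs = FileSystem.get(conf);          // HDFS if configured, otherwise the local FS
        Path path = new Path("/mahout/seq/demo.seq");  // hypothetical output location

        // Open a writer for Text key -> Text value records.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        try {
            // Each record is a binary key/value pair; here: document id -> document body.
            writer.append(new Text("doc-1"), new Text("first sample document"));
            writer.append(new Text("doc-2"), new Text("second sample document"));
        } finally {
            writer.close();
        }
    }
}

A file written this way can then be vectorized (see seq2sparse later in this document) and handed to the clustering jobs.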

The usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory -input <directory where the docs are located> -output <output directory> -c <charset: UTF-8|cp1252|ascii...> -chunk <chunk size, e.g. 64> -prefix <prefix to add to the document ids>
For example:
bin/mahout seqdirectory -input /hive/hadoopuser/ -output /mahout/seq/ -charset UTF-8

A simple example of running kmeans:

1: Put the sample data set into the designated location in HDFS; it should go under the testdata directory:
$HADOOP_HOME/bin/hadoop fs -put <path to the data> testdata
For example:
bin/hadoop fs -put /home/hadoopuser/mahout-0.3/test/synthetic_control.data /user/hadoopuser/testdata/

2: Run the kmeans algorithm (a conceptual sketch of the iteration it performs appears after these steps):
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

3: Run the canopy algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

4: Run the dirichlet algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

5: Run the meanshift algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

6: Have a look at the results (a programmatic way to read the output files is sketched after these steps):
bin/mahout vectordump -seqFile /user/hadoopuser/output/data/part-00000
This prints the results directly to the console.

Get the data out of HDFS and have a look. All the example jobs use testdata as input and write to the directory output. Use bin/hadoop fs -lsr output to view all outputs. KMeans output is placed into output/points; Canopy and MeanShift results are placed into output/clustered-points.
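
The kmeans job above runs as a series of Hadoop passes over the vectors, but the idea underneath is the ordinary k-means iteration: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat. The following self-contained Java sketch shows that idea on a few in-memory 2-D points; the points, k, and the iteration count are invented, and Mahout's distributed implementation differs in many details (convergence thresholds, pluggable distance measures, and so on):

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.2, 0.8}, {0.9, 1.1}, {8, 8}, {8.2, 7.9}, {7.8, 8.1} };
        int k = 2;
        int iterations = 10;

        // Initialize the centroids with the first k points (Mahout can use canopy centers instead).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[p], centroids[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each centroid to the mean of the points assigned to it.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < sums[c].length; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        System.out.println("centroids:   " + Arrays.deepToString(centroids));
        System.out.println("assignments: " + Arrays.toString(assignment));
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}

Because the output directories written by these jobs are SequenceFiles themselves, they can also be read directly instead of going through vectordump. The sketch below dumps any SequenceFile record by record using the key/value classes recorded in the file header; the part file name under output/points is an assumption (use whatever bin/hadoop fs -lsr output actually lists), and it assumes the key/value classes have the usual no-argument Writable constructors:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDumpSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("output/points/part-00000");  // assumed part file name

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Instantiate whatever key/value classes the file declares in its header.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}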

English reference link: http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html

TriJUG: Intro to Mahout Slides and Demo examples

First off, a big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night. Also a big thank you to Red Hat for providing a most excellent meeting space. Finally, thanks to Manning Publications for providing vouchers for Taming Text and Mahout in Action for the end-of-the-night raffle. Overall, I think it went well, but that's not for me to judge. There were a lot of good questions and a good-sized audience. The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)). For the "ugly demos", below is a history of the commands I ran for setup, etc.

Keep in mind that you can almost always run bin/mahout <command> -help to get syntax help for any given command.

Here's the preliminary setup stuff I did (the weighting and normalization used in steps 3 and 4 are sketched just after this list):

1. Get and preprocess the Reuters content per http:/ .
2. Create the sequence files:
bin/mahout seqdirectory --input /content/reuters/reuters-out --output /content/reuters/seqfiles --charset UTF-8
3. Convert the sequence files to sparse vectors, using the Euclidean norm and the TF weight (for LDA):
bin/mahout seq2sparse --input /content/reuters/seqfiles --output /content/reuters/seqfiles-TF --norm 2 --weight TF
4. Convert the sequence files to sparse vectors, using the Euclidean norm and the TF-IDF weight (for clustering):
bin/mahout seq2sparse --input /content/reuters/seqfiles --output /content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
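
To make those two flags a little more concrete: --weight chooses how a term's raw count in a document is turned into a weight (TF keeps the count; TFIDF also discounts terms that appear in many documents), and --norm 2 rescales each document vector to unit Euclidean length. The Java sketch below shows the standard textbook form of both operations on a toy term-count map; the corpus statistics are invented, and Mahout's exact IDF smoothing may differ in detail, so treat this as an illustration rather than the library's formula:

import java.util.HashMap;
import java.util.Map;

public class WeightAndNormSketch {
    public static void main(String[] args) {
        // Toy term counts for one document, plus invented corpus statistics.
        Map<String, Integer> termCounts = new HashMap<>();
        termCounts.put("mahout", 3);
        termCounts.put("hadoop", 1);
        Map<String, Integer> docFreq = new HashMap<>();  // number of documents containing each term
        docFreq.put("mahout", 5);
        docFreq.put("hadoop", 50);
        int numDocs = 100;

        // TF-IDF weighting, textbook form: weight = tf * log(N / df).
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            double tf = e.getValue();
            double idf = Math.log((double) numDocs / docFreq.get(e.getKey()));
            weights.put(e.getKey(), tf * idf);
        }

        // "norm 2": divide every weight by the vector's Euclidean (L2) length.
        double sumOfSquares = 0;
        for (double w : weights.values()) {
            sumOfSquares += w * w;
        }
        double length = Math.sqrt(sumOfSquares);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            e.setValue(e.getValue() / length);
        }

        System.out.println(weights);
    }
}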

For Latent Dirichlet Allocation I then ran:

1. ./mahout lda --input /content/reuters/seqfiles-TF/vectors/ --output /content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20
2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state-19 --dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile

For K-Means Clustering I ran:

1. ./mahout kmeans --input /content/
