Introduction to Text Mining

Uploaded by: hs****ma | Document ID: 568678618 | Uploaded: 2024-07-26 | Format: PPT | Pages: 15 | Size: 887.50 KB

Still waters run deep. Where there is life, there is hope.

Outline
- Introduction
- TF-IDF
- Similarity

Introduction
- Why? Text mining, Web mining
- How? Classification or clustering; retrieval

General Process of Text Classification
- Preprocessing: represent the document collection in a form that is easy for a computer to process
- Feature representation, feature selection, and dimensionality reduction: express the importance of each term in a document using a suitable weighting method
- Learning and modeling: build the classifier

Text Classification Preprocessing
- Remove punctuation, extra whitespace, and (optionally) digits
- Normalize letter case
- Remove stop words: words that carry no real meaning, such as "and", "you", "have"
- Stemming (reduce words to a common root): PorterStemmer
- Word segmentation: English? Chinese?

Feature Representation
- Vector space model: terms serve as features and together form a high-dimensional feature vector
- Term weights are obtained with TF/IDF

TF-IDF
- TF (Term Frequency): how often a term occurs
- IDF (Inverse Document Frequency)
- Weight = TF * IDF

Similarity Applications
- Many Web-mining problems can be expressed as finding "similar" sets:
  - Plagiarism
  - Mirror pages
  - Articles from the same source
  - Duplicate removal
- Collaborative filtering as a similar-sets problem: recommend to users items that were liked by other users who have exhibited similar tastes

Measurement
- Edit distance: short text, words; for personal text
- Jaccard distance: long text, ignoring similarity between individual words; for government text

Real-world Data is Rather Dirty!
- Microsoft Academic Search (PK http:/): "Kenneth De Jong" vs. "Kenneth Dejong"
- DBLP Complete Search:
  - Typo in "author": "Argyrios Zymnis" vs. "Argyris Zymnis"
  - Typo in "title": "relaxed" vs. "related"
(Slide footer: 2024/7/26, Trie-Join, VLDB 2010)

Similarity Joins
- The similarity join is an essential operation for data integration and cleaning
- Perform a similarity join on the Name attribute (find all record pairs whose Name attributes are similar)

  Table R
  Id        Name              Univ.
  2037349   Kenneth De Jong   George
  3054641   Kenneth Dejong    George

- Output: (2037349, 3054641), ...

Near Duplicate Data
Copy 1 (03/11/2008 | 11:28 AM) and copy 2 (By JAY COHEN, AP Sports Writer, Mar 11, 4:23 am EDT) of the same article:
"On one end, a winded Pete Sampras tried to summon enough energy to give the New York fans another memorable win to talk about it on the subway ride home. On the other side, Roger Federer wore a sly grin like he knew age was about to catch up to the former world No. 1 - the man who owns the record of 14 Grand Slams he wants."

Similarity Join
- Tokenize: each record is a set of tokens from a finite universe
- Suppose each record is a single text document:
  x = "yes as soon as possible"
  y = "as soon as possible please"
- Map each word to a token; the second occurrence of "as" (as_1) gets its own token:

  word    yes   as   soon   as_1   possible   please
  token   A     B    C      D      E          F

  x = {A, B, C, D, E}
  y = {B, C, D, E, F}

References
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. Pass-Join: A Partition based Method for Similarity Joins. VLDB 2012.
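
The preprocessing and TF-IDF slides name the steps but give no formulas, so the following is only a minimal sketch of one common variant. It assumes TF is the term count divided by document length and IDF is the natural log of (number of documents / number of documents containing the term). The stop-word list, the function names `preprocess` and `tf_idf`, and the sample documents are all hypothetical, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"and", "you", "have", "the", "a", "is"}

def preprocess(text):
    """Lowercase, replace punctuation and digits with spaces, split, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [w for w in text.split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "Text mining is fun, and you have data.",
    "Web mining and text retrieval.",
    "Clustering 100 documents.",
]
weights = tf_idf(docs)
```

With this variant, a term that appears in every document gets weight 0, and a term confined to one document (such as "fun" above) outweighs a term shared by several (such as "text").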

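The two measures on the Measurement slide can be sketched as follows, using the slides' own examples. This is an illustrative assumption rather than the method of the cited papers: `to_token_set` is a hypothetical helper that numbers repeated words so a document becomes a set (as with "as" and "as_1" on the Tokenize slide), and `edit_distance` is the textbook dynamic-programming Levenshtein distance.

```python
from collections import Counter

def to_token_set(text):
    """Turn a text into a set of (word, occurrence index) tokens,
    so that the second "as" is a different token from the first."""
    seen = Counter()
    tokens = set()
    for word in text.split():
        tokens.add((word, seen[word]))
        seen[word] += 1
    return tokens

def jaccard_similarity(x, y):
    """|x intersect y| / |x union y|; Jaccard distance is 1 minus this value."""
    a, b = set(x), set(y)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

x = to_token_set("yes as soon as possible")
y = to_token_set("as soon as possible please")
similarity = jaccard_similarity(x, y)  # 4 shared tokens out of 6 distinct
name_distance = edit_distance("Kenneth De Jong", "Kenneth Dejong")
```

Here x and y share four of six distinct tokens, so their Jaccard similarity is 4/6, while the two spellings of the author name differ by two character edits (one deleted space, one changed letter).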