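University of Science and Technology of China
A Dissertation for Master's Degree

Research on Large-Scale Mining Algorithms of User Behaviors in Internet
(互联网中的海量用户行为挖掘算法研究)

Author's Name: Jin Zhou (周津)
Speciality: Signal and Information Processing
Supervisor: Prof. Nenghai Yu (俞能海)
Finished Time: 5th May, 2011

University of Science and Technology of China
Declaration of Originality

I declare that the dissertation submitted is the result of research I carried out under the guidance of my supervisor. Except where specifically marked and acknowledged, the dissertation contains no research results that have been published or written by others. The contributions that colleagues working with me made to this research are clearly stated in the dissertation.

Author's signature: _    Date: _

University of Science and Technology of China
Authorization for Use of the Dissertation

As one of the conditions for applying for the degree, the copyright holder of the dissertation authorizes the University of Science and Technology of China to hold partial rights of use of the dissertation. Specifically, the university has the right, under the relevant regulations, to submit paper and electronic copies of the dissertation to the relevant national departments or institutions; to allow the dissertation to be consulted and borrowed; to include the dissertation in the China Dissertation Full-text Database and other relevant databases for retrieval; and to preserve and compile the dissertation by photocopying, reduced-size printing, scanning, or other means of reproduction. The content of the electronic document I submit is consistent with that of the printed dissertation. A confidential dissertation is also subject to this provision once it is declassified.

Public / Confidential (_ years)

Author's signature: _    Supervisor's signature: _
Date: _    Date: _

Abstract (摘要)

With the rapid development of computer technology and the Internet, more and more user-centered applications have emerged on the Web. Over the years these applications have collected massive amounts of user behavior data, and the data are still growing exponentially; they contain a great deal of information about users. Discovering useful knowledge in this information promptly and accurately, and mining the user behavior patterns hidden behind the data, can help Internet applications provide a better user experience and improve an enterprise's market competitiveness. This dissertation uses data mining methods to analyze and mine user behaviors on the Internet and to find the regularities and patterns hidden in them. It addresses two aspects: the analysis of user tagging behavior in Web 2.0-based social tagging systems, and the analysis of user search behavior in Internet search engines. The main research work and contributions are as follows:

1. We survey and analyze user behavior in current Internet social tagging systems and search engines, and summarize the research topics and main results of Internet user behavior mining to date.

2. We propose a tag clustering algorithm based on an object feature vector representation. The algorithm takes full account of a tag's tagging information, uses a feature vector to characterize a tag precisely, obtains a fairly accurate tag similarity from the cosine similarity formula, and then clusters user tags with the K-Means algorithm. Experimental results show that the algorithm yields fairly accurate clusters of user tags and can effectively address the disorganized tags, ambiguous tag semantics, and imprecise information description found in social tagging systems. Finally, the algorithm is applied to the "Library Interactive System for Education and Research" (图书馆交互式科研管理平台) of the University of Science and Technology of China to demonstrate its practicality.

3. We propose a distributed K-Means clustering algorithm based on inverted-list lookup and MapReduce. The algorithm is intended to solve the problem of analyzing and mining massive user behavior in search engines. First, in view of the characteristics of user behavior in search engines, a tripartite graph model is used to model that behavior, and feature vectors are used to characterize the queries that users enter. The proposed algorithm is then used to cluster the massive set of user queries. Experiments show that the algorithm copes well with clustering massive numbers of user queries and performs efficiently on large-scale data sets. Finally, the characteristics of user behavior in current Internet search engines are analyzed from the clustering results, to help search engines provide a better search experience.

Keywords: feature vector, data mining, user behavior analysis, K-Means, distributed, MapReduce

ABSTRACT

With the rapid development of computer technology and the Internet, more and more user-based applications have appeared on the Web. These applications have collected massive amounts of user behavior data over several years, and the data are growing exponentially. The data contain a large amount of information about users. Finding useful knowledge in this information and uncovering the user behavior patterns behind the data can help Internet applications provide a better user experience and improve a company's market competitiveness. In this dissertation, we use data mining methods to analyze user behaviors on the Internet and to find the hidden regularities and patterns. We carry out our research in two aspects: the analysis of user tagging behavior in social tagging systems based on Web 2.0, and the analysis of user querying behavior in Internet search engines.

1. We investigate and analyze user behaviors in social tagging systems and search engines on the Internet, and summarize the research topics and main achievements of current Internet user behavior mining.

2. We propose a new tag clustering algorithm using an object-based feature vector. The algorithm takes full account of a tag's tagging information and uses an object-based feature vector to characterize a single tag. This feature vector represents a tag precisely and, with the cosine similarity formula, gives a more accurate similarity between two tags. The K-Means algorithm is then used to cluster users' tags. Experiments show that the proposed algorithm produces accurate clustering results and can address several problems in social tagging systems, such as disorganized tags, ambiguous tag semantics, and imprecise information description. Finally, we apply the algorithm to the "Library Interactive System for Education and Research" at our university to demonstrate its practicality.

3. We propose a new distributed K-Means clustering algorithm based on inverted-table queries and MapReduce. The algorithm is designed to solve the problem of mining user behaviors in search engines. First, based on the regularities of user behavior in search engines, we use a tripartite graph to model user behavior and a feature vector to characterize each user query. The proposed algorithm is then used to cluster the massive set of user queries. Experiments show that the algorithm handles the clustering of massive user queries well and performs efficiently on large-scale data sets. Finally, we analyze the characteristics of user behavior in current search engines based on the clustering results; the findings can help search engines improve the user querying experience.

Keywords: feature vector, data mining, user behavior analysis, K-Means, distributed system, MapReduce

Contents

Abstract (摘要)
ABSTRACT
Contents
Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Current State of Research
1.2.1 Current State of Research on Social Tagging Systems
1.2.2 Current State of Research on Search Engines
1.3 Work of This Dissertation
1.4 Organization of This Dissertation
Chapter 2 Background on User Behavior Mining and Distributed Computing
2.1 Fundamentals of Data Mining and Web Mining
2.1.1 Overview of Data Mining
2.1.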



