
Maximal Frequent Itemset Mining Based on AFOPT-tree


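With the rapid development of the information industry, and of the Internet industry in particular, people's ability to acquire and store data keeps improving, and the data stored in databases is growing exponentially. Within this mass of data, however, knowledge that has real value for decision making is relatively scarce. Association rule mining is used to reveal the associations between different items or attributes in a data set and to find valuable relationships among multiple attributes. The maximal frequent itemsets implicitly contain all frequent itemsets and occupy less memory. Mining only the maximal frequent itemsets effectively reduces the number of recursive calls and the memory used, and some data mining applications need exactly the maximal frequent itemsets, so research on maximal frequent itemset mining is of considerable significance. On today's large-scale dense data sets, superset checking has gradually become the most time-consuming step of maximal frequent itemset mining algorithms and a bottleneck for improving their efficiency. In addition, most existing maximal frequent itemset mining algorithms traverse the search space tree with an FP-tree-based model, which is not efficient under a top-down traversal strategy. After reviewing a large body of related literature from China and abroad, this thesis addresses these two problems: it improves the projection-based superset checking algorithm, proposes A-MFI, a maximal frequent itemset mining algorithm based on the AFOPT-tree, and on that basis implements A-MFI as a distributed algorithm on the Hadoop platform. The main work of the thesis is as follows:

(1) It first introduces the theory, characteristics and mainstream algorithms of data mining, in particular association rule mining and maximal frequent itemset mining, and presents the relevant background on cloud computing and the Hadoop cloud platform.

(2) To address the inefficiency of the FP-tree used by existing maximal frequent itemset mining algorithms under a top-down traversal strategy, the thesis uses the AFOPT-tree model to build the search space tree. To make superset checking more efficient, it proposes an optimized projection-based superset checking method: the traditional MFI-tree is restructured using the AFOPT-tree model, the bottom-up traversal of the MFI-tree used by projection-based superset checking is changed to a top-down traversal, and a linked-list field connecting identical data itemsets is added to the MFI-tree to make lookahead pruning more efficient. Building on these improvements, the thesis proposes A-MFI, a maximal frequent itemset mining algorithm based on the AFOPT-tree, and runs experiments on different data sets, which verify its advantage over comparable algorithms in superset checking optimization and overall running efficiency.

(3) Because the running efficiency of single-machine maximal frequent itemset mining algorithms can improve only so far on today's large-scale data sets, the thesis, after an in-depth study of cloud computing and the Hadoop platform, reworks A-MFI into a distributed algorithm and implements distributed mining of maximal frequent itemsets. Experiments verify that, on large-scale dense data sets, the distributed method runs markedly more efficiently than the single-machine version.

(4) Finally, it summarizes the thesis, points out the shortcomings of the current research, and indicates directions for future work.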
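The ideas named in the summary (processing items in ascending frequency order, lookahead pruning and superset checking) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the A-MFI algorithm itself: it recounts support by scanning the transactions instead of building AFOPT-trees, it keeps the maximal itemsets found so far in a flat list instead of an MFI-tree, and the function names and toy transactions are invented for the example.

```python
from collections import Counter

def ascending_frequency_order(transactions, min_support):
    """Frequent items sorted by ascending support (ties broken by item name)."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [item for item, c in counts.items() if c >= min_support]
    return sorted(frequent, key=lambda item: (counts[item], item))

def mine_mfi(transactions, min_support):
    """Depth-first search for maximal frequent itemsets (illustrative only)."""
    tx = [frozenset(t) for t in transactions]
    order = ascending_frequency_order(tx, min_support)
    found = []  # assumption: a flat list stands in for the MFI-tree

    def support(itemset):
        # Assumption: support is recounted by a full scan, not read from a tree.
        return sum(1 for t in tx if itemset <= t)

    def has_superset(itemset):
        # Superset check: is the itemset contained in a known maximal itemset?
        return any(itemset <= m for m in found)

    def search(head, tail):
        # Lookahead pruning: if head plus every remaining item is already
        # covered by a known maximal itemset, nothing new lies below this node.
        if has_superset(head | frozenset(tail)):
            return
        extended = False
        for idx, item in enumerate(tail):
            candidate = head | {item}
            if support(candidate) >= min_support:
                extended = True
                search(candidate, tail[idx + 1:])
        # A node with no frequent extension is maximal unless an earlier
        # branch already produced a superset of it.
        if head and not extended and not has_superset(head):
            found.append(head)

    search(frozenset(), order)
    return found

# Hypothetical toy data set, not taken from the thesis.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"b", "d"},
    {"c", "e"},
]
mfis = mine_mfi(transactions, min_support=2)
```

In this toy run most branches end in the lookahead test rather than in a support count, which is the kind of saving the superset checking step is meant to provide.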

With the rapid development of the information industry, especially the Internet industry, people's ability to obtain and store data keeps improving, and the data stored in databases is growing exponentially. In this mass of data, however, knowledge with real value for decision making is relatively scarce. Association rule mining reveals the associations between different items or attributes in a data set and finds valuable relationships among multiple attributes. The maximal frequent itemsets implicitly contain all frequent itemsets, since every frequent itemset is a subset of some maximal one, and they take up less memory. Mining only the maximal frequent itemsets effectively reduces the number of recursive calls and the memory used, and some data mining applications need exactly the maximal frequent itemsets, so research on maximal frequent itemset mining is of real significance. On large-scale dense data sets, superset checking has gradually become the most time-consuming step of maximal frequent itemset mining algorithms and the bottleneck of their efficiency. Moreover, most existing maximal frequent itemset mining algorithms traverse the search space tree with an FP-tree-based model, which is inefficient under a top-down traversal strategy.
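As a small illustration of the claim that the maximal frequent itemsets contain all frequent itemsets, the following Python sketch enumerates the frequent itemsets of a toy data set by brute force and checks that each one is a subset of some maximal itemset. The transactions, the support threshold and the function names are invented for this example and do not come from the thesis. Brute-force enumeration is used only because the data set is tiny.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets with their supports."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                result[candidate] = support
    return result

def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}

# Hypothetical toy data set, not taken from the thesis.
transactions = [
    frozenset("abc"),
    frozenset("abc"),
    frozenset("abd"),
    frozenset("bd"),
    frozenset("ce"),
]
frequent = frequent_itemsets(transactions, min_support=2)
maximal = maximal_itemsets(frequent)

# Every frequent itemset is a subset of at least one maximal itemset.
covered = all(any(f <= m for m in maximal) for f in frequent)
```

Here nine frequent itemsets are summarized by two maximal ones, {a, b, c} and {b, d}. The supports of the individual subsets cannot be recovered from the maximal itemsets alone.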
To address these two problems, and after consulting a large number of relevant papers from China and abroad, this thesis improves the projection-based superset checking algorithm, proposes A-MFI, a maximal frequent itemset mining algorithm based on the AFOPT-tree, and on that basis implements A-MFI as a distributed algorithm on the Hadoop platform. The work is summarized as follows:

1. The thesis first introduces the theory, characteristics and mainstream algorithms of data mining, association rule mining and maximal frequent itemset mining, together with the relevant background on cloud computing and the Hadoop cloud computing platform.

2. Because the FP-tree used by existing maximal frequent itemset mining algorithms is inefficient under a top-down traversal strategy, the thesis adopts the AFOPT-tree model to construct the search space tree. To improve the efficiency of superset checking, it proposes an optimized projection-based superset checking method: the AFOPT-tree model is used to modify the traditional MFI-tree, the bottom-up traversal of the MFI-tree in projection-based superset checking is changed to a top-down traversal, and a linked-list field connecting identical data itemsets is added to the MFI-tree to make lookahead pruning more efficient. On the basis of these improvements, the thesis proposes the AFOPT-tree-based maximal frequent itemset mining algorithm A-MFI. Experiments on different data sets verify its advantage over similar algorithms in superset checking optimization and overall efficiency.

3. Because the efficiency of single-machine maximal frequent itemset mining algorithms can improve only so far on massive data sets, the thesis, after an in-depth study of cloud computing and the Hadoop platform, develops a distributed version of A-MFI and implements distributed mining of maximal frequent itemsets. Experiments verify that, on large-scale dense data sets, the distributed method runs markedly more efficiently than the single-machine version.

4. Finally, the thesis summarizes its content, points out the shortcomings of the current research, and indicates directions for future research.

Chapter 1 Introduction

1.1 Background of the Thesis

The 32nd survey report on Internet development, released by the China Internet Network Information Center (CNNIC) in July 2013, shows that by the end of June 2013 the number of Internet users in China had reached 591 million and the number of mobile Internet users had reached 464 million, while the usage rates of Internet applications such as instant messaging, search engines and online news rose steadily.

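The rapid development of the information industry, and of network technology in particular, has steadily improved people's ability to acquire and store data, and the data stored in databases keeps growing exponentially. Within this mass of data, however, knowledge that has real value for decision making is relatively scarce, for example data analysis, cluster analysis, and the relation between data and time or between data and other ...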
