Research on Negative Association Rule Mining Based on Negative Frequent Itemsets


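Shandong Institute of Light Industry (山东轻工业学院) Master's Thesis

Research on Negative Association Rule Mining Based on Negative Frequent Itemsets

Name: Ma Liang (马亮)
Degree applied for: Master's
Major: Computer Application Technology
Advisor: Dong Xiangjun (董祥军)
2010-06-10

Abstract (摘要)

In recent years, with the spread of microcomputers and networks and advances in data storage technology, databases in many fields have accumulated massive amounts of data. Using data mining tools to analyze and better understand this stored data, and to discover the useful knowledge behind it, has become one of the most active research areas in computing. Association rule mining is an important branch of this area, with considerable value and broad application prospects. Association rules are either positive or negative. Positive association rules have received considerable attention from researchers, while negative rules containing negative items remain under-studied. Yet in many research fields the negated factors of things are also an important source of information, so studying associations among negative attributes is necessary for more objective decision-making.

Building on positive association rules and a revised definition of negative association rules, this thesis proposes association rules whose left side, right side, or both sides contain a mix of positive and negative items. Existing algorithms for negative association rules are few, and most follow the Apriori approach, which requires multiple scans of the stored dataset and generates large numbers of candidate itemsets. This thesis proposes a new algorithm, e-NFIS, for mining negative frequent itemsets from positive frequent itemsets. To obtain the positive frequent itemsets, it borrows the FP_growth algorithm and its compressed frequent-pattern tree data structure, and then uses a formula based on the inclusion-exclusion principle to compute the frequent itemsets that contain negative items. By design, the algorithm avoids multiple database scans and the generation of large numbers of candidate itemsets, which gives it some advantage in time and space cost over most current data mining algorithms. Experiments show that the algorithm is efficient.

In addition, the thesis examines problems in existing algorithms for negative association rules with mixed positive and negative items. Based on an analysis of current algorithms, it proposes how to filter out the contradictory rules that arise among negative association rules with mixed positive and negative items on one or both sides of the association pattern, and proposes a method for effectively selecting association rules in the positively correlated case. The thesis also discusses the contradictoriness of negative association rules containing negative items. Examples show that the proposed improvements are correct and effective.

Keywords (关键词): positive frequent itemsets, negative frequent itemsets, association rules

ABSTRACT

In recent years, with the popularization of microcomputers and the Internet and the development of data storage technology, databases in many areas have accumulated massive amounts of data. Using data mining tools to analyze and further understand the stored data, and to find the useful knowledge behind it, has become one of the most active research fields in computing. Association rule mining is an important branch of this field, with significant value and wide application prospects. Association rules describe correlations between itemsets, and they include positive association rules and negative association rules. At present, positive association rules have received widespread attention, whereas negative association rules that include positive and negative items at the same time have not received enough attention. However, in many applications negative factors are also an important source of information, so it is necessary to study the relationships between negative attributes.

Based on traditional association rules and progress on negative association rules, we put forward a definition of mixed patterns that contain positive and negative items on the right side, the left side, or both sides of the association pattern. Existing algorithms for mining negative association rules are few, and they are essentially based on the Apriori approach; moreover, these methods generally need multiple scans of the database and generate many candidate itemsets. This thesis proposes a new algorithm, e-NFIS, for mining negative frequent itemsets from positive frequent itemsets. To obtain the positive frequent itemsets, the algorithm uses FP_growth and the compressed FP-tree data structure, and then computes the negative frequent itemsets with a formula based on the inclusion-exclusion principle. By design, it avoids multiple database scans and the generation of large numbers of candidate itemsets, so its cost in time and space has certain advantages over most current data mining algorithms. Experiments show that the algorithm is efficient.

In addition, this thesis discusses problems in existing research on association patterns with positive and negative items on one side or both sides. On the basis of current methods, we make a further analysis. To deal with contradictory association rules, we put forward filtering methods for effectively choosing association rules, and we discuss the contradictory rules themselves. Finally, examples show that the proposed improvements are correct and effective.

Keywords: positive frequent itemsets; negative frequent itemsets; association rules

Declaration of Originality

I declare that the thesis submitted is the result of research I completed independently under my advisor's guidance. Results of others cited in the text have all been clearly marked or used with permission. The thesis contains no research results of any form that legally belong to others, nor any thesis or results that I have used to apply for another degree. Any contribution to this research by colleagues who worked with me has been clearly stated and acknowledged in the thesis.

Author's signature:          Date:

Declaration of Intellectual Property Ownership

The intellectual property of the thesis and related service works that I completed under my advisor's guidance belongs to Shandong Institute of Light Industry. The Institute has the right to publish, reproduce, make available for public reading, lend, and apply for patents on this work in any manner. I agree that the school may retain photocopies and electronic versions of the thesis and submit them to the relevant national departments or institutions. After I leave the school, whenever I publish or use the thesis, or academic papers or results directly related to it, the affiliated institution shall remain Shandong Institute of Light Industry.

Author's signature:          Date:
Advisor's signature:          Date: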
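The abstract says that supports of itemsets containing negative items are computed from positive itemsets with an inclusion-exclusion formula. The following is a minimal sketch of that general idea only, not the thesis's e-NFIS implementation. It assumes the standard identity supp(P ∪ ¬N) = Σ over S ⊆ N of (-1)^|S| · supp(P ∪ S); the function names and the toy transactions are hypothetical, and positive supports are counted directly here rather than read from an FP-tree.

```python
from itertools import combinations


def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t)


def mixed_support_count(transactions, positive, negative):
    """Count transactions containing all `positive` items and no `negative` item.

    Uses inclusion-exclusion over subsets S of `negative`:
        sum over S of (-1)^|S| * support_count(positive ∪ S)
    so only supports of purely positive itemsets are ever looked up.
    """
    positive = frozenset(positive)
    negative = sorted(negative)
    total = 0
    for k in range(len(negative) + 1):
        sign = -1 if k % 2 else 1
        for subset in combinations(negative, k):
            total += sign * support_count(transactions, positive | frozenset(subset))
    return total


def direct_count(transactions, positive, negative):
    """Reference count by direct scan, used only to check the formula."""
    positive, negative = frozenset(positive), frozenset(negative)
    return sum(1 for t in transactions if positive <= t and not (negative & t))


# Hypothetical toy database of five transactions.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a"},
)]

# Transactions with a but not b: supp(a) - supp(ab) = 4 - 2
print(mixed_support_count(transactions, {"a"}, {"b"}))       # 2
# Transactions with a but neither b nor c: 4 - 2 - 2 + 1
print(mixed_support_count(transactions, {"a"}, {"b", "c"}))  # 1
```

The number of terms doubles with each negative item, so a sketch like this is only practical when an itemset contains few negative items.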

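Chapter 1 Introduction

1.1 Background of Data Mining

As every sector of society develops, the volume of data that people need to record keeps growing. Fields that were once poorly understood are now understood far better, and research correspondingly demands more and more data. People used to describe books as being as vast as the sea, but scientific research has now truly entered an age of data explosion. Archaeology, for example, was once thought not to need large amounts of data; today archaeologists collect data of many kinds, enter it into computers, and use computer simulation to infer how things originally looked in history. High-energy physics requires recording enormous volumes of experimental data.

It is well known that retail business is affected by the weather, as are agriculture, navigation, aviation, and spaceflight, so research in atmospheric physics is clearly important. Yet to determine how the weather is changing, the linear systems built on the mathematical models alone can number in the thousands or more, and that covers only a single, very brief instant. Long-range weather forecasting requires processing far more data still. A typical example is the familiar case of genome sequencing, where the data is so voluminous that networked computer clusters in many countries around the world are needed to process it; the amount of data involved is staggering.

From this tip of the iceberg of modern scientific research, we can see that the data we accumulate over time will eventually "drown" us. Yet when we face the greenhouse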
