Research on Mining Maximal Frequent Patterns over Data Streams


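Xi'an University of Science and Technology, Master's Thesis
Research on Mining Maximal Frequent Patterns over Data Streams
Name: Chen Yan
Degree applied for: Master
Specialty: Computer Application Technology
Supervisor: Yang Junrui

Thesis title: Research on Mining Maximal Frequent Patterns over Data Streams
Specialty: Computer Application Technology
Master's candidate: Chen Yan (signature)
Supervisor: Yang Junrui (signature)

Abstract

Association rule mining is an important research direction in data mining. Frequent pattern mining is a key technique and step in applications such as association rule mining and sequential pattern mining, and mining frequent patterns over data streams is currently one of its most active topics. Because mining frequent patterns is inherently computationally expensive, the problem of mining maximal frequent patterns was proposed to improve mining efficiency. The maximal frequent patterns implicitly determine all frequent patterns (every frequent pattern is a subset of some maximal frequent pattern), the set of maximal frequent patterns is smaller than the set of frequent patterns, and some data mining applications only need the maximal frequent patterns.

In practice, the flow rate of some data streams is not constant, so how to mine data streams with a non-constant flow rate is a problem worth studying. In addition, making the mining results better reflect new transactions while reducing the influence of earlier, older transactions on the overall result is also a topic of wide interest. Research on these problems is therefore of real significance. This thesis studies related problems in data stream mining, and its main contributions are as follows:

(1) A maximal frequent pattern mining algorithm for data streams, BFPM-Stream, is proposed. The algorithm uses a sliding window that combines transactions and time to cope with the uncertain flow rate of a data stream, and uses a bit-object data representation together with a bit frequent pattern tree, BFP-Tree, to store and process the data. Experimental results verify the effectiveness of BFPM-Stream.

(2) A maximal frequent pattern mining algorithm for data streams based on transaction decay, BFPMW-Stream, is proposed. The algorithm uses a transaction sliding window, together with the bit-object data representation, the bit frequent pattern tree BFP-Tree, and a pattern storage tree, P-Tree, to store and process the data, so that maximal frequent patterns are mined in a way that accounts for the different roles of new and old transactions in the stream. Experimental results verify the effectiveness of BFPMW-Stream.

Keywords: data mining; data stream; maximal frequent patterns; sliding window; transaction decay
Research type: theoretical research

Subject: Research on Mining Maximal Frequent Patterns over Data Streams
Specialty: Computer Application Technology
Name: Chen Yan (Signature)
Instructor: Yang Junrui (Signature)

ABSTRACT

Association rule mining is a very important problem in data mining. Mining frequent patterns plays a crucial role in association rule mining, sequential pattern mining, etc. Mining frequent patterns over data streams is a key issue in the study of frequent pattern mining. Because mining frequent patterns is time-consuming, mining maximal frequent patterns has been proposed to improve mining efficiency. The set of maximal frequent patterns implicitly contains all frequent patterns. The set of maximal frequent patterns is orders of magnitude smaller than the set of frequent patterns, and there are applications where the set of maximal frequent patterns is adequate. In some applications, because the speed of data streams is non-constant, how to mine frequent patterns in this kind of data stream is an issue worth studying. Making the mining results reflect new transactions more and old transactions less is also an interesting issue. Research on these issues is therefore significant. In this thesis, we have done some research on related problems of data stream mining, as follows:

(1) A new algorithm, BFPM-Stream, for mining maximal frequent patterns over data streams is proposed. It uses a transaction- and time-sensitive sliding window to resolve the unfavorable effects caused by the unsteady speed of data streams. In addition, bit objects for data representation and a bit frequent pattern tree, BFP-Tree, are used to store and handle the data. The experimental results verify the efficiency of BFPM-Stream.

(2) A new algorithm based on damped transactions in data streams, BFPMW-Stream, for mining maximal frequent patterns is proposed. The transaction sliding window is adopted in the algorithm. In addition, bit objects for data representation, the bit frequent pattern tree BFP-Tree, and a storing patterns tree, P-Tree, are used to store and handle the data. It can mine maximal frequent patterns in data streams while distinguishing the roles of new and old transactions. The experimental results verify the efficiency of BFPMW-Stream.

Keywords: Data mining; Data stream; Maximal frequent patterns; Sliding window; Damped transactions
Thesis: Theoretical Research

1 Introduction

With the widespread use of database technology and computer networks, the amount of data people hold has grown dramatically. On the one hand, huge volumes of business data have accumulated; on the other hand, the corresponding analysis methods lag further and further behind, and the conflict between rapidly growing data and lagging analysis methods has become increasingly pronounced. As a result, people find themselves "data rich but knowledge poor" [1].

Data mining [1,2] is also known as Knowledge Discovery in Databases (KDD). It helps people extract knowledge, regularities, or higher-level information of interest from the relevant data in databases, and in data warehouses in particular, and it also helps people analyze that data at different levels, so that the data in databases or data warehouses can be used more effectively. It can be used not only to describe how data developed in the past but also to predict future trends, and it is a product of combining many technologies, including artificial intelligence, neural networks, databases, prediction theory, machine learning, and statistics. Data mining is therefore becoming a new research field that is receiving growing attention. The emergence of data mining methods finally lets people recognize the true value of data, namely the information and knowledge contained in it. Data mining is currently one of the frontier research directions in databases and information-based decision making, and it has attracted wide attention from both academia and industry. Many universities and research institutions at home and abroad are working in this area and have produced a large body of results.

In a number of applications that have emerged in recent years, such as e-commerce, financial applications, telecommunication call records, manufacturing, network monitoring and traffic analysis, Web applications and Web click streams, energy consumption management, sensor network data analysis, and online analysis and dynamic tracking of stock market fluctuations, data arrive continuously as multiple, rapid, time-varying, possibly unpredictable and unbounded streams; that is, the data appear in the form of a stream [3]. These data are enormous in volume and grow faster than people imagine. A great deal of knowledge and useful information is contained in these data streams, and by extracting it we can improve system performance and obtain more value. For example, in some networks, traffic monitoring can reveal bottlenecks so that load balancing can be applied to improve network efficiency. Network monitoring can also detect anomalies on the network for intrusion detection. In financial applications, transaction data streams can be analyzed for fraud detection.

The characteristics of data streams mean that traditional data mining algorithms cannot be applied to them directly, because the volume of data is very large and the data keep arriving. Compared with the size of a data stream, storage such as memory or cache is extremely limited, so the data can only be scanned once, sequentially; a dynamic, incremental algorithm is therefore needed to process a data stream. Moreover, most applications require analysis and decisions to be made as the data arrive, which places higher demands on processing time. How to perform data mining over data streams brings new opportunities and challenges to researchers.

1.1 Data Mining

Since its inception data mining has been given many definitions, of which the most widely accepted is [4]: Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

1.1.1 Overview of Data Mining

The goal of data mining is to discover implicit, meaningful knowledge in databases. In general, according to the kind of patterns discovered, data mining can be divided into two classes: descriptive data mining and predictive data mining. Descriptive data mining characterizes the properties and features of the data. Predictive data mining makes inferences from the current data in order to make predictions. In addition, data mining can discover patterns at various levels of abstraction. These patterns give users domain knowledge from different perspectives and make it easier to focus the search on interesting patterns. Generally speaking, data mining functions can be grouped into roughly six kinds: concept description [1], association analysis [5,6], classification and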
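
The notions named in the abstract (a transaction sliding window, a bit-based representation of the data, maximal frequent patterns, and a reduced weight for older transactions) can be illustrated with a small Python sketch. This is not BFPM-Stream or BFPMW-Stream: it builds no BFP-Tree or P-Tree and enumerates candidate itemsets by brute force, which is only practical for tiny examples. The names `BitWindow`, `support`, `decayed_support`, `frequent_itemsets`, and `maximal_patterns` are hypothetical, and the exponential decay factor is an assumption, since the preview does not state the decay model used in the thesis.

```python
from collections import deque
from itertools import combinations


class BitWindow:
    """Sliding window over the most recent `size` transactions (hypothetical helper)."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # transactions in the window, oldest first

    def add(self, transaction):
        """Append a new transaction and evict the oldest one if the window is full."""
        self.window.append(frozenset(transaction))
        if len(self.window) > self.size:
            self.window.popleft()

    def item_bits(self):
        """Map each item to an int whose bit `slot` is set if slot contains the item."""
        bits = {}
        for slot, transaction in enumerate(self.window):
            for item in transaction:
                bits[item] = bits.get(item, 0) | (1 << slot)
        return bits


def _cover(itemset, bits, n_slots):
    """Bit mask of the window slots that contain every item of `itemset`."""
    mask = (1 << n_slots) - 1
    for item in itemset:
        mask &= bits.get(item, 0)
    return mask


def support(itemset, bits, n_slots):
    """Number of transactions in the window that contain `itemset`."""
    return bin(_cover(itemset, bits, n_slots)).count("1")


def decayed_support(itemset, bits, n_slots, decay):
    """Support where a transaction of age a counts decay**a (assumed decay model)."""
    mask = _cover(itemset, bits, n_slots)
    return sum(decay ** (n_slots - 1 - slot)
               for slot in range(n_slots) if (mask >> slot) & 1)


def frequent_itemsets(window, min_count):
    """Brute-force, level-wise enumeration of itemsets with support >= min_count."""
    bits = window.item_bits()
    n_slots = len(window.window)
    items = sorted(i for i in bits if support((i,), bits, n_slots) >= min_count)
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            count = support(combo, bits, n_slots)
            if count >= min_count:
                result[frozenset(combo)] = count
                found = True
        if not found:  # no frequent k-itemset, so no larger one can be frequent
            break
    return result


def maximal_patterns(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return {p for p in frequent if not any(p < q for q in frequent)}


# Toy stream: the window keeps only the last four transactions.
demo = BitWindow(4)
for t in ["abc", "ab", "ac", "bc", "abc"]:
    demo.add(t)
demo_frequent = frequent_itemsets(demo, min_count=2)
print(sorted("".join(sorted(p)) for p in maximal_patterns(demo_frequent)))
# prints ['ab', 'ac', 'bc']
```

In the toy run, six itemsets are frequent (a, b, c, ab, ac, bc), but the three maximal ones are enough to recover the rest as their subsets, which is the compression the abstract refers to. Note that the maximal patterns alone do not record the support of each subset.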
