Data Stream Clustering Algorithms and Their Applications

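Nanjing University of Posts and Telecommunications, Master's Thesis
Name: Yu Zhihu (余志虎)
Degree applied for: Master
Major: Computer Application Technology
Supervisor: Cheng Chunling (程春玲)
2011-03

摘要 (Abstract)

In recent years, the rapid development of network information technology has given rise to a new kind of data model: the data stream. Data streams commonly arise in dynamic environments such as user clicks on the web, network intrusion detection, real-time monitoring systems, and wireless sensor networks. Compared with traditional data sets, these massive data streams are fast, continuous, changing, and unbounded, which poses new requirements and challenges for data stream mining. Cluster analysis, an important topic in data mining, groups unlabeled data into classes according to specified attributes, and has recently been widely studied and highly valued. This thesis takes data stream clustering algorithms as its research subject and the detection of anomalous data points as its research goal. The main work covers the following three aspects:

(1) It summarizes the concepts and techniques related to the data stream model and its clustering, and describes the special requirements of data stream clustering and the current data stream clustering algorithms in China and abroad. It also explains the definition of anomaly detection, the existing methods, and the challenges currently faced.

(2) In high-speed networks, data streams are fast and bursty, which makes anomaly detection difficult. This thesis proposes a stream clustering algorithm based on the SSClu tree for anomaly detection in high-speed streams. The algorithm first introduces the SSClu tree, which maintains summary information about the data stream; then, to cope with the high speed of the stream, it adopts pre-aggregation and a caching mechanism. Pre-aggregation pre-clusters data stream objects before they are inserted into the SSClu tree for clustering, so as to handle the arrival of bursty high-speed data; the caching mechanism temporarily stores the data stream objects that cannot be processed in time when a high-speed stream arrives, which solves the problem that high-speed streams cannot be clustered in time. Simulation results show that the algorithm processes high-speed data streams in a timely manner with high clustering accuracy, which ensures accurate anomaly detection on high-speed streams.

(3) For outlier detection in wireless sensor networks (Wireless Sensor Network, WSN), and considering the distributed nature of the WSN environment and its energy constraints, this thesis proposes a Stream Cluster algorithm Based on Similarity Flocking model (SCBSF). The algorithm uses a flocking model that simulates collective motion to let the data organize itself into clusters; this self-organization is better suited to clustering batches of data points in a distributed environment. Flocking rules also allow clusters of arbitrary shape to be formed without the traditional two-phase clustering approach, which reduces the algorithm's computational and storage complexity. To address energy consumption in the WSN, the collecting nodes use initial clustering information to temporarily record the features of similar data as they are produced, which reduces data transmission and thus lowers communication energy consumption. Simulation results show that the algorithm not only detects outliers well but also reduces the energy consumed by data computation and transmission during clustering.

Keywords: data stream model, clustering algorithm, anomaly detection, high-speed stream, wireless sensor network

ABSTRACT

Recently, with the rapid development of information technology, a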
new data model called the data stream has appeared. It often arises in dynamic environments such as user clicks on the web, network intrusion detection, real-time monitoring systems, or wireless sensor networks. Compared with traditional data sets, these vast data streams are fast, continuous, variable, and unbounded, so data stream mining faces new demands and challenges. Cluster analysis is an important topic in data mining because it groups unlabeled data into classes according to specified attributes, and it has been widely studied and highly regarded in recent years. In this thesis, we study data stream clustering algorithms and anomaly detection. The main tasks are as follows:

(1) We summarize the data stream model and the related concepts of clustering, and describe the special requirements and algorithms of current data stream clustering; the definition of anomaly detection, the existing methods, and the current challenges are also presented.

(2) In high-speed networks, the high speed and burstiness of data streams make anomaly detection difficult. A stream clustering algorithm based on the SSClu tree is proposed for anomaly detection in high-speed flows. The algorithm first introduces an SSClu tree to maintain summary information about the data stream; then, to cope with the high speed of the stream, it uses pre-aggregation and a caching mechanism. Pre-aggregation pre-clusters data stream objects before they are inserted into the SSClu tree for clustering, in order to cope with high-speed data streams; the caching mechanism temporarily stores the data stream objects that cannot be processed immediately, in order to handle the arrival of bursts. The simulation indicates that the algorithm not only handles high-speed data streams in a timely manner but also achieves high clustering accuracy, which ensures the accuracy of anomaly detection.

(3) Taking into account the constraints of the distributed environment and energy consumption
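in wireless sensor networks (Wireless Sensor Network, WSN), a Stream Cluster algorithm Based on Similarity Flocking model (SCBSF) is proposed to solve outlier detection in wireless sensor networks. The algorithm uses a flocking model that simulates swarm activity so that the data self-organize into clusters, which makes the algorithm more suitable for batches of collected data in distributed environments; it also forms clusters of arbitrary shape through flocking rules without resorting to traditional two-stage clustering, which reduces the algorithm's computation and storage complexity. Taking energy consumption in the WSN into account, the collection nodes use the initial cluster information, which temporarily records the characteristics of similar data, to reduce data transmission and thus communication energy. Simulation shows that the algorithm not only detects outliers well but also reduces the energy consumed by data computation and transmission during clustering.

Key words: Data Stream Model, Clustering Algorithm, Anomaly Detection, High-speed Stream, Wireless Sensor Networks

Nanjing University of Posts and Telecommunications Declaration of Originality for Degree Theses

I declare that the thesis submitted is the result of research work I carried out personally under my supervisor's guidance. To the best of my knowledge, except where specifically marked and acknowledged in the text, the thesis contains no research results already published or written by others, nor material used to obtain a degree or certificate from Nanjing University of Posts and Telecommunications or any other educational institution. Any contribution to this research made by colleagues who worked with me has been clearly stated and acknowledged in the thesis.

Graduate student signature: _ Date: _

Nanjing University of Posts and Telecommunications Authorization for Use of Degree Theses

Nanjing University of Posts and Telecommunications, the Institute of Scientific and Technical Information of China, and the National Library have the right to retain photocopies and electronic copies of the thesis I have submitted, and may preserve the thesis by photocopying, reduced-size printing, or other means of reproduction. The content of the electronic copy is consistent with that of the paper copy. Except for confidential theses within their confidentiality period, the thesis may be consulted and borrowed, and these institutions may publish (including in print) the thesis's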
