A Fast kNN-Based MST Outlier Detection Method


ZHU Li  QIU Yuan-Yuan  YU Shuai  YUAN Sheng
School of Software Engineering, Xi'an Jiaotong University

Abstract: Outlier detection, also known as anomaly detection, is one of the significant and actively studied problems in data mining, with wide applications such as intrusion detection, fraud detection, and early signs of disease in medicine. kNN-based algorithms scale well to large data sets and are therefore widely used in distance-based and density-based outlier detection. However, the kNN algorithm's time complexity is O(N²), so its cost grows rapidly with the size of the data set. MST-based clustering algorithms that build the minimum spanning tree with Prim's or Kruskal's algorithm have O(N²) time and space complexity, their clustering results depend on the user's choice of parameters, and they easily miss local outliers inside dense clusters. To address these problems, this paper combines the strengths of density-based and clustering-based methods and proposes a new outlier detection method with the following advantages: (1) the time complexity of computing the k-nearest neighbors is O(kN) (k ≪ N); (2) the time complexity of building the minimum spanning tree is O(N log N); (3) the number of clusters is identified adaptively; (4) multiple types of outliers can be detected. Extensive experiments verify the effectiveness of the proposed KDNS, FkNN, and ADC algorithms. The results show that, compared with existing algorithms, they greatly reduce time complexity and significantly improve the outlier detection rate.

Keywords: well-separated pair; minimum spanning tree; k-nearest neighbor; adaptive clustering; outlier detection; data mining

About the authors: ZHU Li, male, born in 1968, Ph.D., associate professor; research interests: machine learning and digital media understanding. E-mail:. QIU Yuan-Yuan, female, born in 1988, M.S. candidate; research interests: data mining and machine learning. YU Shuai, male, born in 1991, M.S. candidate; research interests: recommendation algorithms and data mining. YUAN Sheng, male, born in 1976, Ph.D., lecturer; research interest: digital media understanding.

Received: 2016-07-11
Foundation item: National Natural Science Foundation of China (No. 61473220)

A Fast kNN-Based MST Outlier Detection Method
ZHU Li  QIU Yuan-Yuan  YU Shuai  YUAN Sheng
(School of Software Engineering, Xi'an Jiaotong University)

Abstract: Outlier detection, also known as anomaly detection, is a fundamental research task in the field of data mining. It is mainly used to find strange mechanisms or potential dangers, and aims at detecting

those few suspicious observations that deviate so much from the other observations as to arouse suspicion. Outliers, which are novel, unusual, and few, are often discarded as noise or abnormal data. Outliers can also be classified into many types, such as local and global outliers. Outlier detection techniques can be applied to many fields, such as intrusion behavior, fraud, and signs of early disease in medicine. Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. kNN-based algorithms can be used on big data sets efficiently, so they are widely applied to distance-based and density-based outlier detection. Unfortunately, the kNN-based algorithm's time complexity is O(N²), and its cost grows greatly with the size of the data set. The time and space complexity of minimum spanning tree-based clustering algorithms that use Prim's or Kruskal's method is O(N²), and the clustering result depends on parameters input by users. Moreover, such algorithms cannot detect local outliers in high-density clusters. Existing MST-based algorithms become ineffective when given unsuitable parameters or applied to data sets composed of clusters with diverse shapes, sizes, and densities. Meanwhile, most MST algorithms cannot build the tree dynamically, because they need to know the distance between every pair of points in advance. In order to address these challenging problems,

we propose a new outlier detection method, which combines the advantages of distance-based and density-based methods. Firstly, the algorithm builds a split tree to store proximity information about the data points. Secondly, we efficiently compute the well-separated pair decomposition of the whole data set. Thirdly, the algorithm partitions the input data set into several frames satisfying certain conditions, so that each point's k-nearest neighbors can be obtained quickly from the first two results. Fourthly, a minimum spanning tree is built dynamically from the third result. In addition, we rank suspected outliers by their outlier factors using MST-based clustering, without requiring the number of clusters as a manual input: a new algorithm and a new metric are proposed to select the exact number of clusters and avoid insignificant ones. Finally, all outliers are detected. The time complexities of computing the kNN and building the tree are O(kN) and O(N log N), respectively. The experiments show that the new algorithm can detect both local and global outliers without the user supplying the number of clusters. In the experiments, we use a series of real and synthetic data sets to verify the efficiency and effectiveness of the proposed KDNS, FkNN, and ADC algorithms. The experimental results show that, compared with previous approaches, the proposed algorithms can drastically reduce time complexity and significantly improve the outlier detection rate.

Keywords: well-separated pair; MST; kNN; auto clustering; outlier detection; data mining

Received: 2016-07-11

1 Introduction

Outlier detection is an important task in data mining. Its goal is to find, among data generated by different mechanisms, the small portion that differs markedly from the majority and is suspicious enough to warrant attention [1]. These novel, irregular, and few outliers are usually discarded as noise or anomalies, yet in many applications they may carry important hidden information, such as intrusion behavior, fraud, or early signs of disease in medicine. By their degree of deviation from the whole data set, outliers are divided into global and local outliers; by their scale, into outlier points and outlier clusters [2]. Current outlier detection techniques fall roughly into statistics-based, neighbor-based, classification-based, and clustering-based methods [3], where the neighbor-based methods comprise distance-based and density-based detection. Distance-based outlier detection was first proposed by Knorr and Ng [4-5]; it removes the restrictions that statistics-based methods place on the data's distribution model and parameters, and it applies to feature spaces of any dimensionality. The method uses each data point's distance to its neighboring points as the measure of outlierness: if an object t deviates from the distance mean u by at least a user-specified distance threshold, then t is considered an outlier. Later, Boriah et al. [6] proposed a kNN-based outlierness measure that can handle large data sets; its main drawback is that computing each point's neighbors requires scanning the whole data set once, so efficiency degrades badly as the data set grows. Ghoting et al. [7] made further improvements,
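To make the naive kNN-based baseline concrete — the O(N²) scoring whose cost the paper's FkNN is designed to avoid — here is a minimal sketch. The function name, the choice of averaging the k neighbor distances as the outlier score, and the toy data are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def knn_outlier_scores(X, k):
    """Naive O(N^2) kNN outlier scoring: each point's score is the
    average distance to its k nearest neighbors; larger means more
    outlying. Every point's neighbors require a full pass over the
    pairwise distance matrix, which is the bottleneck the paper targets."""
    X = np.asarray(X, dtype=float)
    # All pairwise Euclidean distances via broadcasting: O(N^2) time and space.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude each point's zero self-distance from its neighbor list.
    np.fill_diagonal(dist, np.inf)
    # Keep the k smallest distances per row and average them.
    knn_dist = np.sort(dist, axis=1)[:, :k]
    return knn_dist.mean(axis=1)

# A tight cluster near the origin plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), [[5.0, 5.0]]])
scores = knn_outlier_scores(X, k=3)
print(int(np.argmax(scores)))  # index 20: the appended far-away point
```

Because the score is purely distance-based, it flags global outliers well but, as the introduction notes for this family of methods, a fixed threshold on such scores struggles with local outliers sitting next to dense clusters.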
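The MST-based clustering baseline that the abstract contrasts against can likewise be sketched: build the minimum spanning tree, cut edges longer than a threshold, and treat the surviving components as clusters (singletons become outlier candidates). This sketch uses a naive O(N² log N) Kruskal over all pairwise edges rather than the paper's O(N log N) construction, and the `cut_threshold` parameter is exactly the kind of user-supplied input whose sensitivity the paper criticizes; all names here are illustrative, not from the paper.

```python
import numpy as np

class DSU:
    """Union-find with path halving, used by Kruskal's algorithm."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def mst_clusters(points, cut_threshold):
    """Kruskal's MST over all pairwise Euclidean edges, then drop MST
    edges longer than cut_threshold; each remaining connected component
    is one cluster, and the returned labels are component ids."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    edges = sorted(
        (float(np.linalg.norm(X[i] - X[j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    dsu, mst = DSU(n), []
    for w, i, j in edges:          # standard Kruskal: take edges in
        if dsu.union(i, j):        # increasing weight, skip cycles
            mst.append((w, i, j))
    dsu2 = DSU(n)
    for w, i, j in mst:            # keep only short MST edges
        if w <= cut_threshold:
            dsu2.union(i, j)
    return [dsu2.find(i) for i in range(n)]

labels = mst_clusters([[0, 0], [0, 1], [10, 10], [10, 11]], cut_threshold=2.0)
print(len(set(labels)))  # 2: the long cross-cluster MST edge (~13.5) is cut
```

With an unsuitable threshold the components merge or shatter, which is the parameter dependence, and the blindness to local outliers inside dense clusters, that motivates the paper's adaptive ADC clustering instead.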
