
Paper SD-09

KNN Classification and Regression using SAS®

Liang Xie, The Travelers Companies, Inc.

ABSTRACT

K-Nearest Neighbor (KNN) classification and regression are two widely used analytic methods in the predictive modeling and data mining fields. They provide a way to model highly nonlinear decision boundaries and to fulfill many other analytical tasks, such as missing value imputation and local smoothing. In this paper, we discuss ways in SAS® to conduct KNN classification and KNN regression. Specifically, PROC DISCRIM is used to build multi-class KNN classifiers and PROC KRIGE2D is used for KNN regression tasks. Technical details, such as tuning parameter selection, are discussed. We also discuss tips and tricks in using these two procedures for KNN classification and regression. Examples are presented to demonstrate the full process flow in applying KNN classification

and regression in real-world business projects.

INTRODUCTION

kNN stands for k Nearest Neighbor. In data mining and predictive modeling, it refers to a memory-based (or instance-based) algorithm for classification and regression problems. It is a widely used algorithm with many successful applications in medical research, business, etc. In fact, according to Google Analytics, it is the subject of the second most viewed article on my SAS programming blog of all time, with more than 2200 views a year. In classification problems, the label of a new object is determined by the labels of the closest training data points in the feature space. The determination works either through "majority voting" or "averaging". In "majority voting", the object is assigned the label that is most frequent among the k closest training examples. In "averaging", the object is not assigned a

label, but is instead assigned the ratio of each class among the k closest training data points. In regression problems, the property of the object is obtained via a similar "averaging" process, where the value of the object is the average value of the k closest training points. In practice, both the "majority voting" and "averaging" processes can be refined by adding weights to the k closest training points, where the weights are inversely proportional to the distance between the object and the training point. In this way, the closest points have the biggest influence on the final result. In fact, the SAS implementation of kNN classification weights the averaging process by volume size; interested readers are encouraged to read the manual for details.

CHARACTERISTICS OF KNN-BASED ALGORITHMS

The kNN algorithm has several characteristics worth addressing. Understanding its advantages and disadvantages, in theory and in practice, will help practitioners better leverage this tool. Generally, the kNN algorithm has 4 advantages worth noting.

1. It is simple to implement. The naive version of the algorithm is straightforward: for every data point in the test sample, directly compute the distances to all stored vectors and choose the k examples with the shortest distances. This is, however, computationally intensive, especially as the size of the training set grows. Over the years, many nearest neighbor search algorithms have been proposed, seeking to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes k-NN computationally tractable even for large data sets. In SAS, the k-d tree data structure and associated search algorithm of Friedman et al.

is implemented.

2. It is analytically tractable.

3. It is highly adaptive to local information. The kNN algorithm uses the closest data points for estimation; it is therefore able to take full advantage of local information and form highly nonlinear, highly adaptive decision boundaries for each data point.

4. It is easily implemented in parallel. Because the algorithm is instance based, each data point to be scored is checked against the training table for its k nearest neighbors. Since each data point is independent of the others, the search and scoring can be conducted in parallel.

On the other hand, the kNN algorithm has 3 disadvantages, which will also be addressed in detail in the section PRACTICAL ISSUES ON IMPLEMENTATION.

1. It is computationally very intensive, because for each record to be classified and/or scored, the program has to scan through the available data points to find the nearest neighbors and then determine which ones to use for further calculation.

2. It is demanding in storage, because all records in the training sample have to be stored and used whenever scoring is necessary.

3. It is highly susceptible to the curse of dimensionality: as the dimension increases, computing distances and finding a region that contains at least one relevant record become increasingly difficult.

KNN USING SAS

In order to conduct either KNN Classification or KNN
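The "majority voting", "averaging", and distance-weighted rules described in the INTRODUCTION can be sketched in a few lines. The paper performs these tasks with PROC DISCRIM and PROC KRIGE2D in SAS; the Python sketch below is illustrative only, uses a naive brute-force neighbor search, and the function names are hypothetical, not part of any SAS interface.

```python
from collections import Counter
import math

def k_nearest(train, query, k):
    """Naive neighbor search: measure the query against every stored
    point and keep the k smallest distances (O(n) per query)."""
    return sorted((math.dist(x, query), y) for x, y in train)[:k]

def knn_classify(train, query, k=3):
    """'Majority voting': the most frequent label among the k neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3, weighted=False):
    """'Averaging': mean response of the k neighbors, optionally
    weighted inversely by distance so closer points dominate."""
    neighbors = k_nearest(train, query, k)
    if weighted:
        w = [1.0 / (d + 1e-9) for d, _ in neighbors]
        return sum(wi * y for wi, (_, y) in zip(w, neighbors)) / sum(w)
    return sum(y for _, y in neighbors) / k
```

For example, with training points labeled "a" clustered near the origin and "b" far away, a query near the origin is voted "a"; with `weighted=True`, the regression estimate is pulled toward the response of the single closest point.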

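The k-d tree of Friedman et al. mentioned under advantage 1 reduces the number of distance evaluations relative to a brute-force scan by pruning subtrees whose splitting plane lies farther away than the best match found so far. SAS's internal implementation is not exposed, so the following minimal Python sketch (build plus single-nearest-neighbor query) only illustrates the general idea, not SAS's actual code.

```python
import math
from collections import namedtuple

Node = namedtuple("Node", "point left right axis")

def build_kdtree(points, depth=0):
    """Recursively split on the median, cycling through coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid],
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1),
                axis)

def nearest(node, query, best=None):
    """Return (distance, point) of the nearest stored point to query."""
    if node is None:
        return best
    d = math.dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Descend the far side only if the splitting plane is closer than
    # the best distance found so far -- this pruning is what saves work.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best
```

The query visits only the branches that could still contain a closer point, which is what makes k-NN tractable on large training tables.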