A Summary of Using the Machine-Learning Tool WEKA: Algorithm Selection, Attribute Selection, and Parameter Optimization


I. Attribute Selection

1. Theory

See the following two articles: "A Survey of Feature Selection Algorithms in Data Mining and a WEKA-Based Performance Comparison" (Chen Lianglong) and "Research on Reduction Techniques and Attribute Selection in Data Mining" (Liu Hui).

2. Attribute selection in WEKA

2.1 Evaluation strategy (attribute evaluator)

Evaluators fall broadly into filter and wrapper methods: filter methods score individual attributes, while wrapper methods evaluate whole feature subsets. Wrapper methods include CfsSubsetEval; filter methods include CorrelationAttributeEval.

2.1.1 Wrapper methods

(1) CfsSubsetEval
Evaluates a subset by the predictive power of each individual feature together with the redundancy among them; subsets whose features are individually predictive but mutually uncorrelated score well.
"Evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred. For more information see: M. A. Hall (1998). Correlation-based Feature Subset Selection for Machine Learning. Hamilton, New Zealand."

(2) WrapperSubsetEval
Embeds a subsequent learning algorithm in the selection process: a feature subset is judged by the predictive performance that algorithm achieves with it, with little regard for the individual features. The best subset therefore need not consist of individually optimal features.
"Evaluates attribute sets by using a learning scheme. Cross-validation is used to estimate the accuracy of the learning scheme for a set of attributes. For more information see: Ron Kohavi, George H. John (1997). Wrappers for feature subset selection. Artificial Intelligence. 97(1-2):273-324."

2.1.2 Filter methods

With a filter evaluator, the search method must be Ranker.

(1) CorrelationAttributeEval
Selects attributes by the correlation between each individual attribute and the class.
"Evaluates the worth of an attribute by measuring the correlation (Pearson's) between it and the class. Nominal attributes are considered on a value by value basis by treating each value as an indicator. An overall correlation for a nominal attribute is arrived at via a weighted average."
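As an illustration of what a correlation-based filter does for numeric attributes, the sketch below ranks attributes by the absolute value of their Pearson correlation with a numeric class. The data and attribute names are invented for the example; WEKA computes this internally, so this is only a model of the idea.

```python
# Rank attributes by |Pearson correlation| with a numeric class.
# Toy data: the attribute names and values are made up for illustration.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

data = {
    "attr_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # strongly correlated with the class
    "attr_b": [2.0, 1.0, 4.0, 3.0, 5.0],  # weaker
    "attr_c": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakest
}
klass = [1.0, 2.0, 3.0, 4.0, 5.0]

ranked = sorted(data, key=lambda a: abs(pearson(data[a], klass)), reverse=True)
print(ranked)  # attr_a ranks first
```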

(2) GainRatioAttributeEval
Selects attributes by gain ratio with respect to the class.
"Evaluates the worth of an attribute by measuring the gain ratio with respect to the class. GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute)."

(3) InfoGainAttributeEval
Selects attributes by information gain.
"Evaluates the worth of an attribute by measuring the information gain with respect to the class. InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)."

(4) OneRAttributeEval
Evaluates attributes with the OneR classifier.
"Class for building and using a 1R classifier; in other words, uses the minimum-error attribute for prediction, discretizing numeric attributes. For more information, see: R.C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning. 11:63-91."
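The two formulas quoted above are easy to reproduce directly. The sketch below computes entropy, information gain, and gain ratio for a tiny made-up nominal dataset.

```python
# Compute H(Class), InfoGain(Class, Attribute), and GainR(Class, Attribute)
# for a toy nominal dataset (values and labels are invented).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    # InfoGain = H(Class) - H(Class | Attribute)
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def gain_ratio(attr_values, labels):
    # GainR = (H(Class) - H(Class | Attribute)) / H(Attribute)
    return info_gain(attr_values, labels) / entropy(attr_values)

attr = ["sunny", "sunny", "rain", "rain"]
cls  = ["no",    "no",    "yes",  "yes"]
print(info_gain(attr, cls))   # 1.0: the attribute fully determines the class
print(gain_ratio(attr, cls))  # also 1.0 here, since H(Attribute) is 1 bit
```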

(5) PrincipalComponents
Principal component analysis (PCA).
"Performs a principal components analysis and transformation of the data. Use in conjunction with a Ranker search. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data - default 0.95 (95%). Attribute noise can be filtered by transforming to the PC space, eliminating some of the worst eigenvectors, and then transforming back to the original space."

(6) ReliefFAttributeEval
Evaluates attributes by their ReliefF score.
"Evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different class. Can operate on both discrete and continuous class data. For more information see:
Kenji Kira, Larry A. Rendell: A Practical Approach to Feature Selection. In: Ninth International Workshop on Machine Learning, 249-256, 1992.
Igor Kononenko: Estimating Attributes: Analysis and Extensions of RELIEF. In: European Conference on Machine Learning, 171-182, 1994.
Marko Robnik-Sikonja, Igor Kononenko: An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997."
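ReliefF itself handles multiple classes, k nearest neighbours, and missing values; the sketch below implements only the basic Relief idea (one nearest hit, one nearest miss, numeric attributes in [0, 1]) on a made-up two-class dataset.

```python
# Basic Relief sketch: for each instance, find the nearest hit (same class)
# and nearest miss (different class), then reward attributes that differ on
# the miss but agree on the hit. Data is an invented two-class toy set.
def relief(instances, labels, n_attrs):
    weights = [0.0] * n_attrs
    for i, (inst, lab) in enumerate(zip(instances, labels)):
        def dist(j):  # squared Euclidean distance to instance j
            return sum((inst[a] - instances[j][a]) ** 2 for a in range(n_attrs))
        hits = [j for j in range(len(instances)) if j != i and labels[j] == lab]
        misses = [j for j in range(len(instances)) if labels[j] != lab]
        near_hit, near_miss = min(hits, key=dist), min(misses, key=dist)
        for a in range(n_attrs):
            weights[a] += (abs(inst[a] - instances[near_miss][a])
                           - abs(inst[a] - instances[near_hit][a]))
    return weights

instances = [(0.0, 0.1), (0.1, 0.9), (0.9, 0.2), (1.0, 0.8)]
labels = ["a", "a", "b", "b"]
weights = relief(instances, labels, 2)
print(weights)  # attribute 0 separates the classes; attribute 1 is noise
```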

(7) SymmetricalUncertAttributeEval
Evaluates attributes by their symmetrical uncertainty with respect to the class.
"Evaluates the worth of an attribute by measuring the symmetrical uncertainty with respect to the class. SymmU(Class, Attribute) = 2 * (H(Class) - H(Class | Attribute)) / (H(Class) + H(Attribute))."

2.2 Search strategy (Search Method)

2.2.1 Search methods paired with the wrapper evaluation strategies above

(1) BestFirst
Best-first search: a greedy search augmented with backtracking.
"Searches the space of attribute subsets by greedy hillclimbing augmented with a backtracking facility. Setting the number of consecutive non-improving nodes allowed controls the level of backtracking done. Best first may start with the empty set of attributes and search forward, or start with the full set of attributes and search backward, or start at any point and search in both directions (by considering all possible single attribute additions and deletions at a given point)."

(2) ExhaustiveSearch
Exhaustively enumerates all possible attribute subsets.
"Performs an exhaustive search through the space of attribute subsets starting from the empty set of attributes. Reports the best subset found."

(3) GeneticSearch
Search based on the simple genetic algorithm proposed by Goldberg in 1989.
"Performs a search using the simple genetic algorithm described in Goldberg (1989). For more information see: David E. Goldberg (1989). Genetic algorithms in search, optimization and machine learning. Addison-Wesley."

(4) GreedyStepwise
Single-step forward or backward search.
"Performs a greedy forward or backward search through the space of attribute subsets. May start with no/all attributes or from an arbitrary point in the space."
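A forward pass of this kind of stepwise search can be sketched as follows; the merit function and its scores are invented for illustration, standing in for a subset evaluator such as CfsSubsetEval.

```python
# GreedyStepwise-style forward search: starting from the empty set, keep
# adding the single attribute that most improves the subset's merit, and
# stop when no addition helps.
def greedy_forward(attributes, evaluate):
    selected, best = [], evaluate([])
    while True:
        gains = [(evaluate(selected + [a]), a)
                 for a in attributes if a not in selected]
        if not gains:
            break
        score, attr = max(gains)
        if score <= best:  # no single addition improves the subset
            break
        selected.append(attr)
        best = score
    return selected

# Hypothetical evaluator: merit is the sum of per-attribute scores, with a
# penalty for redundant pairs (so a2 adds less than it costs next to a1).
scores = {"a1": 0.6, "a2": 0.5, "a3": 0.1}
redundant = {("a1", "a2")}

def merit(subset):
    m = sum(scores[a] for a in subset)
    for x in subset:
        for y in subset:
            if (x, y) in redundant:
                m -= 0.55
    return m

print(greedy_forward(list(scores), merit))  # picks a1, then a3, skips a2
```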
