Data Mining, Chapter 8: Standards and Specifications, Tools, and Development Trends

Chapter contents
8.1 Data mining standards and specifications
8.2 Data mining tools
8.3 Research trends in data mining

Basic requirement: understand the standards and specifications relevant to applied data mining, and the directions of future research.

8.1 Data Mining Standards and Specifications

A data mining process model is key to keeping data mining work on track. Typical process models include:
- SPSS's 5A model: Assess, Access, Analyze, Act, Automate
- SAS's SEMMA model: Sample, Explore, Modify, Model, Assess
- CRISP-DM (Cross Industry Standard Process for Data Mining)
- The Two Crows data mining process model, which has much in common with CRISP-DM, still being established at the time

Data mining standards: CRISP-DM

CRISP-DM (Cross Industry Standard Process for Data Mining): SPSS, NCR, and DaimlerChrysler, three companies with extensive experience in data mining, founded a consortium to establish a standard for data mining methods and processes.

CRISP-DM: phases and their deliverables
- Project Objectives (Business Understanding): background; requirements, assumptions, constraints; terminology; data mining goals and success criteria; project plan
- Data Understanding: initial data collection report; data description report; data exploration report; data quality report
- Data Preparation: data description report; data pre-processing steps
- Modeling: modeling assumptions; test design; model description; model assessment (including validation)
- Evaluation: assessment of data mining results with respect to objectives
- Reporting (Deployment): final report (summary, objectives, data mining process, data mining results, data mining assessment, conclusions, future work)

CRISP-DM is a widely accepted process model for data mining. It provides a framework for describing the modeling process in detail and is regarded as "best practice".

Business Understanding Phase
- Understand the business objectives
  - What is the status quo?
    - Understand business processes
    - Associated costs/pain
  - Define the success criteria
  - Develop a glossary of terms: speak the language
  - Cost/benefit analysis
- Current systems assessment
  - Identify the key actors (minimum: the sponsor and the key user)
  - What forms should the output take?
  - Integration of output with the existing technology landscape
  - Understand market norms and standards
- Task decomposition
  - Break down the objective into sub-tasks
  - Map sub-tasks to data mining problem definitions
- Identify constraints
  - Resources
  - Law, e.g. data protection
- Build a project plan
  - List assumptions and risk factors (technical, financial, business, organisational)

Data Understanding Phase
- Collect data
  - What are the data sources? Internal and external sources (e.g. Axiom, Experian)
  - Document reasons for inclusion/exclusion
  - Depend on a domain expert
  - Accessibility issues
    - Are there issues regarding data distribution across different databases/legacy systems?
    - Where are the disconnects?
- Data description
  - Document data quality issues
  - Compute basic statistics
- Data exploration
  - Simple univariate data plots/distributions
  - Investigate attribute interactions
- Data quality issues
  - Missing values: understand their source
  - Strange distributions

Data Preparation Phase
- Integrate data
  - Joining multiple data tables
  - Summarisation/aggregation of data
- Select data
  - Attribute subset selection
  - Rationale for inclusion/exclusion
  - Data sampling
  - Training/validation and test sets
- Data transformation
  - Using functions such as log
  - Factor/principal components analysis
  - Normalization/discretisation/binarisation
- Clean data
  - Handling missing values/outliers
- Data construction
  - Derived attributes

Modeling Phase
- Build the model
  - Choose initial parameter settings
  - Study model behaviour: sensitivity analysis
- Assess the model
  - Beware of over-fitting
  - Investigate the error distribution: identify segments of the state space where the model is less effective
  - Iteratively adjust parameter settings

Evaluation Phase
- Validate the model
  - Human evaluation of results by domain experts
  - Evaluate the usefulness of results from a business perspective
  - Define control groups
  - Calculate lift curves
  - Expected return on investment
- Review the process
- Determine next steps
  - Potential for deployment
  - Deployment architecture
  - Metrics for success of deployment

PMML (Predictive Model Markup Language)
Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be readily inherited, reused, and integrated. To meet this need, the Data Mining Group (DMG) proposed the PMML language.

At the time of writing, the latest PMML version is 4.1, which supports 16 types of data mining model:
- AssociationModel (association rules)
- BaselineModel (baseline model)
- ClusteringModel (clustering)
- GeneralRegressionModel (general regression)
- MiningModel (ensemble/composite model)
- NaiveBayesModel (naive Bayes)
- NearestNeighborModel (nearest neighbor)
- NeuralNetwork (neural network)
- RegressionModel (linear, polynomial, and logistic regression)
- RuleSetModel (rule set)
- SequenceModel (sequential patterns)
- Scorecard
- TimeSeriesModel
- SupportVectorMachineModel (support vector machine)
- TextModel (text model)
- TreeModel (decision tree)

A PMML model definition consists of the following parts:

Header: the header element contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version.

Data dictionary: the data dictionary records information about the data fields from which the model was built.

Data transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of simple data transfor
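To make the header and data dictionary parts concrete, the following is a minimal sketch in Python that reads a small PMML 4.1 document with the standard library. The embedded document is hand-written for illustration: the copyright text, the application name `ExampleMiner`, the fields `age` and `income`, and the coefficients are all hypothetical. The helper names `summarize` and `score` are likewise assumptions of this sketch, not part of any PMML toolkit, and `score` handles only the simplest case of a linear `RegressionTable` with numeric predictors.

```python
import xml.etree.ElementTree as ET

# PMML 4.1 documents use this XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

# Hypothetical, hand-written PMML document (illustration only).
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="Example Corp" description="Toy linear regression">
    <Application name="ExampleMiner" version="0.1"/>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1000.0">
      <NumericPredictor name="age" coefficient="50.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Top-level elements that are not models.
NON_MODEL = {"Header", "DataDictionary", "TransformationDictionary",
             "MiningBuildTask", "Extension"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rpartition("}")[2]


def summarize(pmml_text):
    """Collect header, data dictionary, and model names from a PMML string."""
    root = ET.fromstring(pmml_text)
    header = root.find("pmml:Header", NS)
    app = header.find("pmml:Application", NS)
    fields = [(f.get("name"), f.get("dataType"))
              for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)]
    models = [local_name(child.tag) for child in root
              if local_name(child.tag) not in NON_MODEL]
    return {
        "description": header.get("description"),
        "application": (app.get("name"), app.get("version")),
        "fields": fields,
        "models": models,
    }


def score(pmml_text, record):
    """Evaluate the first RegressionTable: intercept + sum(coef * value)."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    result = float(table.get("intercept"))
    for pred in table.findall("pmml:NumericPredictor", NS):
        result += float(pred.get("coefficient")) * record[pred.get("name")]
    return result


if __name__ == "__main__":
    print(summarize(PMML_DOC))
    print(score(PMML_DOC, {"age": 30.0}))
```

Because the model travels as plain XML, a tool other than the one that trained it can load and apply it, which is the reuse and integration goal stated above. A production scorer would also need to handle the remaining model types, transformations, and missing-value rules, none of which this sketch attempts.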