Lesson 2: Data Preprocessing Techniques

Xu Congfu (徐从富), Associate Professor
Institute of Artificial Intelligence, Zhejiang University
Lecture slides for the Zhejiang University undergraduate course "Introduction to Data Mining"

Outline
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

I. Why Data Preprocessing?
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthday=“03/07/1997”
    - e.g., Was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data comes from:
  - "n/a" data values when collected
  - different considerations between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
- Noisy data comes from the process of data:
  - collection
  - entry
  - transmission
- Inconsistent data comes from:
  - different data sources
  - functional dependency violations

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  - A data warehouse needs consistent integration of quality data
- “Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse.” (Bill Inmon)

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility.

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

[Figure: Forms of data preprocessing]

II. Data Cleaning
- Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data
- Data is not always available
  - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register history or changes of the data
- Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with:
  - a global constant, e.g., “unknown” (a new class?!)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, smooth by bin median,
    smooth by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)
- Regression
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data is not handled well.
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
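The excerpt stops after the partition step, so the following is only a minimal sketch of how the smoothing step named earlier (smooth by bin means, smooth by bin boundaries) might look on the same price data. The function names `equi_depth_bins`, `smooth_by_means`, and `smooth_by_boundaries` are illustrative assumptions, not part of the slides, and the tie-breaking rule for boundaries (ties go to the lower boundary) is also an assumption.

```python
from statistics import mean


def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.

    Assumes each bin is sorted and non-empty; ties go to the minimum (an assumption).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

On this data the bin means come out as 9, 22.75, and 29.25, and smoothing by boundaries gives 4, 4, 4, 15 / 21, 21, 25, 25 / 26, 26, 26, 34.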

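For comparison, the equal-width rule W = (B - A)/N from the binning slide can be applied to the same price data. The sketch below is illustrative only: the function name `equal_width_bins` and the choice of N = 3 are assumptions rather than anything stated in the slides, and it assumes the attribute has at least two distinct values so that W > 0.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N.

    Assumes max(values) > min(values), so the width is non-zero.
    """
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value B would land at index N, so clamp it into the last bin.
        idx = min(int((v - a) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width)
print(bins)
```

With these prices the width is 10, and the last interval holds six of the twelve values while the other two hold three each, which illustrates how equal-width bins can fill unevenly when the data is not uniformly spread.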