山东大学计算机学院

资源描述

《山东大学计算机学院》由会员分享，可在线阅读，更多相关《山东大学计算机学院（22页珍藏版）》请在金锄头文库上搜索。

1、山东大学计算机学院山东大学计算机学院MakingInterval-BasedClusteringRank-Aware报告人：李婷山东大学计算机学院山东大学计算机学院出处：InternationalConferenceonExtendingDatabaseTechnology作者：JuliaStoyanovich，SihemAmer-Yahia,TovaMiloUniversityofPennsylvaniaPhiladelphia研究方向：Databases、Webdatamanagement、WebservicesandWebapplications、BusinessProcesses山东大

2、学计算机学院山东大学计算机学院ContentsI.IntroductionII.FormalismIII.TheBARACAlgorithmIV.Evaluationofeffectiveness山东大学计算机学院山东大学计算机学院I.IntroductionIndatingsites,ausermayspecifytheage,height,income,education,politicalaffiliation,andreligionofapotentialmatch.Inrealestateapplications,ausermaydescribehisdreamhomebyitslo

3、cation,size,andnumberofbedrooms.Thenumberofmatchesisoftenveryhigh,makingdataexplorationaninterestingchallenge.Typicallyusersalsospecifyrankingcriteriafortheretrieveditems,e.g.,asortorderonasingleattribute,oraweightedcombinationofmultipleattributes.排序帮助用户，依据他们的标准提供高质量的数据，同时导致同类的匹配数据，用户需要浏览大量的数据后，才能找到

4、下一类数据。.山东大学计算机学院山东大学计算机学院I.Introduction例如：adatingwebsite用户：lookingforapartnerbetween20and40yearsold,sortingthematchesbyincomefromhighertolower结果：seeingalargenumberofmatchesintheirlate30swhoholdanMBAdegreeandworkinthefinancialindustry,beforeseeinganymatchesindifferentagegroupsandwalksoflife.更合理的结果展示：

5、在结果集中，找到数据属性的聚类。Theseclustersmaydescribematchesbetween35and40withanMBA,matchesbetween25and30whoworkinthesoftwareindustry,etc.,allowingfordataexplorationofrankedresults.山东大学计算机学院山东大学计算机学院I.Introduction本文提出的聚类算法，得到的聚类具备三个性质：.1)ClusteringQualityrank-awareclusteringqualityMeasures(1)QtopN:treatthetopNit

6、emsofeachintervalassets(2)QSCORE:accountforthescoresoftheitems(3)QSCORE&RANK:accountforbothscoresandranks2)Tightness对于一个有序区间，两个或多个连续的区间连接在一起，会产生一个更大的区间，但是并不一定增加新的items，因此可能会产生一个错误的聚类描述。例如：按收入从高到底排序，我们发现20到24岁的得分要低于25到29岁的用户。也就是20到29岁包含的items与25到29岁的一致，认为这个聚类不紧密。3)Maximality需要发现与区间集合下尽可能多的items。例如：in

7、tervalsI1:age20,29,I2:edu=MBA,andI3:income75K,100K.IftwodimensionalclustersI1、I2,I1、I3,andI2、I3arediscovered,aswellasathree-dimensionalclusterI1、I2、I3,thenonlyI1、I2、I3ispresentedtotheuserI.Introduction-Contributions定义了基于区间的感知顺序的聚类，及相应的衡量聚类质量策略方法。提出了一个算法BARAC，自上而下的聚类算法，并验证了方法的有效性。.II.FormalismA：Inter

8、valandSubspaces1、RankedInterval与Intervaldominance（1）数据集D：由属性-值对描述属性集合A：每个属性由name和一个值域表示。（2）一种计算聚类质量的方法，用于定义聚类算法的搜索空间II.FormalismA：IntervalandSubspacesII.FormalismA：IntervalandSubspaces(2)例如：givenS:I1,I2andS:I1,I2,I3,SS2、RankedSubspace与SubspaceInclusion(1)Intervals的数量为子空间的维数。例如一个三维子空间：S:age20,24,edu=

9、MS,inc75K,100K.II.FormalismB：Rank-AwareClustering1）Rank-AwareClusteringQuality考虑了三种衡量聚类质量的方法：将子空间的质量与阈值Q(0.5,1比较。三种测量方法：QtopN：如果子空间包含的items，多数来至子空间的每个区间里。QSCORE：包含多数高分的items。例如：QSCORE：N=3andQ=0.8.S2：QSCORE=(500+100+80)/(500+175+150)=0.824,QSCORE&RANK：考虑了items的排序和得分。在靠前的顺序中，得分较高。II.FormalismB：Rank-Aw

10、areClustering1）Rank-AwareClusteringQualityII.FormalismB：Rank-AwareClustering1）Rank-AwareClusteringQualityTheorem2.1.ThedownwardclosurepropertyholdsforQtopN,QSCORE,andQSCORE&RANK.Proof：QtopN：|AB|min(|A|,|B|).ThusthevalueofQtopNisnonincreasingwithincreasingdimensionality.QSCOREandQSCORE&RANK：thevalues

11、ofQSCOREandQSCORE&RANKarestrictlynon-increasingwithincreasingdimensionalityII.FormalismB：Rank-AwareClustering2）Tightness对于有序属性，合并两个连续的区间可能并没有任何意义。例如按年龄排序，I1:age20,24andI2:age25,29，thetop-10itemsinI1+I2originatefromI1。对于排序函数R3，I1+I2与I1和I2的top-N数据项不同，I1+I2是tight的。因此增加I1,I2,andI1+I2到搜索空间中是有意义的，用户可以发现使用

12、I1或者I2都无法发现的聚类。II.FormalismB：Rank-AwareClustering3）Maximality图2所示，S2是最大化子空间，而S2:I3,I4不是。II.FormalismC：ProblemStatement本文需要解决的问题为对于数据集D,得分函数R,聚类质量测量方法Q,整数N,dominance的阈值dom,聚类质量阈值Q,找到所有聚类。III.TheBARACAlgorithmBARAC,至下而上的聚类算法III.TheBARACAlgorithmA：BuildGridBARAC-BuildGridIII.TheBARACAlgorithmB：MergeBARAC-MergeIII.TheBARACAlgorithmC：JoinBARAC-JoinIII.TheBARACAlgorithmD：ChooseBARAC-Choose第一种方式：从Join产生的子空间中，选择最大化的子空间大部分用户更倾向于有多种描述的聚类，包括属性名称和取值范围。第二种方式：最大化聚类item的得分本文使用K-means，识别出maxClusters聚类，maxClusters为选择的聚类数量山东大学计算机学院山东大学计算机学院谢谢！

展开阅读全文

山东大学计算机学院

最新文档