MACHINE LEARNING ON SPARK - UC BERKELEY AMP CAMP

Machine Learning on Spark
Shivaram Venkataraman, UC Berkeley

[Figure: diagram with labels Computer Science, Machine learning, Statistics]

Machine learning
  Spam filters
  Recommendations
  Click prediction
  Search ranking

Machine learning techniques
  Classification
  Regression
  Clustering
  Active learning
  Collaborative filtering

Implementing Machine Learning
  Machine learning algorithms are
    - Complex, multi-stage
    - Iterative
  MapReduce/Hadoop unsuitable
  Need efficient primitives for data sharing

Spark
  RDDs: efficient data sharing
  In-memory caching accelerates performance
    - Up to 20x faster than Hadoop
  Easy to use high-level programming interface
    - Express complex algorithms in about 100 lines

Machine Learning using Spark
  Machine learning techniques: classification, regression, clustering, active learning, collaborative filtering

K-Means Clustering using Spark
  Focus: Implementation and Performance

Clustering
  Grouping data according to similarity
  E.g. archaeological dig
  [Figure: scatter plot, axes Distance East and Distance North]

K-Means Algorithm
  Benefits
    - Popular
    - Fast
    - Conceptually straightforward

K-Means: preliminaries
  Data: Collection of values
      data = lines.map(line => parseVector(line))
  Dissimilarity: Squared Euclidean distance
      dist = p.squaredDist(q)
  K = Number of clusters
  Data assignments to clusters: S1, S2, ..., SK
  [Figure: scatter plot, axes Feature 1 and Feature 2]

K-Means Algorithm
  Initialize K cluster centers
  Repeat until convergence:
    - Assign each data point to the cluster with the closest center.
    - Assign each cluster center to be the mean of its cluster's data points.
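
The preliminaries above (parsing data, squared Euclidean distance, picking the closest center) can be sketched in plain Scala without a Spark cluster. This is a minimal sketch under stated assumptions, not the code behind the slides: points are modelled as `Vector[Double]`, the input format (whitespace-separated numbers, one point per line) is hypothetical, and `squaredDist` is written as a free function rather than the method call `p.squaredDist(q)` shown on the slide. The body of `closestPoint` is also an assumption, since the slides only use the name.

```scala
object KMeansPreliminaries {
  // Assumption: a point is a plain immutable vector of feature values.
  type Point = Vector[Double]

  // Hypothetical input format: features separated by whitespace, one point per line.
  def parseVector(line: String): Point =
    line.trim.split("\\s+").map(_.toDouble).toVector

  // Squared Euclidean distance between two points of the same dimension.
  def squaredDist(p: Point, q: Point): Double = {
    require(p.length == q.length, "points must have the same dimension")
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  // Index of the center nearest to p under squared Euclidean distance.
  def closestPoint(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  def main(args: Array[String]): Unit = {
    val data = Seq("0.0 0.0", "1.0 1.0", "9.0 8.0").map(parseVector)
    val centers = Seq(Vector(0.0, 0.0), Vector(10.0, 10.0))
    // Print the cluster index assigned to each point.
    data.foreach(p => println(s"$p -> cluster ${closestPoint(p, centers)}"))
  }
}
```

Using the squared distance rather than the distance itself avoids a square root and does not change which center is closest.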

K-Means Algorithm (with Spark code)
  Initialize K cluster centers
      centers = data.takeSample(false, K, seed)
  Repeat until convergence:
      while (dist(centers, newCenters) > …)
    - Assign each data point to the cluster with the closest center.
        closest = data.map(p => (closestPoint(p, centers), p))
    - Assign each cluster center to be the mean of its cluster's data points.
        pointsGroup = closest.groupByKey()
        newCenters = pointsGroup.mapValues(ps => average(ps))
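
To make the loop concrete, here is a local sketch that mirrors the same pipeline (sample, map, group, average, test for convergence) on ordinary Scala collections instead of RDDs. Everything not on the slides is an assumption made for illustration: the `Point` type, the bodies of `closestPoint` and `average`, the `epsilon` and `maxIter` parameters, the use of a seeded shuffle as a stand-in for `takeSample(false, K, seed)`, and the choice to keep a center in place when no point is assigned to it.

```scala
import scala.util.Random

object LocalKMeans {
  type Point = Vector[Double]

  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the center nearest to p.
  def closestPoint(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def average(ps: Seq[Point]): Point =
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.size)

  // epsilon and maxIter are hypothetical parameters; the slides do not show them.
  def kMeans(data: Seq[Point], k: Int, epsilon: Double, seed: Long,
             maxIter: Int = 100): Seq[Point] = {
    require(k > 0 && k <= data.size, "need 1 <= k <= number of points")
    // Local stand-in for takeSample(false, K, seed): k points without replacement.
    var centers: Seq[Point] = new Random(seed).shuffle(data).take(k)
    var moved = Double.MaxValue
    var iter = 0
    while (moved > epsilon && iter < maxIter) {
      // Assignment step: pair each point with the index of its closest center.
      val closest = data.map(p => (closestPoint(p, centers), p))
      // Local equivalent of groupByKey: cluster index -> points in that cluster.
      val pointsGroup = closest.groupBy(_._1).map { case (i, ps) => i -> ps.map(_._2) }
      // Update step: mean of each cluster; an empty cluster keeps its old center.
      val newCenters =
        centers.indices.map(i => pointsGroup.get(i).map(average).getOrElse(centers(i)))
      // Total squared movement of the centers drives the convergence test.
      moved = centers.zip(newCenters).map { case (c, n) => squaredDist(c, n) }.sum
      centers = newCenters
      iter += 1
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Vector(0.0, 0.0), Vector(0.0, 1.0), Vector(10.0, 10.0), Vector(10.0, 11.0))
    kMeans(data, k = 2, epsilon = 1e-9, seed = 42L).foreach(println)
  }
}
```

In the Spark version, `data` would typically be cached in memory so that each iteration rereads it without going back to disk, which is the data-sharing benefit the earlier slides attribute to RDDs.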
