Machine Learning on Spark - UC Berkeley AMP Camp
Machine Learning on Spark
Shivaram Venkataraman
UC Berkeley

[Diagram labels: Computer Science, Machine learning, Statistics]

Machine learning
- Spam filters
- Recommendations
- Click prediction
- Search ranking

Machine learning techniques
- Classification
- Regression
- Clustering
- Active learning
- Collaborative filtering

Implementing Machine Learning
- Machine learning algorithms are
    - Complex, multi-stage
    - Iterative
- MapReduce/Hadoop unsuitable
- Need efficient primitives for data sharing

Spark
- RDDs: efficient data sharing
- In-memory caching accelerates performance
    - Up to 20x faster than Hadoop
- Easy to use high-level programming interface
    - Express complex algorithms in 100 lines.

Machine Learning using Spark

K-Means Clustering using Spark
Focus: Implementation and Performance

Clustering
Grouping data according to similarity
E.g. archaeological dig
[Figure axes: Distance East, Distance North]

K-Means Algorithm
Benefits
- Popular
- Fast
- Conceptually straightforward

K-Means: preliminaries
[Figure axes: Feature 1, Feature 2]
- Data: Collection of values
    data = lines.map(line => parseVector(line))
- Dissimilarity: Squared Euclidean distance
    dist = p.squaredDist(q)
- K = Number of clusters
- Data assignments to clusters: S1, S2, ..., SK

K-Means Algorithm
[Figure axes: Feature 1, Feature 2]
- Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
- Repeat until convergence:
    while (dist(centers, newCenters) > ε)
    - Assign each data point to the cluster with the closest center.
        closest = data.map(p => (closestPoint(p, centers), p))
    - Assign each cluster center to be the mean of its cluster's data points.
        pointsGroup = closest.groupByKey()
        newCenters = pointsGroup.mapValues(ps => average(ps))
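
The steps above can be tried out locally with a minimal sketch that uses plain Scala collections in place of RDDs. The slides name parseVector, closestPoint, and average but do not show their bodies, so the implementations here are assumptions. The kmeans function and its epsilon and maxIter parameters are hypothetical names introduced for illustration. Initial centers are passed in explicitly rather than drawn with takeSample, and the collection method groupBy stands in for the RDD operation groupByKey.

```scala
object KMeansSketch {
  // A point is a fixed-length vector of feature values.
  type Point = Vector[Double]

  // Assumed input format: one point per line, values separated by whitespace.
  def parseVector(line: String): Point =
    line.trim.split("\\s+").map(_.toDouble).toVector

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the center nearest to p.
  def closestPoint(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def average(ps: Seq[Point]): Point =
    ps.transpose.map(_.sum / ps.size).toVector

  // Iterate until the centers move by at most epsilon (summed squared
  // distance) or maxIter iterations have run. Both limits are assumptions.
  def kmeans(data: Seq[Point], initial: Seq[Point],
             epsilon: Double, maxIter: Int = 100): Seq[Point] = {
    var centers = initial
    var moved = Double.MaxValue
    var iter = 0
    while (moved > epsilon && iter < maxIter) {
      // Step 1: pair each point with the index of its closest center.
      val closest = data.map(p => (closestPoint(p, centers), p))
      // Step 2: group points by center index (local analogue of groupByKey).
      val pointsGroup = closest.groupBy(_._1).map { case (k, kv) => k -> kv.map(_._2) }
      // Step 3: new center = mean of its group; an empty cluster keeps its old center.
      val newCenters = centers.indices.map { i =>
        pointsGroup.get(i).map(average).getOrElse(centers(i))
      }
      moved = centers.zip(newCenters).map { case (c, n) => squaredDist(c, n) }.sum
      centers = newCenters
      iter += 1
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("0.0 0.0", "0.0 1.0", "10.0 10.0", "10.0 11.0")
    val data = lines.map(parseVector)
    // Start from one point in each visible group.
    val centers = kmeans(data, Seq(data(0), data(2)), epsilon = 1e-6)
    centers.foreach(println)
  }
}
```

With the four sample points in main, the loop settles on the centers (0.0, 0.5) and (10.0, 10.5) after two iterations. Keeping the old center for an empty cluster is one possible design choice; the slides do not say how that case is handled.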