Conquer Big Data through Spark

Course Background:

Apache Spark is a fast and general engine for large-scale data processing.

Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. You can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

You can run Spark readily using its standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.

Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala and Python shells.

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce and deployed in clusters ranging from a handful to thousands of nodes. In the past few years, Databricks, with the help of the Spark community, has contributed many improvements to Apache Spark to improve its performance, stability, and scalability.

This enabled Databricks to use Apache Spark to sort 100 TB of data on 206 machines in 23 minutes, which is 3X faster than the previous Hadoop 100 TB result on 2100 machines. Similarly, Databricks sorted 1 PB of data on 190 machines in less than 4 hours, which is over 4X faster than the previous Hadoop 1 PB result on 3800 machines. Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, and it enables equally dramatic improvements in time and cost for all Big Data users.

Course Introduction:

This course covers almost everything an application developer needs to build diverse Spark applications that meet all kinds of business requirements: the architecture of Spark, the Spark programming model, Spark internals, Spark SQL, MLlib, GraphX, Spark Streaming, testing, tuning, Spark on YARN, JobServer, and SparkR. Additionally, this course covers the essential skills needed to write Scala code in Spark, to help those who are not familiar with Scala.

Who Needs to Attend:

Anyone who is interested in Big Data development;
Hadoop developers;
Other Big Data developers.

Instructor:

Wang Jialin (王家林) (phone: 18610086859; QQ: 1740415547; WeChat: 18610086859)

President and chief expert of the Spark Asia-Pacific Research Institute, billed as currently the only expert in China with combined mastery of mobile Internet, cloud computing, and big data technologies. A source-code-level expert in Docker, the hottest technology in cloud computing today, and one of the earliest Docker practitioners in China. Has extensive source-code, hands-on, and performance-tuning experience with Spark, Hadoop, Android, and Docker, and has thoroughly studied the Spark source code across all 18 releases from 0.5.0 to 1.1.0.

A source-code-level Hadoop expert who led development of a Hadoop-like framework at a well-known company and who focuses on providing one-stop Hadoop solutions. Also one of the earliest practitioners of distributed big data processing in the cloud and a Hadoop enthusiast who continually applies Hadoop to the efficient processing and storage of big data in different domains; currently responsible for Hadoop research and development for search engines. Author of the book series "云计算分布式大数据 Hadoop 实战高手之路" (The Road to Hadoop Mastery for Cloud-Based Distributed Big Data), including the volumes "从零开始" (Starting from Scratch), "高手崛起" (Rise of the Expert), and "高手之巅" (Summit of the Expert).

Customizer of several browsers and a leading HTML5 technologist in mainland China. Has delivered Linux- and Android-based integrated software and hardware solutions to more than 50 companies. Skilled at building systems and frameworks, with particular expertise in framework implementations that mix Java with C/C++.

Android architect, senior engineer, consultant, and trainer; proficient in Android, HTML5, and Hadoop; passionate about English broadcasting and bodybuilding; devoted to one-stop solutions that integrate software, hardware, and cloud for Android, HTML5, and Hadoop. One of the first technical experts and technology entrepreneurs in China (since 2007) to work on Android system porting, software and hardware integration, framework modification, application development, and Android system and application testing. One of the earliest practitioners of HTML5 (since 2009), having built several custom HTML5 browsers for multiple organizations and taken part in the development of a well-known HTML5 browser. Author of more than 10 best-selling IT books.

"Winning in the Big Data Era" (决胜大数据时代) 100-session public lecture series: http:/
"Spark 实战高手之路" (The Road to Spark Mastery) complete course series: http:/
"Spark 实战高手之路": http:/
Spark special issue (Spark 专刊): http:/
Spark Chinese documentation (Spark 中文文档): http:/

Prerequisites:

Be familiar with the basics of object-oriented programming.

Course Outline

Day 1

Class 1: The Architecture of Spark
  1 Ecosystem of Spark
  2 Design of Spark
  3 RDD
  4 Fault-tolerance in Spark

Class 2: Programming with Scala
  1 Classes and Objects in Scala
  2 Functional Objects
  3 Traits
  4 Case Classes and Pattern Matching
  5 Collections
  6 Implicit Conversions and Parameters
  7 Actors and Concurrency

Class 3: Spark Programming Model
  1 RDD
  2 Transformation
  3 Action
  4 Lineage
  5 Dependency

Class 4: Spark Internals
  1 Spark Cluster
  2 Job Scheduling
  3 DAGScheduler
  4 TaskScheduler
  5 Task Internals

Day 2

Class 5: Broadcasts and Accumulators
  1 Broadcast Internals
  2 Best Practices for Broadcasts
  3 Accumulator Internals
  4 Best Practices for Accumulators

Class 6: Spark Programming in Action
  1 Data Sources: File, HDFS, HBase, S3
  2 IDEA
  3 Maven
  4 sbt
  5 Code
  6 Deployment

Class 7: Deep in Spark Driver
  1 The Secret of SparkContext
  2 The Secret of SparkConf
  4 The Secret of SparkEnv

Class 8: Deep in RDD
  1 DAG
  2 Scala RDD Function
  3 Spark Java RDD Function
  4 RDD Tuning

Day 3

Class 9: Machine Learning on Spark
  1 LinearRegression
  2 K-Means
  3 Collaborative Filtering

Class 10: Graph Computation on Spark
  1 Table Operators
  2 Graph Operators
  3 GraphX Algorithms

Class 11: Spark SQL
  1 Parquet, JSON, JDBC
  2 DSL
  3 SQL on RDD

Class 12: Spark Streaming
  1 DStream
  2 Transformation
  3 Checkpoint
  4 Tuning

Day 4

Class 13: Spark
