[Big Data] Spark - 2: Introduction to Spark - RDD, DAG

Berkeley Data Analytics Stack

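Spark is a subsystem of the Berkeley Data Analytics Stack (BDAS).
The goal of BDAS is to build a system that is compatible with the existing Hadoop ecosystem, faster, and easier to use. Each subsystem can also run on its own.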

Data Processing Goals

Low-latency (interactive) queries on historical data: enable faster decisions
- Example: identify why a site is slow and fix it
Low-latency queries on live data (streaming): enable decisions on real-time data
- Example: detect and block worms in real time (a worm may infect 1 million hosts in 1.3 seconds)
Sophisticated data processing: enable "better" decisions
- Example: anomaly detection, trend analysis

Spark

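A distributed computing system designed for data analytics; it runs iterative and interactive computations in memory
Speed: up to 100x faster than Hadoop MapReduce when the data is processed in memory
- MapReduce spends too much time on disk I/O
Run modes (covered in detail in a later post)
- Local mode
- Standalone cluster mode
- YARN client mode
- YARN cluster mode
Supported languages: Scala, Java, Python, R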
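
The four run modes listed above map onto different `--master` and `--deploy-mode` arguments of `spark-submit`. The sketch below is only an illustration in plain Python: the helper name `submit_command`, the host name `master-host`, and the script name `app.py` are assumptions made for this example and do not come from the original post.

```python
def submit_command(mode, app="app.py", standalone_host="master-host"):
    """Build a spark-submit argument list for one of the four run modes."""
    if mode == "local":
        # local[*]: run on this machine with one worker thread per core
        master_args = ["--master", "local[*]"]
    elif mode == "standalone":
        # 7077 is the default port of a standalone master
        master_args = ["--master", f"spark://{standalone_host}:7077"]
    elif mode == "yarn-client":
        # the driver runs in the process that submitted the job
        master_args = ["--master", "yarn", "--deploy-mode", "client"]
    elif mode == "yarn-cluster":
        # the driver runs inside the YARN cluster
        master_args = ["--master", "yarn", "--deploy-mode", "cluster"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ["spark-submit", *master_args, app]


for mode in ("local", "standalone", "yarn-client", "yarn-cluster"):
    print(" ".join(submit_command(mode)))
```

The application code stays the same across modes; only the submit arguments change.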

Scheduling Process

DAG (Directed Acyclic Graph)
- eliminates the MapReduce multi-stage execution model
- represents the chain of transformations from the input RDD to the result RDD
RDD (Resilient Distributed Dataset)
- a fault-tolerant and efficient data abstraction
- can be created from parallelized collections
- can be created from Hadoop datasets
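
To make the DAG idea concrete, here is a toy sketch in plain Python. It is not Spark's scheduler; it only shows that once the dependencies between datasets form a directed acyclic graph, a valid execution order from the inputs to the result can be derived from it. The dataset names are invented for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a (hypothetical) dataset; its value is the set of datasets
# it is derived from.
dag = {
    "lines": set(),                        # input dataset
    "words": {"lines"},
    "pairs": {"words"},
    "counts": {"pairs"},
    "stopwords": set(),                    # a second input dataset
    "filtered": {"counts", "stopwords"},   # result dataset
}

# static_order() yields every dataset after all of its parents.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because `filtered` depends, directly or indirectly, on every other dataset, it always comes last; the relative order of independent branches such as `stopwords` and `words` may vary.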

RDD

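RDDs are compatible with other systems and can load datasets from external storage systems
- Examples: HDFS, HBase, or other Hadoop data sources
Properties: immutable, distributed, parallel execution
What an RDD records:
- Lineage
- Optimized execution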

RDD Operation Types

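Operation	Description
Transformation	1. Applying a transformation to an RDD produces another RDD. 2. Because RDDs are lazy, a transformation does not run immediately; it runs only when an action is called.
Action	1. Applying an action to an RDD does not produce another RDD; it returns a value or an array, or writes to a file system. 2. It also triggers all the transformations that precede it.
Persistence	An RDD that will be reused can be persisted in memory for later use, which speeds up execution.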
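
The behaviour of the three operation types can be imitated with a small toy model in plain Python. This is a sketch for intuition only, not the Spark API: the class `LazyDataset` and its internals are assumptions made for this example, and only the method names `map`, `filter`, `persist`, `collect`, and `count` are borrowed from RDDs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces a list
        self._persist = False
        self._cached = None

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    # Persistence: only marks the dataset; data is cached on first computation.
    def persist(self):
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    # Actions: trigger the whole chain and return a plain value.
    def collect(self):
        return list(self._materialize())

    def count(self):
        return len(self._materialize())


calls = []

def square(x):
    calls.append(x)   # record every invocation so laziness is observable
    return x * x

base = LazyDataset(lambda: [1, 2, 3, 4])
squares = base.map(square).persist()
evens = squares.filter(lambda x: x % 2 == 0)

print(len(calls))        # 0: no transformation has run yet
print(evens.collect())   # [4, 16]: the action runs map and filter together
print(squares.count())   # 4: answered from the persisted result
print(len(calls))        # 4: square ran once per element, not twice
```

Without the `persist()` call, the second action would recompute `square` for every element, which is the cost that persistence avoids.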

More detail: see the documentation
Fault Tolerance
- RDDs maintain lineage information that can be used to reconstruct lost partitions
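
Lineage-based recovery can be sketched with another toy model in plain Python. This is not how Spark is implemented; the class `PartitionedDataset` and its `recover` method are hypothetical names for this example. Each derived dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt from the matching parent partition instead of being restored from a replica.

```python
class PartitionedDataset:
    """Toy dataset split into partitions, with a record of how it was derived."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = list(partitions)  # list of lists; None marks a lost partition
        self.parent = parent                # lineage: the dataset this one came from
        self.fn = fn                        # lineage: per-element function applied to the parent

    def map(self, fn):
        # Computed eagerly here to keep the example short; only the
        # lineage bookkeeping matters for this illustration.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return PartitionedDataset(new_parts, parent=self, fn=fn)

    def recover(self, index):
        """Rebuild one lost partition from the parent's matching partition."""
        if self.parent is None:
            raise RuntimeError("source data has no lineage to recompute from")
        if self.parent.partitions[index] is None:
            self.parent.recover(index)      # walk further back along the lineage
        self.partitions[index] = [self.fn(x) for x in self.parent.partitions[index]]
        return self.partitions[index]


source = PartitionedDataset([[1, 2], [3, 4], [5, 6]])
doubled = source.map(lambda x: x * 2)

doubled.partitions[1] = None      # simulate losing one partition on a failed node
print(doubled.recover(1))         # [6, 8]
print(doubled.partitions)         # [[2, 4], [6, 8], [10, 12]]
```

Only the lost partition is recomputed; the other partitions are left untouched.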

Comparison: Hadoop vs. Spark

[Figure: I/O flow of Hadoop MapReduce compared with Spark]

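MapReduce goes directly to disk for every read and write
- A complex job can involve a dozen or more disk reads and writes
- I/O and serialization are time-consuming
Spark ideally needs only one round of disk reads and writes; most of the computation takes place in RAM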
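
The difference can be put into rough numbers with a back-of-the-envelope model. The two functions below are a deliberately idealized sketch (the names are made up for this example): they assume every MapReduce step reads its input from disk and writes its output back, and that the in-memory pipeline touches disk only for the initial read and the final write. Real jobs are messier, for example when data does not fit in memory.

```python
def mapreduce_disk_ops(steps):
    # Idealized: each step reads its input from disk and writes its output back.
    return 2 * steps


def in_memory_disk_ops(steps):
    # Idealized: one initial read and one final write, however many steps run.
    return 2 if steps > 0 else 0


for steps in (1, 5, 10):
    print(steps, mapreduce_disk_ops(steps), in_memory_disk_ops(steps))
```

Under these assumptions a 10-step job costs 20 disk operations in the MapReduce model and 2 in the in-memory model, which is why iterative workloads benefit the most.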

Further Reading

Hadoop ecosystem: tool overview, installation guides, and usage scenarios (in Chinese): https://ithelp.ithome.com.tw/users/20107349/ironman/1309
Spark 2.0 in Scala (in Chinese): https://ithelp.ithome.com.tw/users/20103839/ironman/1210
Cloudera online courses: https://www.cloudera.com/more/training.html
What is a DAG: http://www.csie.ntnu.edu.tw/~u91029/DirectedAcyclicGraph.html
Introduction to Spark RDDs with example commands: http://hadoopspark.blogspot.com/2015/09/9-rddresilient-distributed-dataset.html

References

Spark official website: http://spark.apache.org/
Course handout: https://tims.etraining.gov.tw/TIMSonline/index3.aspx?OCID=113442
Berkeley Data Analytics Stack: https://slideplayer.com/slide/4499532/
Getting Started with Apache Spark (1): https://ithelp.ithome.com.tw/articles/10198318
Introduction to Hadoop, MapReduce, and Apache Spark: https://slideplayer.com/slide/9521933/
Spark Internals - Hadoop Source Code Reading #16 in Japan: https://www.slideshare.net/taroleo/spark-internal-hadoop-source-code-reading-16-in-japan

SpicyBoyd
