Spark Basics and Operation Tuning

Published: 2021-09-14 01:23:54 | Source: Yisu Cloud | Views: 115 | Author: chen | Category: Cloud Computing

This article introduces the basic concepts of Spark and then covers operation tuning and resource tuning, two areas where people often run into trouble in practice.

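Spark Basics

Before discussing Spark tuning, let's look at a few Spark concepts.

action

An action is an RDD operation that returns a non-RDD result. Common actions in Spark include: reduce, collect, count, first, take, takeSample, countByKey, saveAsTextFile

job

Each Spark action is turned into a job.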

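stage

A job is split into groups of tasks, and each group is called a stage. A stage boundary is marked by one of the following two kinds of task:

ShuffleMapTask - comes before every wide transformation; roughly speaking, before a shuffle
ResultTask - roughly speaking, operations such as take()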

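partition

An RDD contains a fixed number of partitions, and each partition contains a number of records.

For an RDD returned by a narrow transformation (such as map or filter), the records in a partition are computed only from the records in the corresponding partition of the parent RDD. A narrow transformation also leaves the number of partitions unchanged.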
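As a rough illustration of the point about partition counts, the sketch below compares an RDD before and after narrow transformations, and then after a wide one. It assumes a local Spark session; the object name `PartitionCounts` and the sample data are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object PartitionCounts {
  // Returns the partition counts of: the source RDD, the RDD after
  // narrow transformations, and the RDD after a shuffle.
  def counts(sc: SparkContext): (Int, Int, Int) = {
    val numbers = sc.parallelize(1 to 100, 4)               // 4 partitions requested
    val narrowed = numbers.filter(_ % 2 == 0).map(n => (n % 3, n)) // narrow: no shuffle
    val shuffled = narrowed.reduceByKey(_ + _, 8)           // wide: shuffle into 8 partitions
    (numbers.getNumPartitions, narrowed.getNumPartitions, shuffled.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for demonstration only.
    val spark = SparkSession.builder().master("local[2]").appName("partition-counts").getOrCreate()
    println(counts(spark.sparkContext)) // (4,4,8)
    spark.stop()
  }
}
```

The filter and map steps keep the 4 partitions of the parent, while reduceByKey is given an explicit partition count for the stage that follows the shuffle.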

task

The unit of work sent to an executor; a single task processes the data of exactly one partition within one stage.

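Operation Tuning

Adjusting the number of partitions at stage boundaries can often have a large effect on how efficiently a program runs;
For associative reduction operations, use reduceByKey rather than groupByKey whenever possible, because groupByKey shuffles all of the data, whereas reduceByKey only shuffles the reduced results;
When the input and output types differ, use aggregateByKey instead of reduceByKey;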

aggregateByKey: Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
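A minimal sketch of the two cases, assuming a local Spark session: a per-key sum, where the value type and result type are both Int so reduceByKey fits, and a per-key mean, where the accumulator is a (sum, count) pair and aggregateByKey is the natural choice. The object name `ByKeyAggregation` and the sample data are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ByKeyAggregation {
  // Same input and output value type (Int): reduceByKey combines values
  // within each partition before shuffling the partial sums.
  def sumPerKey(sc: SparkContext, data: Seq[(String, Int)]): Map[String, Int] =
    sc.parallelize(data, 2).reduceByKey(_ + _).collect().toMap

  // Values are Int but the accumulator is (sum, count): aggregateByKey takes
  // a zero value, a function merging a value into the accumulator, and a
  // function merging two accumulators.
  def meanPerKey(sc: SparkContext, data: Seq[(String, Int)]): Map[String, Double] =
    sc.parallelize(data, 2)
      .aggregateByKey((0L, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1L),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for demonstration only.
    val spark = SparkSession.builder().master("local[2]").appName("by-key").getOrCreate()
    val data = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5))
    println(sumPerKey(spark.sparkContext, data))
    println(meanPerKey(spark.sparkContext, data))
    spark.stop()
  }
}
```

With the sample data, key "a" sums to 9 and key "b" to 6, and both keys have a mean of 3.0.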

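Avoid the flatMap-join-groupBy pattern; use cogroup instead;
When the results of two reduceByKey operations are joined, Spark does not shuffle for the join if both sides are partitioned the same way;
When one side of a join is a dataset small enough to fit in memory, consider broadcasting it instead of using join;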

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
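Building on the broadcast variable shown in the transcript, here is one possible way to replace a join with a broadcast lookup. It is a sketch under assumptions: the small side fits comfortably in memory as a Map, inner-join semantics are wanted, and the names `BroadcastJoin` and `joinWithSmallTable` are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object BroadcastJoin {
  // Inner join of a large keyed dataset against a small in-memory table.
  // The small table is shipped once to every executor, so the large side
  // is never shuffled.
  def joinWithSmallTable(
      sc: SparkContext,
      large: Seq[(Int, String)],
      small: Map[Int, String]): Seq[(Int, (String, String))] = {
    val smallBc = sc.broadcast(small)
    sc.parallelize(large, 2)
      .flatMap { case (k, v) =>
        // Keys missing from the small table are dropped (inner join).
        smallBc.value.get(k).map(s => (k, (v, s))).iterator
      }
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for demonstration only.
    val spark = SparkSession.builder().master("local[2]").appName("broadcast-join").getOrCreate()
    val orders = Seq((1, "order-a"), (2, "order-b"), (3, "order-c"))
    val users = Map(1 -> "alice", 2 -> "bob")
    joinWithSmallTable(spark.sparkContext, orders, users).foreach(println)
    spark.stop()
  }
}
```

In the usage above, the record with key 3 has no match in the small table and is dropped, as it would be by an inner join.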

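Resource Tuning

Resources in Spark can be boiled down to CPU and memory, and the following parameters affect how both are used.

executor: the more executors, the better the parallelism, but the more there are, the less memory each executor gets;
core: the more cores, the better the parallelism;

The HDFS client has performance problems when there are many concurrent threads. A rough estimate is that at most 5 parallel tasks per executor are enough to saturate the write bandwidth.

partition: having fewer partitions than executor * core is wasteful, since some cores will sit idle; the more partitions there are, the less memory each one takes up; beyond a certain number, adding partitions no longer improves performance.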

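My naive view is that tuning should go as follows:

1. core = min(5, number of CPU cores);
2. executor = number of instances * number of CPU cores / core;
3. The average number of executors per instance determines executor.memory, and therefore shuffle.memory and storage.memory;
4. Estimate the total data volume, that is, the data size at the largest shuffle (the Spark driver's run log records the shuffle size);
5. Divide the result of step 4 by the result of step 3 to get the number of partitions; if that number is small, set the partition count to a multiple of (executor * core).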
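The steps above can be sketched as a small calculation. This is only one reading of them, under loudly stated assumptions: `Cluster`, `Plan`, `shuffleFraction` and `slotMultiple` are hypothetical names rather than Spark settings, executor memory is taken as instance memory divided by executors per instance, the final division uses the shuffle memory of a single executor, and "small" is interpreted as "fewer than executor * core".

```scala
object NaiveSizing {
  // Hypothetical description of the cluster, supplied by the caller.
  final case class Cluster(instances: Int, cpuCoresPerInstance: Int, memoryPerInstanceGb: Double)

  final case class Plan(
      coresPerExecutor: Int,
      executors: Int,
      executorMemoryGb: Double,
      shuffleMemoryGb: Double,
      partitions: Int)

  def plan(
      cluster: Cluster,
      shuffleSizeGb: Double,          // estimated size of the largest shuffle
      shuffleFraction: Double = 0.2,  // assumed share of executor memory used for shuffle
      slotMultiple: Int = 3           // assumed multiple of (executor * core) for small jobs
  ): Plan = {
    // core = min(5, number of CPU cores)
    val cores = math.min(5, cluster.cpuCoresPerInstance)
    // executor = instances * CPU cores / core
    val executors = cluster.instances * cluster.cpuCoresPerInstance / cores
    // executors per instance determine the memory of each executor
    val executorsPerInstance = executors.toDouble / cluster.instances
    val executorMemoryGb = cluster.memoryPerInstanceGb / executorsPerInstance
    val shuffleMemoryGb = executorMemoryGb * shuffleFraction
    // shuffle size divided by shuffle memory, rounded up
    val bySize = math.ceil(shuffleSizeGb / shuffleMemoryGb).toInt
    // if that is small, fall back to a multiple of the task slots
    val slots = executors * cores
    val partitions = if (bySize < slots) slots * slotMultiple else bySize
    Plan(cores, executors, executorMemoryGb, shuffleMemoryGb, partitions)
  }

  def main(args: Array[String]): Unit = {
    val cluster = Cluster(instances = 10, cpuCoresPerInstance = 20, memoryPerInstanceGb = 64.0)
    println(plan(cluster, shuffleSizeGb = 500.0))
    println(plan(cluster, shuffleSizeGb = 5000.0))
  }
}
```

For the hypothetical cluster in `main`, this gives 5 cores per executor and 40 executors. A 500 GB shuffle falls below the 200 task slots and is bumped to 600 partitions, while a 5000 GB shuffle yields 1563 partitions.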

