Lesson 19: Spark Advanced Sorting Thoroughly Demystified

Published: 2020-07-04 03:30:43  Source: web  Views: 1451  Author: Spark_2016  Category: Big Data

In this lesson:

1. Basic sorting in practice

2. Secondary sort in practice

3. More advanced sorting algorithms

4. Sorting internals demystified

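Sorting is used frequently in Spark applications, and over a varying number of dimensions, for example secondary sort and tertiary sort. It also comes up often in machine learning algorithms, so it is important to master.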

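A secondary sort orders records by the values of two columns: first by column one and, where those values are equal, by column two. Take the following test data: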

2 3

4 1

3 2

4 3

8 7

2 1

Result after the secondary sort (ascending):

2 1

2 3

3 2

4 1

4 3

8 7
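To make the ordering rule concrete before any Spark code appears, here is a minimal sketch in plain Java (no Spark involved) that sorts the test data above by the first column and then by the second. The class and method names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical class name; plain Java collections, not Spark.
class SecondarySortSketch {

    // Sort "a b" lines by the first column, then by the second, both ascending.
    static List<String> sortLines(List<String> lines) {
        Comparator<int[]> byFirstThenSecond = Comparator
                .<int[]>comparingInt(p -> p[0])
                .thenComparingInt(p -> p[1]);

        List<String> result = new ArrayList<>();
        lines.stream()
             .map(line -> line.trim().split("\\s+"))
             .map(cols -> new int[] {Integer.parseInt(cols[0]), Integer.parseInt(cols[1])})
             .sorted(byFirstThenSecond)
             .forEach(p -> result.add(p[0] + " " + p[1]));
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("2 3", "4 1", "3 2", "4 3", "8 7", "2 1");
        sortLines(input).forEach(System.out::println);
    }
}
```

Run on the six test lines, this prints them in the ascending order listed above.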

Before writing the secondary sort, here is a short program that sorts on a single key:

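import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SortByKey").setMaster("local")
val sc = new SparkContext(conf)
val lines = sc.textFile("C:\\User\\Test.txt")

// Count each word, producing (word, count) pairs
val words = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// sortByKey sorts on the key, so swap to (count, word), sort descending, then swap back
val wordcount = words.map(word => (word._2, word._1)).sortByKey(false).map(word => (word._2, word._1))

wordcount.collect().foreach(println)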

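This is a simple word count program that uses sortByKey for the sort. Because sortByKey orders by key, each (word, count) pair is swapped to (count, word) before sorting and swapped back afterwards.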
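The same count-then-sort-by-count idea can be sketched with plain Java collections rather than Spark. This is only an illustration of the logic, not the Spark API; the class name and the sample sentences in `main` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; mirrors the word count logic without Spark.
class WordCountSortSketch {

    // Count words, then order the (word, count) entries by count, highest first.
    static List<Map.Entry<String, Integer>> countAndSort(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {              // skip empty tokens from repeated spaces
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sorting on the value plays the role of swapping to (count, word) and calling sortByKey(false)
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample input
        List<String> lines = Arrays.asList("spark hadoop spark", "flink spark hadoop");
        for (Map.Entry<String, Integer> e : countAndSort(lines)) {
            System.out.println("(" + e.getKey() + "," + e.getValue() + ")");
        }
    }
}
```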

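Next, we implement the secondary sort in code.

We start with a Java implementation that sorts the test data above.

The most important part of the sort is preparing the key, so we first write the secondary sort key in Java. Reference code: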

import java.io.Serializable;

import scala.math.Ordered;

// Custom key for the secondary sort: ordered by first, then by second
public class SecondarySortKey implements Ordered<SecondarySortKey>, Serializable {

    private int first;
    private int second;

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + first;
        result = prime * result + second;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SecondarySortKey other = (SecondarySortKey) obj;
        if (first != other.first)
            return false;
        if (second != other.second)
            return false;
        return true;
    }

    public int getFirst() {
        return first;
    }

    public void setFirst(int first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

    public SecondarySortKey(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Scala's > operator
    public boolean $greater(SecondarySortKey other) {
        if (this.first > other.getFirst()) {
            return true;
        } else if (this.first == other.getFirst() && this.second > other.getSecond()) {
            return true;
        }
        return false;
    }

    // Scala's >= operator
    public boolean $greater$eq(SecondarySortKey other) {
        if (this.$greater(other)) {
            return true;
        } else if (this.first == other.getFirst() && this.second == other.getSecond()) {
            return true;
        }
        return false;
    }

    // Scala's < operator
    public boolean $less(SecondarySortKey other) {
        if (this.first < other.getFirst()) {
            return true;
        } else if (this.first == other.getFirst() && this.second < other.getSecond()) {
            return true;
        }
        return false;
    }

    // Scala's <= operator: less than, or equal on both fields
    public boolean $less$eq(SecondarySortKey other) {
        if (this.$less(other)) {
            return true;
        } else if (this.first == other.getFirst() && this.second == other.getSecond()) {
            return true;
        }
        return false;
    }

    // Compare by first; fall back to second only when the first fields are equal.
    // Integer.compare avoids the overflow that subtracting two ints can cause.
    public int compare(SecondarySortKey other) {
        if (this.first != other.getFirst()) {
            return Integer.compare(this.first, other.getFirst());
        } else {
            return Integer.compare(this.second, other.getSecond());
        }
    }

    public int compareTo(SecondarySortKey other) {
        return compare(other);
    }
}
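One detail worth checking when writing a compare method for a key like this is how the two ints are compared. Returning a difference such as `a - b` is compact, but it can overflow when the operands are far apart, which flips the sign of the result. The following plain Java sketch (hypothetical class and method names) shows the effect next to `Integer.compare`, which does not have this problem. For small values like the test data in this lesson, both approaches agree.

```java
// Hypothetical class name; demonstrates int overflow in subtraction-based comparison.
class CompareOverflowSketch {

    // Comparison by difference: compact, but can overflow.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Comparison via Integer.compare: returns a negative, zero, or positive value without overflow.
    static int byIntegerCompare(int a, int b) {
        return Integer.compare(a, b);
    }

    public static void main(String[] args) {
        int big = Integer.MAX_VALUE;
        int negative = -1;
        // big is clearly greater than negative, so a correct comparison is positive.
        System.out.println(bySubtraction(big, negative) > 0);    // false: big - (-1) wraps around to a negative value
        System.out.println(byIntegerCompare(big, negative) > 0); // true
    }
}
```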

Using the sort key defined above, write the secondary sort over the test data:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

/**
 * DT_Spark Big Data Dream Factory
 * Secondary sort, step by step:
 * Step 1: implement a custom sort key using the Ordered and Serializable interfaces
 * Step 2: load the file to be sorted and build an RDD of <key, value> pairs
 * Step 3: call sortByKey to sort on the custom key
 * Step 4: drop the sort key and keep only the sorted values
 */
public class SecondarySortKeyApp {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SecondarySortKeyApp").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("C:\\Users\\Test.txt");

        // Attach the custom key to every line
        JavaPairRDD<SecondarySortKey, String> pairs = lines
                .mapToPair(new PairFunction<String, SecondarySortKey, String>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public Tuple2<SecondarySortKey, String> call(String line) throws Exception {
                        String[] splited = line.split(" ");
                        SecondarySortKey key = new SecondarySortKey(Integer.valueOf(splited[0]),
                                Integer.valueOf(splited[1]));
                        return new Tuple2<SecondarySortKey, String>(key, line);
                    }
                });

        // Sort ascending on the custom key; use sortByKey(false) for descending order
        JavaPairRDD<SecondarySortKey, String> sorted = pairs.sortByKey();

        // Drop the custom key and keep only the sorted lines
        JavaRDD<String> secondarySort = sorted.map(new Function<Tuple2<SecondarySortKey, String>, String>() {
            @Override
            public String call(Tuple2<SecondarySortKey, String> sortedContent) throws Exception {
                return sortedContent._2;
            }
        });

        secondarySort.foreach(new VoidFunction<String>() {
            @Override
            public void call(String sorted) throws Exception {
                System.out.println(sorted);
            }
        });

        sc.close();
    }
}

Output:

2 1

2 3

3 2

4 1

4 3

8 7
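The four steps (define a key, pair each line with its key, sort by key, drop the key) can also be traced without a Spark cluster. Below is a minimal sketch in plain Java, assuming a hypothetical `PairKey` class as a stand-in for the custom key and an `ascending` flag that plays the role of the boolean passed to sortByKey. It illustrates the pattern only and is not Spark's implementation.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical class name; traces the four steps with plain Java collections.
class FourStepSortSketch {

    // Step 1: a key that defines its own ordering (stand-in for the custom sort key).
    static final class PairKey implements Comparable<PairKey> {
        final int first;
        final int second;

        PairKey(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public int compareTo(PairKey other) {
            int byFirst = Integer.compare(first, other.first);
            return byFirst != 0 ? byFirst : Integer.compare(second, other.second);
        }
    }

    static List<String> sort(List<String> lines, boolean ascending) {
        // Step 2: attach a key to every line, giving (key, line) pairs.
        List<Map.Entry<PairKey, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split(" ");
            PairKey key = new PairKey(Integer.parseInt(cols[0]), Integer.parseInt(cols[1]));
            pairs.add(new SimpleEntry<>(key, line));
        }

        // Step 3: sort the pairs by key, ascending or descending.
        Comparator<Map.Entry<PairKey, String>> byKey = Map.Entry.comparingByKey();
        pairs.sort(ascending ? byKey : byKey.reversed());

        // Step 4: drop the key and keep only the original lines.
        List<String> result = new ArrayList<>();
        for (Map.Entry<PairKey, String> pair : pairs) {
            result.add(pair.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("2 3", "4 1", "3 2", "4 3", "8 7", "2 1");
        System.out.println(sort(input, true));   // ascending
        System.out.println(sort(input, false));  // descending
    }
}
```

With `ascending` set to `false`, the six test lines come out in the reverse of the ascending order: 8 7, 4 3, 4 1, 3 2, 2 3, 2 1.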

Next, the same secondary sort in Scala, where the code is much shorter.

First, define the custom sort key:

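/**
 * DT_Spark Big Data Dream Factory
 * Custom key for the secondary sort
 */
class SecondarySortKey(val first: Int, val second: Int) extends Ordered[SecondarySortKey] with Serializable {
  // Compare by first; fall back to second only when the first fields are equal
  def compare(other: SecondarySortKey): Int = {
    if (this.first != other.first) {
      Integer.compare(this.first, other.first)
    } else {
      Integer.compare(this.second, other.second)
    }
  }
}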

Then implement the secondary sort using the custom key:

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Secondary sort, step by step:
 * Step 1: implement a custom sort key using the Ordered and Serializable traits
 * Step 2: load the file to be sorted and build an RDD of <key, value> pairs
 * Step 3: call sortByKey to sort on the custom key
 * Step 4: drop the sort key and keep only the sorted values
 */
object SecondarySortKeyApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondarySortKeyApp").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("C:\\Users\\Test.txt")

    // Attach the key to form (key, value) pairs, i.e. Tuple2(key, value)
    val pairWithSortKey = lines.map { line =>
      val cols = line.split(" ")
      (new SecondarySortKey(cols(0).toInt, cols(1).toInt), line)
    }

    // Sort ascending on the custom key
    val sorted = pairWithSortKey.sortByKey()

    // Drop the key and keep only the value
    val sortedResult = sorted.map(sort => sort._2)

    // Print the result
    sortedResult.collect().foreach(println)
  }
}

Output:

2 1

2 3

3 2

4 1

4 3

8 7

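As the code above shows, the Scala implementation of the secondary sort is considerably shorter than the Java one, which is one of Scala's strengths.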
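The introduction mentions sorting on more than two columns, such as a tertiary sort. The same key idea extends naturally: compare column by column and stop at the first difference. Here is a minimal plain Java sketch (no Spark) under that assumption. The class name, method names, and the three-column sample rows are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name; column-by-column ordering for any number of columns.
class MultiColumnSortSketch {

    // Compare two rows column by column; the first differing column decides.
    static int compareRows(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i], b[i]);
            if (c != 0) {
                return c;
            }
        }
        // All shared columns are equal: the shorter row sorts first.
        return Integer.compare(a.length, b.length);
    }

    static List<String> sortLines(List<String> lines) {
        // Parse each space-separated line into a row of ints.
        List<int[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            int[] row = new int[cols.length];
            for (int i = 0; i < cols.length; i++) {
                row[i] = Integer.parseInt(cols[i]);
            }
            rows.add(row);
        }

        rows.sort(MultiColumnSortSketch::compareRows);

        // Format the sorted rows back into space-separated lines.
        List<String> result = new ArrayList<>();
        for (int[] row : rows) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(row[i]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up three-column sample rows
        List<String> input = Arrays.asList("2 3 1", "2 3 0", "1 9 9", "2 1 5");
        sortLines(input).forEach(System.out::println);
    }
}
```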

More advanced sorting algorithms and the internals of sorting will be covered in later lessons.

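Notes:

Source: DT_大數據夢工廠 (DT Big Data Dream Factory)

For more exclusive content, follow the WeChat public account: DT_Spark

If you are interested in big data and Spark, you can attend the permanently free Spark public course taught by Wang Jialin every evening at 20:00. YY room number: 68917580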
