MapReduce in Practice (Part 2)

Published: 2020-06-13 21:06:49 | Source: web | Views: 614 | Author: yushiwh | Category: Development

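4.1. Sorting in MapReduce: a first look

4.1.1 Requirement

Aggregate the upstream and downstream traffic recorded in the log data, and output the results sorted by total traffic in descending order.

The data looks like this: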

1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200

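4.1.2 Analysis

Basic idea: implement a custom bean that encapsulates the traffic information, and use that bean as the key of the map output.

While processing data, an MR program sorts it: the key-value pairs emitted by map are sorted before they reach reduce, and the sort is driven by the map output key.

So to apply our own ordering, we can put the sort criterion into the key, have the key class implement the WritableComparable interface, and override the key's compareTo method.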
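The ordering rule itself can be tried outside Hadoop first. The following is a minimal sketch in plain Java with no Hadoop dependency; the names `FlowSortSketch` and `Flow` are made up for illustration and the sample values are only examples. It sorts a few (phone, upflow, downflow) triples by total traffic in descending order, using the kind of comparison a map output key's compareTo could use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FlowSortSketch {

    // Hypothetical stand-in for the map output key: ordered by total traffic, descending.
    static final class Flow implements Comparable<Flow> {
        final String phone;
        final long upflow;
        final long downflow;

        Flow(String phone, long upflow, long downflow) {
            this.phone = phone;
            this.upflow = upflow;
            this.downflow = downflow;
        }

        long sum() {
            return upflow + downflow;
        }

        @Override
        public int compareTo(Flow other) {
            // The arguments are swapped so that larger totals sort first.
            return Long.compare(other.sum(), this.sum());
        }
    }

    // Returns the phone numbers ordered by total traffic, largest first.
    static List<String> phonesByTotalDesc(List<Flow> flows) {
        List<Flow> sorted = new ArrayList<>(flows);
        Collections.sort(sorted);
        List<String> phones = new ArrayList<>();
        for (Flow f : sorted) {
            phones.add(f.phone);
        }
        return phones;
    }

    public static void main(String[] args) {
        List<Flow> flows = Arrays.asList(
                new Flow("13826544101", 264, 0),
                new Flow("13726230503", 2481, 24681),
                new Flow("13926435656", 132, 1512));
        System.out.println(phonesByTotalDesc(flows));
    }
}
```

In a real job the framework does the sorting during the shuffle; the key class only has to supply the comparison.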

4.1.3 Implementation

1. The custom bean

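import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

    private long upflow;
    private long downflow;
    private long sumflow;

    // Once another constructor is declared, the no-arg constructor must be declared
    // explicitly; otherwise deserialization throws an exception.
    public FlowBean() {}

    public FlowBean(long upflow, long downflow) {
        this.upflow = upflow;
        this.downflow = downflow;
        this.sumflow = upflow + downflow;
    }

    public long getSumflow() {
        return sumflow;
    }

    public void setSumflow(long sumflow) {
        this.sumflow = sumflow;
    }

    public long getUpflow() {
        return upflow;
    }

    public void setUpflow(long upflow) {
        this.upflow = upflow;
    }

    public long getDownflow() {
        return downflow;
    }

    public void setDownflow(long downflow) {
        this.downflow = downflow;
    }

    // Serialization: write the object's fields to the output stream.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upflow);
        out.writeLong(downflow);
        out.writeLong(sumflow);
    }

    // Deserialization: read the fields back in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        upflow = in.readLong();
        downflow = in.readLong();
        sumflow = in.readLong();
    }

    @Override
    public String toString() {
        return upflow + "\t" + downflow + "\t" + sumflow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // Custom descending order by total traffic. Long.compare returns 0 for equal
        // totals, which honors the compareTo contract; keys that compare equal arrive
        // at the reducer as one group, so the reducer below iterates over all values.
        return Long.compare(o.sumflow, sumflow);
    }
}

2. Mapper and reducer

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowCount {

    static class FlowCountMapper extends Mapper<LongWritable, Text, FlowBean, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line holds: phone number, upstream traffic, downstream traffic (tab-separated).
            String[] fields = value.toString().split("\t");
            try {
                String phonenbr = fields[0];
                long upflow = Long.parseLong(fields[1]);
                long dflow = Long.parseLong(fields[2]);
                // The bean is the key so that the framework sorts by it; the phone number travels as the value.
                context.write(new FlowBean(upflow, dflow), new Text(phonenbr));
            } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
                // Skip malformed lines instead of failing the task.
                e.printStackTrace();
            }
        }
    }

    static class FlowCountReducer extends Reducer<FlowBean, Text, Text, FlowBean> {

        @Override
        protected void reduce(FlowBean bean, Iterable<Text> phonenbrs, Context context) throws IOException, InterruptedException {
            // Several phone numbers can share the same total, so write out every value in the group.
            for (Text phoneNbr : phonenbrs) {
                context.write(phoneNbr, bean);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(FlowCount.class);

        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);

        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // job.setInputFormatClass(TextInputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}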

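4.2. Partitioning in MapReduce: the Partitioner

4.2.1 Requirement

Write the traffic statistics to different files according to the home region of each phone number, so that queries over the results can be narrowed down to a single province.

4.2.2 Analysis

MapReduce groups the key-value pairs emitted by map by key and distributes them to the different reduce tasks.

The default distribution rule is: the key's hashCode modulo the number of reduce tasks.

So, to distribute the data according to our own rule, we need to replace the component that performs the distribution, the Partitioner.

Define a CustomPartitioner that extends the abstract class Partitioner,

then register it on the job object: job.setPartitionerClass(CustomPartitioner.class)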
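The two distribution rules can be compared side by side without a cluster. The sketch below is plain Java; `PartitionSketch` and its prefix table are assumptions made for illustration. `hashPartition` uses the formula of Hadoop's HashPartitioner (sign bit masked off, then the remainder), applied here to a String rather than a Hadoop key, and `prefixPartition` shows a custom rule keyed on the first three digits of a phone number.

```java
import java.util.HashMap;
import java.util.Map;

class PartitionSketch {

    // Default-style rule: mask the sign bit so the result is never negative, then take the remainder.
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical table: phone-number prefix -> partition number.
    static final Map<String, Integer> PREFIX_TO_PARTITION = new HashMap<>();
    static {
        PREFIX_TO_PARTITION.put("135", 0);
        PREFIX_TO_PARTITION.put("136", 1);
        PREFIX_TO_PARTITION.put("137", 2);
        PREFIX_TO_PARTITION.put("138", 3);
        PREFIX_TO_PARTITION.put("139", 4);
    }

    // Custom rule: known prefixes get their own partition, everything else goes to partition 5.
    static int prefixPartition(String phone) {
        if (phone.length() < 3) {
            return 5;
        }
        Integer partition = PREFIX_TO_PARTITION.get(phone.substring(0, 3));
        return partition == null ? 5 : partition;
    }

    public static void main(String[] args) {
        String phone = "13826544101";
        System.out.println("hash rule:   " + hashPartition(phone, 6));
        System.out.println("prefix rule: " + prefixPartition(phone));
    }
}
```

With a custom rule like this one, the number of reduce tasks has to cover every partition number the rule can return (six in this sketch), since each partition is handled by its own reduce task and written to its own output file.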

4.2.3 Implementation

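import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * ProvincePartitioner: a custom rule for distributing data from map to reduce,
 * based on the province that a phone number belongs to.
 * The default partitioning component is HashPartitioner.
 */
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

    // Phone-number prefix -> partition number.
    static Map<String, Integer> provinceMap = new HashMap<String, Integer>();

    static {
        provinceMap.put("135", 0);
        provinceMap.put("136", 1);
        provinceMap.put("137", 2);
        provinceMap.put("138", 3);
        provinceMap.put("139", 4);
    }

    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        Integer code = provinceMap.get(key.toString().substring(0, 3));
        // Unknown prefixes all go to partition 5, so the job must run with at least 6 reduce tasks.
        return code == null ? 5 : code;
    }
}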

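4.3. Data compression in MapReduce

4.3.1 Overview

Compression is one of MapReduce's optimization strategies: the output of the mapper or reducer is compressed with a compression codec to reduce disk IO and speed up the MR program, at the cost of extra CPU work.

1. MapReduce can compress the map output or the reduce output, which reduces network IO or the size of the final output.

2. Used well, compression improves performance; used badly, it can reduce it.

3. Rules of thumb:

For compute-intensive jobs, use compression sparingly.

For IO-intensive jobs, use it more.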
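The trade-off can be observed with the JDK alone. The sketch below is not Hadoop code: it uses `java.util.zip` rather than a Hadoop codec, and the class name `GzipTradeoffSketch` and the sample line are made up for illustration. It compresses some repetitive, log-like text, reports the size before and after, and times the compression, which is the CPU cost paid in exchange for less IO.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipTradeoffSketch {

    // Compresses a byte array with gzip.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Restores the original bytes from gzip-compressed data.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Builds repetitive, log-like text; real logs compress less uniformly than this.
    static byte[] sampleLog(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("13726230503\t2481\t24681\t27162\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleLog(10_000);
        long start = System.nanoTime();
        byte[] packed = gzip(raw);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes in " + micros + " us");
    }
}
```

Whether the saved IO outweighs the added CPU time depends on the data and the job, which is why the rules of thumb distinguish compute-intensive from IO-intensive jobs.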

4.3.2 Compression codecs supported by MR

[Figure not preserved: compression codecs supported by MR]

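4.3.3 Compressing reducer output

Compression of the reduce output can be set either through configuration parameters or in code.

1. Through configuration parameters (the values shown are the defaults; set the first one to true to turn compression on)

mapreduce.output.fileoutputformat.compress=false

mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

mapreduce.output.fileoutputformat.compress.type=RECORD

2. In code

Job job = Job.getInstance(conf);
FileOutputFormat.setCompressOutput(job, true);
// GzipCodec is used here as an example; any CompressionCodec implementation can be passed.
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);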

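4.3.4 Compressing mapper output

Compression of the map output can be set either through configuration parameters or in code.

1. Through configuration parameters (the values shown are the defaults; set the first one to true to turn compression on)

mapreduce.map.output.compress=false

mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

2. In code:

conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, GzipCodec.class, CompressionCodec.class);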

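4.3.5 Reading compressed files

The InputFormat classes that ship with Hadoop have built-in support for reading compressed files. TextInputFormat is one example: the LineRecordReader it creates handles compression in its initialize method: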

public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);

    // Pick the codec that matches the file name suffix (null if the file is not compressed).
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null != codec) {
        isCompressedInput = true;
        decompressor = CodecPool.getDecompressor(codec);
        // Check whether the codec supports splitting.
        if (codec instanceof SplittableCompressionCodec) {
            final SplitCompressionInputStream cIn =
                ((SplittableCompressionCodec) codec).createInputStream(
                    fileIn, decompressor, start, end,
                    SplittableCompressionCodec.READ_MODE.BYBLOCK);
            // Splittable codec: read the compressed data through a CompressedSplitLineReader.
            in = new CompressedSplitLineReader(cIn, job,
                this.recordDelimiterBytes);
            start = cIn.getAdjustedStart();
            end = cIn.getAdjustedEnd();
            filePosition = cIn;
        } else {
            // Non-splittable codec: wrap the file stream in a decompressing stream
            // and hand that to an ordinary SplitLineReader.
            in = new SplitLineReader(codec.createInputStream(fileIn,
                decompressor), job, this.recordDelimiterBytes);
            filePosition = fileIn;
        }
    } else {
        fileIn.seek(start);
        // Not a compressed file: read it with an ordinary SplitLineReader.
        in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
        filePosition = fileIn;
    }
    // ... (the rest of the method is not shown in this excerpt)
}
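The decision made in that method can be summarized as a small lookup. The following is an illustrative sketch in plain Java, not Hadoop's CompressionCodecFactory: `CodecLookupSketch` and its methods are hypothetical, and the table covers only three of the bundled codecs with their default file extensions. It shows how a file name suffix leads to a codec, and how the codec in turn decides which kind of reader is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CodecLookupSketch {

    // Default file extension -> codec class name, for three codecs bundled with Hadoop.
    static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
    }

    // Returns the codec class name for a file, or null when the suffix is not a known compressed format.
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Of the three codecs listed above, only bzip2 supports splitting.
    static boolean isSplittable(String codecClassName) {
        return codecClassName != null && codecClassName.endsWith("BZip2Codec");
    }

    // Mirrors the three branches of the initialize method: splittable, non-splittable, uncompressed.
    static String readerFor(String fileName) {
        String codec = codecFor(fileName);
        if (codec == null) {
            return "SplitLineReader on the raw file stream";
        }
        return isSplittable(codec)
                ? "CompressedSplitLineReader"
                : "SplitLineReader on a decompressed stream";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"part-00000", "access.log.gz", "access.log.bz2"}) {
            System.out.println(name + " -> " + readerFor(name));
        }
    }
}
```

One practical consequence: a large file compressed with a non-splittable codec is read by a single map task, because it cannot be cut into independent splits.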

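4.4. More MapReduce programming examples

4.4.1 Implementing a reduce-side join

1. Requirement:

Order table t_order:

id	date	pid	amount
1001	20150710	P0001	2
1002	20150710	P0001	3
1002	20150710	P0002	3

Product table t_product:

id	name	category_id	price
P0001	Xiaomi 5	C01	2
P0002	Smartisan T1	C01	3

Suppose the data volume is huge and both tables are stored as files in HDFS. We need a MapReduce program that performs the following SQL query:

select a.id,a.date,b.name,b.category_id,b.price from t_order a join t_product b on a.pid = b.id

2. Mechanism:

Use the join condition as the map output key, so that the records of both tables that satisfy the join condition, each tagged with the file it came from, are sent to the same reduce task; the records are then stitched together in reduce.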

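import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class OrderJoin {

    static class OrderJoinMapper extends Mapper<LongWritable, Text, Text, OrderJoinBean> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Take one line of input and work out which file it belongs to.
            String line = value.toString();
            String[] fields = line.split("\t");

            // Name of the file this line comes from, obtained through the input split.
            String name = ((FileSplit) context.getInputSplit()).getPath().getName();

            // The join key is the product id: column pid (index 2) in t_order, column id (index 0) in t_product.
            // Assumption: order files have names starting with "t_order"; adapt this test to the real file names.
            String itemid = name.startsWith("t_order") ? fields[2] : fields[0];

            // Depending on the file name, pick out the relevant fields
            // (two fields for file a, three fields for file b).
            // OrderJoinBean is not shown in this article; the call below is a placeholder.
            OrderJoinBean bean = new OrderJoinBean();
            bean.set(null, null, null, null, null);
            context.write(new Text(itemid), bean);
        }
    }

    static class OrderJoinReducer extends Reducer<Text, OrderJoinBean, OrderJoinBean, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<OrderJoinBean> beans, Context context) throws IOException, InterruptedException {
            // The key is one itemid, for example 1000.
            // The beans come from both kinds of file:
            // {1000,amount} {1000,amount} {1000,amount} --- {1000,price,name}
            // Combine the fields of the bean from file b with each bean from file a, and write out the results.
        }
    }
}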
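Since the skeleton above leaves the bean and the reduce body open, here is a minimal in-memory sketch of the same mechanism in plain Java. It is an illustration under assumptions, not the article's implementation: `ReduceSideJoinSketch` is a made-up name, a `TreeMap` stands in for the shuffle, and tagged String arrays stand in for the bean. Records of both tables are keyed by product id and tagged with their source; each group is then joined the way a reduce call would join it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {

    // "Map" side: tag each record with its source table and group it under the product id.
    // Order lines: id, date, pid, amount. Product lines: id, name, category_id, price.
    static Map<String, List<String[]>> shuffle(List<String> orderLines, List<String> productLines) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String line : orderLines) {
            String[] f = line.split("\t");
            groups.computeIfAbsent(f[2], k -> new ArrayList<>())
                  .add(new String[] {"order", f[0], f[1]});
        }
        for (String line : productLines) {
            String[] f = line.split("\t");
            groups.computeIfAbsent(f[0], k -> new ArrayList<>())
                  .add(new String[] {"product", f[1], f[2], f[3]});
        }
        return groups;
    }

    // "Reduce" side: within one product id, pair the product record with every order record.
    static List<String> reduce(List<String[]> tagged) {
        String[] product = null;
        List<String[]> orders = new ArrayList<>();
        for (String[] record : tagged) {
            if ("product".equals(record[0])) {
                product = record;
            } else {
                orders.add(record);
            }
        }
        List<String> joined = new ArrayList<>();
        if (product == null) {
            return joined; // inner join: orders without a matching product are dropped
        }
        for (String[] order : orders) {
            // Output columns: a.id, a.date, b.name, b.category_id, b.price
            joined.add(String.join("\t", order[1], order[2], product[1], product[2], product[3]));
        }
        return joined;
    }

    static List<String> join(List<String> orderLines, List<String> productLines) {
        List<String> out = new ArrayList<>();
        for (List<String[]> group : shuffle(orderLines, productLines).values()) {
            out.addAll(reduce(group));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList(
                "1001\t20150710\tP0001\t2", "1002\t20150710\tP0001\t3", "1002\t20150710\tP0002\t3");
        List<String> products = Arrays.asList("P0001\tXiaomi 5\tC01\t2", "P0002\tSmartisan T1\tC01\t3");
        join(orders, products).forEach(System.out::println);
    }
}
```

Note that every order for a given product lands in the same group; a very popular product therefore produces one very large group, which is the skew problem of this approach.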

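Drawback: with this approach the join is performed in the reduce phase, so the reduce side carries too much of the processing load while the map nodes do very little work. Resource utilization is poor, and data skew arises very easily in the reduce phase.

Solution: perform the join on the map side.

4.4.2 Implementing a map-side join

1. How it works

This approach suits joins in which one of the tables is small.

The small table can be distributed to every map node, so that each map node joins the big-table data it reads against its local copy and emits the final result directly. This greatly increases the parallelism of the join and speeds up processing.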
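Stripped of the framework, the idea is a hash join: build a map from the small table once, then probe it for each record of the big table. The sketch below is plain Java under assumed formats (small-table lines as "id,name", big-table lines as "itemid<TAB>amount"); `MapSideJoinSketch` and its method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {

    // Done once per task: load the small table into memory. Lines are "id,name".
    static Map<String, String> loadSmallTable(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            if (f.length >= 2) {
                table.put(f[0], f[1]);
            }
        }
        return table;
    }

    // Done once per big-table record: probe the in-memory table and emit the joined line.
    // Returns null when the item id has no match in the small table.
    static String joinLine(String line, Map<String, String> smallTable) {
        String[] f = line.split("\t");
        String name = smallTable.get(f[0]);
        return name == null ? null : f[0] + "\t" + f[1] + "\t" + name;
    }

    public static void main(String[] args) {
        Map<String, String> small = loadSmallTable(Arrays.asList("1001,banana", "1002,apple"));
        for (String line : Arrays.asList("1001\t98.9", "1002\t12.5", "1003\t7.0")) {
            System.out.println(joinLine(line, small));
        }
    }
}
```

Because each record is joined where it is read, no shuffle and no reduce phase are needed; the price is that the small table must fit in the memory of every map task.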

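2. Example implementation

-- First, a version with the small table predefined in the mapper class, used for the join

-- Then the approach used in real scenarios: load the table once from a database, or use the distributed cache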

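import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TestDistributedCache {

    static class TestDistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

        HashMap<String, String> b_tab = new HashMap<String, String>();
        String localpath = null;

        // Called once when the map task is initialized.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // These lines obtain the local absolute path of the cache file; used for testing and verification.
            // (getLocalCacheFiles is deprecated in newer Hadoop versions.)
            Path[] files = context.getLocalCacheFiles();
            localpath = files[0].toString();
            // URIs as registered with job.addCacheFile (not used further here).
            URI[] cacheFiles = context.getCacheFiles();

            // How the cache file is used: read it directly with local IO.
            // The data read here is a small file in the local working directory of the machine running the map task.
            try (BufferedReader reader = new BufferedReader(new FileReader("b.txt"))) {
                String line;
                while (null != (line = reader.readLine())) {
                    String[] fields = line.split(",");
                    b_tab.put(fields[0], fields[1]);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // The data read here is the split (on HDFS) that this map task is responsible for.
            String[] fields = value.toString().split("\t");

            String a_itemid = fields[0];
            String a_amount = fields[1];

            String b_name = b_tab.get(a_itemid);

            // Joined record, e.g. item 1001, amount 98.9, name banana
            // (the cache file path is included for verification).
            context.write(new Text(a_itemid), new Text(a_amount + "\t" + ":" + localpath + "\t" + b_name));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(TestDistributedCache.class);

        job.setMapperClass(TestDistributedCacheMapper.class);

        job.setOutputKeyClass(Text.class);
        // The mapper emits Text values, so the output value class is Text as well.
        job.setOutputValueClass(Text.class);

        // Path of the regular data to be processed.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // No reducer is needed.
        job.setNumReduceTasks(0);

        // Distribute a file to the working directory of the task process.
        job.addCacheFile(new URI("hdfs://hadoop-server01:9000/cachefile/b.txt"));

        // Distribute an archive to the working directory of the task process.
        //job.addCacheArchive(archiveUri);

        // Distribute a jar to the classpath of the task nodes.
        //job.addFileToClassPath(jarfile);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}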

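4.4.3 Preprocessing web logs

1. Requirement:

Identify and split out each field of the web access log.

Remove invalid records from the log.

Produce filtered access-request data of various kinds, according to the KPI statistics required.

2. Implementation:

a) Define a bean that holds the data fields of a log record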

public class WebLogBean {

    private String remote_addr;      // IP address of the client
    private String remote_user;      // user name of the client; "-" when absent
    private String time_local;       // access time and time zone
    private String request;          // requested URL and HTTP protocol
    private String status;           // response status; 200 means success
    private String body_bytes_sent;  // size of the body sent to the client
    private String http_referer;     // page from which the request was linked
    private String http_user_agent;  // information about the client browser

    private boolean valid = true;    // whether the record is valid

    public String getRemote_addr() {
        return remote_addr;
    }

    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }

    public String getRemote_user() {
        return remote_user;
    }

    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }

    public String getTime_local() {
        return time_local;
    }

    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }

    public String getRequest() {
        return request;
    }

    public void setRequest(String request) {
        this.request = request;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }

    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }

    public String getHttp_referer() {
        return http_referer;
    }

    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }

    public String getHttp_user_agent() {
        return http_user_agent;
    }

    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }

    // Fields are joined with the \001 control character as the separator.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append(this.valid);
        sb.append("\001").append(this.remote_addr);
        sb.append("\001").append(this.remote_user);
        sb.append("\001").append(this.time_local);
        sb.append("\001").append(this.request);
        sb.append("\001").append(this.status);
        sb.append("\001").append(this.body_bytes_sent);
        sb.append("\001").append(this.http_referer);
        sb.append("\001").append(this.http_user_agent);
        return sb.toString();
    }
}

b) Define a parser that parses and filters the raw web access log records

public class WebLogParser {

    public static WebLogBean parser(String line) {
        WebLogBean webLogBean = new WebLogBean();
        String[] arr = line.split(" ");
        if (arr.length > 11) {
            webLogBean.setRemote_addr(arr[0]);
            webLogBean.setRemote_user(arr[1]);
            // Drop the leading "[" of the timestamp.
            webLogBean.setTime_local(arr[3].substring(1));
            webLogBean.setRequest(arr[6]);
            webLogBean.setStatus(arr[8]);
            webLogBean.setBody_bytes_sent(arr[9]);
            webLogBean.setHttp_referer(arr[10]);

            if (arr.length > 12) {
                webLogBean.setHttp_user_agent(arr[11] + " " + arr[12]);
            } else {
                webLogBean.setHttp_user_agent(arr[11]);
            }
            try {
                if (Integer.parseInt(webLogBean.getStatus()) >= 400) { // 400 and above: HTTP error
                    webLogBean.setValid(false);
                }
            } catch (NumberFormatException e) {
                // A non-numeric status means the record is malformed.
                webLogBean.setValid(false);
            }
        } else {
            webLogBean.setValid(false);
        }
        return webLogBean;
    }

    public static String parserTime(String time) {
        // Strings are immutable, so the result of replace has to be returned.
        return time.replace("/", "-");
    }
}
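To see why the parser reads indices 0, 1, 3, 6, 8, 9, 10 and 11, it helps to split one line by hand. The sketch below is plain Java; the class `LogFieldSketch` and the sample line are made up, and the line assumes a log in the nginx combined format, which the field names suggest but the article does not state. It prints each space-separated token with its index.

```java
class LogFieldSketch {

    // Made-up sample line, assumed to follow the nginx combined log format.
    static final String SAMPLE =
            "203.0.113.7 - - [18/Sep/2013:06:49:18 +0000] \"GET /index.html HTTP/1.1\" "
            + "200 512 \"-\" \"Mozilla/5.0 (compatible)\"";

    // Same splitting rule as the parser: one token per space.
    static String[] split(String line) {
        return line.split(" ");
    }

    // Replaces the slashes in the timestamp with dashes and returns the new string.
    static String normalizeTime(String time) {
        return time.replace("/", "-");
    }

    public static void main(String[] args) {
        String[] arr = split(SAMPLE);
        for (int i = 0; i < arr.length; i++) {
            System.out.println(i + ": " + arr[i]);
        }
        System.out.println("time: " + normalizeTime(arr[3].substring(1)));
    }
}
```

Splitting on single spaces cuts the quoted request and user-agent strings into several tokens, which is why the request URL sits at index 6 and the user agent has to be reassembled from indices 11 and 12.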

c) The MapReduce program

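import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeblogPreProcess {

    static class WeblogPreProcessMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        // Reused across calls to avoid allocating a new object per record.
        Text k = new Text();
        NullWritable v = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            WebLogBean webLogBean = WebLogParser.parser(line);
            // Drop invalid records.
            if (!webLogBean.isValid()) {
                return;
            }
            k.set(webLogBean.toString());
            context.write(k, v);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(WeblogPreProcess.class);

        job.setMapperClass(WeblogPreProcessMapper.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}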
