What Are the MapReduce Design Patterns?

Published: 2022-01-04 10:59:32 | Source: 億速云 | Views: 178 | Author: iii | Category: Cloud Computing

This article walks through the main MapReduce design patterns with brief notes on each. The approaches described are simple, quick to apply, and practical.

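1 Summarization Patterns

1.1 Numerical Summarizations

This one is effectively built in, because it is the basic MapReduce model itself. It corresponds to aggregates such as COUNT and MAX in SQL; WordCount is an implementation of this pattern.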
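As a rough illustration of the idea, the sketch below computes a per-key COUNT and MAX in plain Java, with no Hadoop dependency. The class name `NumericalSummarySketch`, the `{count, max}` array layout, and the sample data are assumptions made for this example; in a real job the grouping would be done by the shuffle and the folding by a reducer (or combiner).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NumericalSummarySketch {

    /** Returns, per key, a two-element array {count, max} of that key's values. */
    static Map<String, long[]> summarize(List<Map.Entry<String, Integer>> records) {
        Map<String, long[]> stats = new TreeMap<>();
        for (Map.Entry<String, Integer> record : records) {
            // "Map" side: the entry key is the grouping key.
            // "Reduce" side: fold each value into the running statistics.
            long[] s = stats.computeIfAbsent(record.getKey(),
                    k -> new long[] {0L, Long.MIN_VALUE});
            s[0]++;                                   // COUNT
            s[1] = Math.max(s[1], record.getValue()); // MAX
        }
        return stats;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("alice", 3), Map.entry("bob", 7), Map.entry("alice", 9));
        summarize(records).forEach((key, s) ->
                System.out.println(key + " count=" + s[0] + " max=" + s[1]));
    }
}
```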

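1.2 Inverted Index Summarizations

The name sounds arcane, but to my mind this is less a pattern than an application; it involves no MapReduce design of its own. At its core it indexes list(V3), where V3 is a referencing ID. The expected result of this pattern is:
url -> list of id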
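The `url -> list of id` result can be sketched in plain Java as follows. This is only an illustration: the class name `InvertedIndexSketch` and the input shape (document id mapped to the URLs it references) are assumptions, and the in-memory map stands in for the shuffle that a real job would perform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {

    /** Inverts "doc id -> referenced URLs" into "URL -> ids of the docs that reference it". */
    static Map<String, List<String>> build(Map<String, List<String>> urlsByDocId) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : urlsByDocId.entrySet()) {
            for (String url : doc.getValue()) {
                // Map: emit (url, docId). Reduce: collect all ids seen for the same url.
                index.computeIfAbsent(url, k -> new ArrayList<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new TreeMap<>();
        docs.put("d1", List.of("http://a", "http://b"));
        docs.put("d2", List.of("http://a"));
        System.out.println(build(docs));
    }
}
```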

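1.3 Counting with Counters

Counters are fast, simple, and easy to use. The cost is memory on the TaskTracker and, more importantly, on the JobTracker. In the Hadoop 1.0 era the advice was therefore to use only tens of counters, and certainly fewer than 100. In the 2.0 era the job-tracking role became per job, so I expect more counters can be used, though they still suit algorithms that count totals best.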
context.getCounter(STATE_COUNTER_GROUP, UNKNOWN_COUNTER).increment(1);

2 Filtering Patterns

2.1 Filtering

This one is also effectively built in: if the Mapper does not write a record, that record is filtered out. It corresponds to WHERE in SQL.

map(key, record):
    if we want to keep record then
        emit key, record

2.2 Bloom Filtering

I never knew why it was called a Bloom filter until I read the Wikipedia entry, quoted here:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate.
For how it works, see this article:

http://blog.csdn.net/jiaomeng/article/details/1495500
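Put in one sentence: hash every member of the set with several hash functions and record the results in a bitmap. If all of an item's hashes land on set bits, the item may or may not be in the set; if any of its hashes lands on an unset bit, the item is definitely not in the set. It is a neat filter and is widely used in HBase. Next I need to try it out in practice.

Note: write a small program to play with this.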
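A small program along those lines might look like the sketch below. It is a minimal, assumption-laden version: the class name `BloomFilterSketch` is hypothetical, and instead of several independent hash functions it derives the bit positions from two base hashes (a common double-hashing shortcut), which is a simplification and not how Hadoop or HBase implement their filters.

```java
import java.util.BitSet;

final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** The i-th bit position for an item, derived from two base hashes (an assumption of this sketch). */
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // second hash, forced odd so it is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("hadoop");
        System.out.println(filter.mightContain("hadoop")); // true: added items are never missed
        System.out.println(filter.mightContain("spark"));  // usually false, but a false positive is possible
    }
}
```

In a filtering job, a filter like this would typically be built ahead of time from the set of interest and consulted in the mapper to drop records early.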

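2.3 Top Ten (Top N)

This is a typical top-N computation, similar to TOP or LIMIT in SQL, usually combined with some condition.
Implementation: I like pseudocode because it makes things clear at a glance: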

class mapper:
    setup():
        initialize top ten sorted list

    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10 then
            truncate list to a length of 10

    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list

    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
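The pseudocode above can be approximated in plain Java as follows. This is a sketch under assumptions: records are plain integers, a min-heap replaces the "sorted list plus truncate" step, and the hypothetical `topNAcrossSplits` plays the role of a single reducer that merges each mapper's local candidates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopNSketch {

    /** Keeps the n largest values seen, the way each mapper (and the reducer) would. */
    static List<Integer> topN(Iterable<Integer> values, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap: smallest value on top
        for (int value : values) {
            heap.add(value);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest so at most n values remain
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder());
        return result;
    }

    /** Each split yields a local top n; one reducer merges those candidates. */
    static List<Integer> topNAcrossSplits(List<List<Integer>> splits, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(topN(split, n)); // mapper cleanup(): emit the local winners
        }
        return topN(candidates, n);            // reducer: top n of all candidates
    }

    public static void main(String[] args) {
        System.out.println(topNAcrossSplits(List.of(List.of(5, 1, 9), List.of(7, 3, 8)), 2));
    }
}
```

Because every mapper emits under the same null key, the pattern relies on a single reducer seeing all candidates; that is cheap, since each mapper contributes at most N records.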

2.4 Distinct

This pattern is also simple: it relies on the grouping by key that precedes the reduce phase. The structure makes it clear at a glance:

map(key, record):
    emit record,null

reduce(key, records):
    emit key
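A plain-Java analogue of this pseudocode, for illustration only: the class name `DistinctSketch` is an assumption, and a sorted set stands in for the shuffle that groups identical keys before the reducer sees them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class DistinctSketch {

    static List<String> distinct(List<String> records) {
        // Map: emit (record, null). The shuffle brings identical keys together;
        // the sorted set plays that role here.
        TreeSet<String> keys = new TreeSet<>(records);
        // Reduce: emit each key exactly once and ignore the (null) values.
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of("b", "a", "b", "c", "a")));
    }
}
```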

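3 Data Organization Patterns

3.1 Structured to Hierarchical

Algorithmically this is a join; at the application level it achieves denormalization. The key to the program is MultipleInputs, which lets you attach a different Mapper to each input, so that structured content, or content in any format, can be brought together at the reducer and aggregated there into a hierarchical format.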
MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, PostMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, CommentMapper.class);
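There is a small trick when emitting from the map: prefix each output value according to its category. In the reduce phase the source of each value is then known, which makes it easier to build the Cartesian product or to tell records apart. Technically this saves writing your own Writable; in principle you can define a format that carries more information. Of course, if you have special sorting or grouping requirements, you still need a custom Writable.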
outkey.set(post.getAttribute("ParentId"));
outvalue.set("A" + value.toString());
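The prefix trick can be illustrated without Hadoop as in the sketch below. The `"P"` and `"C"` tags, the class name `PrefixJoinSketch`, and the string output format are assumptions chosen for this example; the in-memory map stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrefixJoinSketch {

    /** posts: (postId, body); comments: (parentPostId, body). Returns postId -> "post [comments]". */
    static Map<String, String> join(List<Map.Entry<String, String>> posts,
                                    List<Map.Entry<String, String>> comments) {
        // Map phase: key every record by post id and tag the value with its source.
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (Map.Entry<String, String> post : posts) {
            shuffled.computeIfAbsent(post.getKey(), k -> new ArrayList<>())
                    .add("P" + post.getValue());
        }
        for (Map.Entry<String, String> comment : comments) {
            shuffled.computeIfAbsent(comment.getKey(), k -> new ArrayList<>())
                    .add("C" + comment.getValue());
        }
        // Reduce phase: the prefix tells the reducer where each value came from.
        Map<String, String> hierarchy = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : shuffled.entrySet()) {
            String post = null;
            List<String> children = new ArrayList<>();
            for (String tagged : group.getValue()) {
                if (tagged.charAt(0) == 'P') {
                    post = tagged.substring(1);
                } else {
                    children.add(tagged.substring(1));
                }
            }
            if (post != null) { // comments without a matching post are dropped
                hierarchy.put(group.getKey(), post + " " + children);
            }
        }
        return hierarchy;
    }

    public static void main(String[] args) {
        System.out.println(join(
                List.of(Map.entry("1", "hello")),
                List.of(Map.entry("1", "nice"), Map.entry("1", "thanks"))));
    }
}
```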

3.2 Partitioning

Built in again: write your own Partitioner to route records to specific reducers.
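The routing decision itself is just a function from key to partition number. The plain-Java sketch below shows two such functions: a mask-and-modulo hash rule of the kind default hash partitioning uses, and a custom rule that sends each year to its own partition. The class and method names, and the clamping of out-of-range years, are assumptions for illustration; in a real job this logic would live in the `getPartition` method of a `Partitioner` subclass.

```java
final class PartitionSketch {

    /** Hash partitioning: clear the sign bit, then take the remainder. */
    static int hashPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Custom routing: one partition per year counted from minYear; out-of-range years are clamped. */
    static int yearPartition(int year, int minYear, int numPartitions) {
        int partition = year - minYear;
        return Math.max(0, Math.min(numPartitions - 1, partition));
    }

    public static void main(String[] args) {
        System.out.println(hashPartition("user-42", 4));
        System.out.println(yearPartition(2010, 2008, 4));
    }
}
```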

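3.3 Binning

This one is interesting. It resembles partitioning, but it is map-side only and therefore more efficient, although the results may need a further merge.
The SPLIT operation in Pig implements this pattern.
The implementation uses MultipleOutputs.addNamedOutput().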

// Configure the MultipleOutputs by adding an output called "bins"
// With the proper output format and mapper key/value pairs

MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class,
        Text.class, NullWritable.class);

// Enable the counters for the job
// If there are a significant number of different named outputs, this
// should be disabled

MultipleOutputs.setCountersEnabled(job, true);

// Map-only job
job.setNumReduceTasks(0);

3.4 Total Order Sorting

This was already covered in detail in the Hadoop part, so it is skipped here.

3.5 Shuffling

The essence of this pattern is the creation of a random key.
outkey.set(rndm.nextInt());
context.write(outkey, outvalue);
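To see why a random key is enough, the sketch below mimics the job in plain Java: each record is keyed by a random integer, a sorted map stands in for the sort-by-key that the framework performs, and the "reducer" writes the values back out without the key. The class name `ShuffleSketch` and the injected `Random` are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

final class ShuffleSketch {

    static List<String> shuffle(List<String> records, Random random) {
        // Map: emit (random int, record). The sorted map plays the role of the
        // framework's sort by key, so records end up ordered by their random keys.
        TreeMap<Integer, List<String>> byRandomKey = new TreeMap<>();
        for (String record : records) {
            byRandomKey.computeIfAbsent(random.nextInt(), k -> new ArrayList<>()).add(record);
        }
        // Reduce: emit the values only; the random key is discarded.
        List<String> shuffled = new ArrayList<>();
        for (List<String> group : byRandomKey.values()) {
            shuffled.addAll(group);
        }
        return shuffled;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(List.of("a", "b", "c", "d"), new Random()));
    }
}
```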

4 Join Patterns

4.1 Reduce Side Join

This one is fairly simple and was already covered under Structured to Hierarchical.

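4.2 Replicated Join (Map-Side Join)

A map-side join is more efficient, but the smaller data set has to be placed in the DistributedCache; the other data set is then fed to the map, and the join is completed inside the map.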
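The core of the map-side logic is a hash lookup, sketched here in plain Java. The names (`ReplicatedJoinSketch`, users and comments as the two data sets) are assumptions for illustration; in a real job the small map would be loaded from the cached file in the mapper's setup, and the loop body would be the map method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReplicatedJoinSketch {

    /**
     * smallUsersById: the small data set, held fully in memory (userId -> name).
     * largeComments: the large data set, streamed record by record as (userId, text).
     */
    static List<String> join(Map<String, String> smallUsersById,
                             List<Map.Entry<String, String>> largeComments) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> comment : largeComments) {
            String userName = smallUsersById.get(comment.getKey());
            if (userName != null) { // inner join: skip comments with no matching user
                joined.add(userName + "\t" + comment.getValue());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("u1", "alice");
        System.out.println(join(users, List.of(Map.entry("u1", "hi"), Map.entry("u2", "yo"))));
    }
}
```

No shuffle or reduce phase is needed, which is where the efficiency comes from; the price is that the small data set must fit in each mapper's memory.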

4.3 Composite Join

This pattern is also a map-side join, and it can join two large data sets. Its limitation is that the data sets must be organized identically. What does that mean? The conditions are:
• An inner or full outer join is desired.
• All the data sets are sufficiently large.
• All data sets can be read with the foreign key as the input key to the mapper.
• All data sets have the same number of partitions.
• Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set. That is, partition X of data sets A and B contain the same foreign keys and these foreign keys are present only in partition X. For a visualization of this partitioning and sorting key, refer to Figure 5-3.
• The data sets do not change often (if they have to be prepared).

// The composite input format join expression will set how the records
// are going to be read in, and in what input format.
conf.set("mapred.join.expr", CompositeInputFormat.compose(joinType,
KeyValueTextInputFormat.class, userPath, commentPath));

4.4 Cartesian Product

This requires a custom InputFormat so that the two data sets can be combined at the record level. Sample omitted.
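In place of the omitted sample, here is a plain-Java sketch of the record-level pairing only. It does not show the custom InputFormat; the class name `CartesianSketch` and the comma-joined output are assumptions. In the Hadoop version, one would expect the input format to pair up the splits of both data sets and the record reader to run a nested loop like this one over each pair.

```java
import java.util.ArrayList;
import java.util.List;

final class CartesianSketch {

    /** Pairs every record on the left with every record on the right. */
    static List<String> product(List<String> left, List<String> right) {
        List<String> pairs = new ArrayList<>(left.size() * right.size());
        for (String l : left) {
            for (String r : right) {
                pairs.add(l + "," + r);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(product(List.of("a", "b"), List.of("x", "y")));
    }
}
```

The output grows as the product of the two input sizes, which is why this pattern is usually a last resort.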

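5 MetaPatterns

5.1 Job Chaining

There are several ways to do this: write the chain in the driver, or call the jobs from a bash script. Hadoop also provides JobControl, which offers useful features such as tracking failed jobs.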

5.2 Chain Folding

The ChainMapper and ChainReducer approach: one or more Mappers, then a Reducer, then zero or more Mappers (M+ R M*).

5.3 Job Merging

Merging jobs means combining, at the code level, the mappers and reducers of two jobs that read the same data, which saves I/O and parsing time.

6 Input and Output Patterns

6.1 Customizing Input and Output in Hadoop

InputFormat
  getSplits
  createRecordReader
InputSplit
  getLength
  getLocations
RecordReader
  initialize
  getCurrentKey and getCurrentValue
  nextKeyValue
  getProgress
  close
OutputFormat
  checkOutputSpecs
  getRecordWriter
  getOutputCommitter
RecordWriter
  write
  close

6.2 Generating Data (Random Data)

Key point: build fake InputSplits. Unlike a FileSplit, these are not based on blocks, so the only option is to fool Hadoop.

