How Does Elasticsearch Relevance Scoring Work?

Published: 2021-10-20 17:51:49 | Source: 億速云 | Views: 282 | Author: 柒染 | Category: Big Data

This article looks at how relevance scoring works in Elasticsearch. Many people are not familiar with the details, so the notes below summarize the mechanism step by step.

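In Elasticsearch 2.3, full-text search uses TF-IDF relevance scoring by default. In practice, we use multi_match to set a boost for each field, use should clauses to boost specific documents, or use the more advanced function_score to customize scoring. With Elasticsearch's explain feature, we can study the underlying mechanism in depth.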

Creating an index

PUT /gino_test
{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Insert the test data:

[Figure: the test documents inserted into gino_test]

Simple case: scoring a single-field match

POST gino_test/_search
{
  "explain": true,
  "query": {
    "match": {
      "text": "my cup"
    }
  }
}

Query result: score_simple.json

Scoring analysis:


In the version discussed here, Elasticsearch's default relevance scoring is Lucene's TF-IDF implementation.


Let's take a closer look at this formula:

score(q,d) = queryNorm(q) · coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d))    [t in q]

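score(q,d) is the relevance score between the query input q and the current document d;
queryNorm(q) is the query normalization factor; it keeps the final score from growing too large so that scores remain reasonably comparable;
coord(q,d) is the coordination factor: the proportion of the query's tokens that the document matches;
tf(t,d) is how often a query token occurs in the document; the higher the frequency, the higher the score;
idf(t) reflects how rare a query token is. It does not depend on the current document, only on how often the token occurs across the index: the less often it occurs, the rarer the term and the higher the score;
t.getBoost() is the boost specified at query time;
norm(t,d) is a weight derived from the number of terms in the current document's field. It is computed at index time and, because of the compact way it is stored, the value is quantized (in the results here it comes out as a multiple of 0.125).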

Note: the statistics used in this calculation are those of the shard that holds the document, not those of the whole index.
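The factors described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the formulas documented for Lucene's classic TF-IDF similarity (square-root tf, logarithmic idf, inverse-square-root length norm). The function names are made up for this sketch, and the norm is shown before quantization, so the numbers will not match explain output exactly.

```python
import math

def tf(freq: int) -> float:
    # Term-frequency factor: square root of the term's occurrences in the field.
    return math.sqrt(freq)

def idf(doc_freq: int, num_docs: int) -> float:
    # Inverse document frequency, computed from shard-level statistics.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms: int) -> float:
    # Field-length norm before it is quantized for storage.
    return 1.0 / math.sqrt(num_terms)

def query_norm(sum_of_squared_weights: float) -> float:
    # Normalization factor shared by every term of the query.
    return 1.0 / math.sqrt(sum_of_squared_weights)

def coord(matched_terms: int, query_terms: int) -> float:
    # Fraction of the query's terms that the document matched.
    return matched_terms / query_terms

# One term that occurs twice in a 4-term field and appears in 1 of 3 documents.
print(tf(2) * idf(1, 3) ** 2 * field_norm(4))
```

Because idf uses per-shard counts, the same document can score differently depending on which shard it lands on, which is why the test index above uses a single shard.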

score(q,d) = _score(q,d.f)                                      --------- ①
= queryNorm(q) · coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d))
= coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d) · queryNorm(q))
= coord(q,d.f) · ∑ _score(q.ti, d.f) [ti in q]                  --------- ②
= coord(q,d.f) · (_score(q.t1, d.f) + _score(q.t2, d.f))

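① The relevance score is really a score between the query and one particular field of a document, not between the query and the document as a whole;
② After rearranging the formula, the score becomes the sum of the relevance of each query term to that field; if a term does not match, the coord factor accounts for it.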
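Step ② can be illustrated with a small Python sketch. It assumes the per-term scores for one field have already been computed (queryNorm included) and only shows how coord scales their sum. The function name and the sample numbers are hypothetical.

```python
def field_score(term_scores: dict, query_terms: list) -> float:
    # term_scores maps a matched query term to its _score(q.ti, d.f);
    # terms that do not occur in the field are simply absent.
    matched = [t for t in query_terms if t in term_scores]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)
    return coord * sum(term_scores[t] for t in matched)

# Both terms of "my cup" match the field.
print(field_score({"my": 0.2, "cup": 0.3}, ["my", "cup"]))
# Only "cup" matches, so coord halves the score.
print(field_score({"cup": 0.3}, ["my", "cup"]))
```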

multi_match: scoring across multiple fields (best_fields mode)

POST /gino_test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "gino cup",
      "fields": [
        "text^8",
        "fullname^5"
      ]
    }
  }
}

Query result: score_bestfields.json

Scoring analysis:

score(q,d) = max(_score(q, d.fi)) = max(_score(q, d.f1), _score(q, d.f2))
= max(coord(q,d.f1) · (_score(q.t1, d.f1) + _score(q.t2, d.f1)), coord(q,d.f2) · (_score(q.t1, d.f2) + _score(q.t2, d.f2)))

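In best_fields mode, a multi-field query is scored against each field separately, and the max operation then picks the highest of those scores.
The field's boost is multiplied in when the query weight is computed, and again when fieldNorm is computed.
The default operator is or. With and, the scoring mechanism is the same, but the set of matching documents differs.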
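The max step can be sketched in Python as follows. The per-field scores are assumed inputs (already including coord and the field boosts), and the function name is hypothetical. The tie_breaker argument mirrors the multi_match parameter of the same name, which defaults to 0.0 and so reduces the result to a plain max, as in the analysis above.

```python
def best_fields_score(field_scores: dict, tie_breaker: float = 0.0) -> float:
    # field_scores maps a field name to _score(q, d.fi) for that field.
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    # With tie_breaker = 0.0 only the best field contributes.
    return best + tie_breaker * others

print(best_fields_score({"text": 1.6, "fullname": 0.9}))
```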

multi_match: scoring across multiple fields (cross_fields mode)

POST /gino_test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "gino cup",
      "type": "cross_fields",
      "fields": [
        "text^8",
        "fullname^5"
      ]
    }
  }
}

Query result: score_crossfields.json

Scoring analysis:

score(q, d) = ∑ (_score(q.ti, d.f)) = ∑ (_score(q.t1, d.f), _score(q.t2, d.f))
= ∑ (max(coord(q.t1,d.f) · _score(q.t1, d.f1), coord(q.t1,d.f) · _score(q.t1, d.f2)), max(coord(q.t2,d.f) · _score(q.t2, d.f1), coord(q.t2,d.f) · _score(q.t2, d.f2)))

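Here coord(q.t1,d.f) is the proportion of the queried fields that match a given search term (e.g. gino), whereas in best_fields mode coord(q,d.f1) is the proportion of all search terms (e.g. gino and cup) that occur in one particular field (e.g. text);
In cross_fields mode, a multi-field query is scored term by term: each term gets a best_fields-style score (that is, the field that matches it best wins), and the per-term scores are then summed.
The default operator is or. With and, the scoring mechanism is the same, but the set of matching documents differs. This is a response from a query using the or operator: score_crossfields_or.json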
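The term-centric combination can be sketched in Python as below. The per-term, per-field scores are assumed inputs (coord already applied), and the function name and sample numbers are hypothetical. The sketch only shows the "max over fields, then sum over terms" shape of the analysis above.

```python
def cross_fields_score(term_field_scores: dict) -> float:
    # term_field_scores maps each query term to {field name: score of that term in the field}.
    total = 0.0
    for by_field in term_field_scores.values():
        # For each term, keep only the field that matches it best.
        total += max(by_field.values(), default=0.0)
    return total

scores = {
    "gino": {"text": 0.0, "fullname": 1.2},
    "cup": {"text": 0.8, "fullname": 0.0},
}
print(cross_fields_score(scores))
```

Compared with best_fields, a document no longer needs one field that contains both terms: gino can be credited to fullname and cup to text.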

Boosting scores with should

To support the filter tests, add a tags field to gino_test/tweet.

PUT /gino_test/_mapping/tweet
{
  "properties": {
    "tags": {
      "type": "string",
      "analyzer": "fulltext_analyzer"
    }
  }
}

Add tag values to the documents:

[Figure: the test documents with their tags values]

POST /gino_test/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": {
            "multi_match": {
              "query": "gino cup",
              "fields": [
                "text^8",
                "fullname^5"
              ],
              "type": "best_fields",
              "operator": "or"
            }
          },
          "should": [
            {
              "term": {
                "tags": {
                  "value": "goods",
                  "boost": 6
                }
              }
            },
            {
              "term": {
                "tags": {
                  "value": "hobby",
                  "boost": 3
                }
              }
            }
          ]
        }
      }
    }
  }
}

Query result: score_should.json

Scoring analysis:


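Adding boosted should clauses effectively adds one more scoring term for each clause that matches; each term is computed the same way as in the calculations above.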
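As a rough Python sketch of this additive behaviour: the must score and the score of each should clause are assumed inputs, and the coord and queryNorm adjustments are deliberately left out, so this shows only the shape of the calculation, not the exact numbers in the explain output. The function name and sample values are hypothetical.

```python
def bool_score(must_score: float, should_scores: dict, doc_tags: set) -> float:
    # should_scores maps a tag value to the (already boosted) score of its term clause.
    bonus = sum(score for tag, score in should_scores.items() if tag in doc_tags)
    # Matching should clauses add to the score; non-matching ones add nothing.
    return must_score + bonus

print(bool_score(2.0, {"goods": 0.6, "hobby": 0.3}, {"goods"}))
```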

function_score: advanced scoring

DSL format:

{
    "function_score": {
        "query": {},
        "boost": "boost for the whole query",
        "functions": [
            {
                "filter": {},
                "FUNCTION": {}, 
                "weight": number
            },
            {
                "FUNCTION": {} 
            },
            {
                "filter": {},
                "weight": number
            }
        ],
        "max_boost": number,
        "score_mode": "(multiply|max|...)",
        "boost_mode": "(multiply|replace|...)",
        "min_score" : number
    }
}

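The following FUNCTION types are supported:

script_score: custom scoring logic written as a script; the fields involved must be numeric
weight: a fixed multiplier, usually combined with a filter, meaning "multiply the score by this much when the condition is met"
random_score: generates a random score, for example to shuffle the ordering per user based on a uid
field_value_factor: lets the value of a field in the index influence the score, such as sales volume (the field must be numeric)
decay functions: scores that decay with distance from an origin, for example scoring places closer to the city center higher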
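To make the decay idea concrete, here is a minimal Python sketch of a Gaussian decay, assuming the formula given in the Elasticsearch documentation for the gauss function. The parameter names origin, scale, offset and decay follow that documentation; the function name and the sample distances are hypothetical.

```python
import math

def gauss_decay(value: float, origin: float, scale: float,
                offset: float = 0.0, decay: float = 0.5) -> float:
    # Distance beyond the offset; anything inside the offset scores 1.0.
    distance = max(0.0, abs(value - origin) - offset)
    # sigma^2 is chosen so that the score equals `decay` at distance == scale.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))

# Score of a place 3 km from the city center, with a 5 km scale.
print(gauss_decay(value=3.0, origin=0.0, scale=5.0))
```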

Let's run an experiment. First, add a view-count field to the index:

PUT /gino_test/_mapping/tweet
{
  "properties": {
    "views": {
      "type": "long",
      "doc_values": true,
      "fielddata": {
        "format": "doc_values"
      }
    }
  }
}

Give each of the three documents a views value:

POST gino_test/tweet/1/_update
{
    "doc" : {
        "views" : 56
    }
}

The data ends up looking like this:

[Figure: the three test documents with their views values]

Run a query:

POST /gino_test/_search
{
  "explain": true,
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "gino cup",
          "type": "cross_fields",
          "fields": [
            "text^8",
            "fullname^5"
          ]
        }
      },
      "boost": 2,
      "functions": [
        {
          "field_value_factor": {
            "field": "views",
            "factor": 1.2,
            "modifier": "sqrt",
            "missing": 1
          }
        },
        {
          "filter": {
            "term": {
              "tags": {
                "value": "goods"
              }
            }
          },
          "weight": 4
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}

Query result: score_function.json

Scoring analysis:

score(q,d) = score_query(q,d) * (score_fvf(`views`) * score_filter(`tags:goods`))

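score_mode determines how the scores of the individual FUNCTIONs are combined with each other; note that different FUNCTIONs can produce scores of very different magnitudes;
boost_mode determines how the combined function score is combined with the query score; here too, keep the relative magnitudes of the two in mind.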
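The combination used in the query above (score_mode and boost_mode both set to multiply) can be sketched in Python. This is an illustration under assumptions: the query score is taken as an input, the top-level boost of 2 is left out, and field_value_factor is computed as modifier(factor * value) as described in the Elasticsearch documentation. The function names are hypothetical.

```python
import math

def field_value_factor(value, factor=1.2, missing=1.0):
    # "sqrt" modifier: sqrt(factor * field value); `missing` is used when the field is absent.
    v = missing if value is None else value
    return math.sqrt(factor * v)

def function_score(query_score: float, views, tags: set) -> float:
    function_scores = [field_value_factor(views)]
    # The weight function only contributes when its filter (tags:goods) matches.
    if "goods" in tags:
        function_scores.append(4.0)
    combined = math.prod(function_scores)  # score_mode: multiply
    return query_score * combined          # boost_mode: multiply

# A document with 56 views tagged "goods", and a query score of 1.0.
print(function_score(1.0, 56, {"goods"}))
```

With multiply in both places, a large views value can easily outweigh the text relevance, which is the difference in magnitude the notes above warn about.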

rescore: rescoring

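Official ES documentation: Rescoring | Elasticsearch Reference [2.3] | Elastic

Rescoring is not applied to every document. For example, when the top 10 results are requested, each shard first finds its own top 10 using the default rules, then applies the rescore rule to them and returns the rescored hits to the node coordinating the request, which merges and sorts them and returns the result to the user.

rescore supports multiple rules, and the rescore score can be combined with the original default score (for example as a weighted sum).

Because rescore computes scores for only a small number of documents, it should perform better, but it interacts with the global ordering, so take care when using it in practice.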
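The per-shard behaviour can be sketched in Python. This is a simplified model, assuming the weighted-sum combination (original score times query_weight plus rescore score times rescore_query_weight) and a per-shard window. The parameter names mirror the rescore options of the same names, while the function name and the data are hypothetical.

```python
def rescore_shard(hits, rescore_fn, window_size,
                  query_weight=1.0, rescore_query_weight=1.0):
    # hits: list of (doc_id, original_score) tuples from one shard.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    # Only the top `window_size` hits get the (more expensive) second score.
    rescored = [
        (doc, query_weight * score + rescore_query_weight * rescore_fn(doc))
        for doc, score in window
    ]
    rescored.sort(key=lambda h: h[1], reverse=True)
    return rescored + rest

second_pass = {"a": 0.0, "b": 5.0, "c": 10.0}
print(rescore_shard([("a", 3.0), ("b", 2.0), ("c", 1.0)], second_pass.get, window_size=2))
```

Document c would have won under the second-pass score, but it falls outside the window and is never rescored, which is the global-ordering caveat mentioned above.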

