




The Index Plan

Published: 2020-07-05 03:45:10 | Source: Internet | Author: 努力的C | Category: Development

In order to index the CSV, we want to take two fields from each row, title and description, and turn them into suitable terms. For straightforward textual search we don’t need document values.

Because we’re dealing with free text, and because we know the whole dataset is in English, we can use stemming so that, for instance, searches for “sundial” and “sundials” both match the same documents. This way people don’t need to worry too much about exactly which word forms to use in their query.

Finally, we want a way of separating the two fields. In Xapian this is done using term prefixes, basically by putting short strings at the beginning of terms to indicate which field the term indexes. As well as prefixed terms, we also want to generate unprefixed terms, so that as well as searching within fields you can also search for text in any field.
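The prefixing idea can be illustrated without Xapian itself. What follows is a minimal sketch in plain Python, not Xapian's implementation: the helpers `terms_for` and `document_terms`, the regular-expression tokenisation, and the sample title and description are all assumptions made for illustration, and stemming is left out. It uses the prefixes ‘S’ and ‘XD’ for title and description.

```python
import re

def terms_for(text, prefix=""):
    """Lowercase the words in `text` and put `prefix` in front of each one.

    Hypothetical helper: real term generation also handles stemming,
    punctuation rules, and so on.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [prefix + word for word in words]

def document_terms(title, description):
    """Collect per-field prefixed terms plus unprefixed terms."""
    terms = set()
    terms.update(terms_for(title, "S"))         # match only in the title
    terms.update(terms_for(description, "XD"))  # match only in the description
    terms.update(terms_for(title))              # unprefixed: match in any field
    terms.update(terms_for(description))
    return terms

# Made-up sample record, for illustration only.
terms = document_terms("Ansonia sundial", "A brass sundial on a stone base")
print(sorted(terms))
```

A search restricted to the title would look for `Ssundial`, a search restricted to the description for `XDsundial`, and a search across all fields for plain `sundial`.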

Some prefixes are conventional, which is helpful if you ever need to interoperate with omega (a web-based search engine) or other compatible systems. Following those conventions, we’ll use ‘S’ as the prefix for title (it stands for ‘subject’), and ‘XD’ for description. A full list of conventional prefixes is given at the top of the omega documentation on term prefixes.

When you’re indexing multiple fields like this, the term positions used for each field when indexed unprefixed need to be kept apart. Say you have a title of “The Saints”, and description “Don’t like rabbits? Keep reading.” If you index those fields without a gap, the phrase search “Saints don’t like rabbits” will match, where it really shouldn’t. Usually a gap of 100 between each field is enough.
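The effect of the gap can be reproduced with a toy positional index. This is a sketch in plain Python under simplifying assumptions, not how Xapian stores positions: `tokenize`, `index_positions` and `phrase_matches` are hypothetical helpers, and a phrase is treated as words at strictly consecutive positions.

```python
import re

def tokenize(text):
    """Split text into lowercase words (a simplification for illustration)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def index_positions(fields, gap):
    """Map each word to the set of positions it occupies.

    After each field, `gap` positions are skipped before the next field.
    """
    positions = {}
    pos = 0
    for text in fields:
        for word in tokenize(text):
            pos += 1
            positions.setdefault(word, set()).add(pos)
        pos += gap
    return positions

def phrase_matches(positions, phrase):
    """True if the words of `phrase` occur at consecutive positions."""
    words = tokenize(phrase)
    if not words:
        return False
    return any(
        all(start + offset in positions.get(word, set())
            for offset, word in enumerate(words))
        for start in positions.get(words[0], set())
    )

fields = ["The Saints", "Don't like rabbits? Keep reading."]
phrase = "Saints don't like rabbits"

no_gap = index_positions(fields, gap=0)
with_gap = index_positions(fields, gap=100)

print(phrase_matches(no_gap, phrase))    # phrase spans the field boundary
print(phrase_matches(with_gap, phrase))  # the gap keeps the fields apart
```

With no gap, "saints" and "don't" sit at adjacent positions, so the cross-field phrase matches. With a gap of 100 the description starts far from the end of the title, while a phrase that lies wholly inside one field, such as "don't like rabbits", still matches.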

To write to a database, we use the WritableDatabase class, which allows us to create, update or overwrite a database.

To create terms, we use Xapian’s TermGenerator, a built-in class to make turning free text into terms easier. It will split into words, apply stemming, and then add term prefixes as needed. It can also take care of term positions, including the gap between different fields.
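Putting the plan together might look roughly like the following. This is a sketch rather than a definitive indexer: it assumes Xapian's Python bindings are installed, and the function name `index_rows`, the `db_path` argument, and the shape of `rows` (dicts with `title` and `description` keys) are hypothetical choices for illustration. Storing each row as JSON document data is likewise just one option.

```python
import json

try:
    import xapian  # Xapian's Python bindings; assumed to be installed
except ImportError:
    xapian = None

def index_rows(db_path, rows):
    """Index an iterable of dicts with 'title' and 'description' keys.

    Hypothetical helper: names and data layout are assumptions.
    """
    if xapian is None:
        raise RuntimeError("the xapian Python bindings are required")

    # Create the database if it does not exist, otherwise open it for writing.
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)

    # The term generator splits text into words and applies English stemming.
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))

    for row in rows:
        title = row.get("title", "")
        description = row.get("description", "")

        doc = xapian.Document()
        termgenerator.set_document(doc)

        # Prefixed terms, for searching within a single field.
        termgenerator.index_text(title, 1, "S")
        termgenerator.index_text(description, 1, "XD")

        # Unprefixed terms, for searching across all fields, with a
        # position gap between the fields (the default increase is 100).
        termgenerator.index_text(title)
        termgenerator.increase_termpos()
        termgenerator.index_text(description)

        # Keep the original row so it can be shown in search results.
        doc.set_data(json.dumps(row))
        db.add_document(doc)

    db.commit()
    db.close()
```

A call such as `index_rows("db", rows)` would then build the database in a directory named `db`, where `rows` could come from `csv.DictReader` over the CSV file.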









