Study log 211229: understanding Elasticsearch text analyzers

2021-12-29 17:53 Author: mayoiwill

# 211229

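## Configuring built-in analyzers

- Reference
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/configuring-analyzers.html
- Customizing a built-in analyzer
  - How built-in and custom analyzers differ:
  - A built-in analyzer sets `type` to the analyzer's own name
  - A custom analyzer always sets `type` to `custom`, then spells out the tokenizer and the other two components
  - A built-in analyzer accepts only its own specific parameters, such as `stopwords`; check the documentation for each one
  - A custom analyzer is essentially defined by those three components (char_filter, tokenizer, filter)
- Fields already defined under an index's `properties` cannot be changed
  - Runtime fields can be changed
  - Otherwise, reindex, or delete the index and recreate it
- With `fields` sub-fields, the same source field can be analyzed by different analyzers
- `stopwords:_english_` means common English function words are not indexed
  - This causes the `to be or not to be` problem
  - Query the field with stop words removed first; if nothing matches, query the field that keeps stop words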
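
The points above can be sketched as one index definition. This is a minimal sketch rather than something from the original notes: the index name `my-index-000001`, the analyzer name `std_english`, and the field name `my_text` are placeholders. It assumes the built-in `standard` analyzer is configured with `stopwords`, and that a `fields` sub-field indexes the same text a second time with that analyzer.

```
# Placeholder names: my-index-000001, std_english, my_text
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "std_english"
          }
        }
      }
    }
  }
}
```

With this layout, `my_text.english` drops stop words while `my_text` keeps them, so a phrase such as `to be or not to be` can still be found by falling back to `my_text`.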

## Configuring a custom analyzer

- A custom analyzer can use the built-in char_filter, tokenizer, and token_filter components
- Each of the three can also be user-defined
  - For example, create a custom char_filter named `emoticons`
    - Parameters: `type:mapping`, plus `mappings` entries of the form `xx => yyy`
  - Create a custom tokenizer
    - Parameters: `type:pattern`, where `pattern` is a string such as `[ .,!?]`
  - Create a custom token_filter
    - Parameters: `type:stop`, `stopwords:_english_`
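
Putting the three custom components together might look like the sketch below. The index name `my-custom-index` and the component names `my_custom_analyzer`, `punctuation`, and `english_stop` are placeholders, and the two emoticon mappings are illustrative values, not ones recorded in these notes.

```
# Placeholder names: my-custom-index, my_custom_analyzer, punctuation, english_stop
PUT my-custom-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ ":) => _happy_", ":( => _sad_" ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-custom-index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
```

The char_filter runs first and rewrites `:)`, the tokenizer then splits on the listed punctuation and spaces, and the token filters lowercase the tokens and drop English stop words, so the second request should return `i'm`, `_happy_`, `person`, `you`.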

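## Analyzer precedence

- Per-field setting -> index-level default -> `standard`
- Different analyzers can be set for index time and search time
  - The search analyzer is applied to the query text
  - `search_analyzer`
- Reference
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html
- There are several ways to specify one
  - It can be given in the query
  - The usual way is to set it on the field in the mapping
- How to specify it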

```
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "simple"
      }
    }
  }
}
```
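
The other two levels in the precedence chain can be sketched the same way. The index name `my-index-000002`, the analyzer choices, and the query text are placeholders picked for illustration. The first request sets index-level defaults through the reserved analyzer names `default` and `default_search`; the second passes an analyzer in the query itself, which takes precedence for that one search.

```
# Index-level defaults (my-index-000002 is a placeholder name)
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        },
        "default_search": {
          "type": "whitespace"
        }
      }
    }
  }
}

# Analyzer given in the query itself (query text is illustrative)
GET my-index-000001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick foxes",
        "analyzer": "stop"
      }
    }
  }
}
```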

## Built-in analyzers at a glance

- There are many; these notes cover only two, `fingerprint` and `standard`
- fingerprint
  - Based on the fingerprinting algorithm
  - https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint
  - Test

    ```
    POST _analyze
    {
      "analyzer": "fingerprint",
      "text": "Yes yes, Gödel said this sentence is consistent and."
    }
    ```

  - Composition
    - standard tokenizer
    - Token filters, in this order
      - Lower Case Token Filter
      - ASCII folding
      - Stop Token Filter (disabled by default)
      - Fingerprint
  - The key piece is really the Fingerprint token filter
- standard
  - Definition
    - standard tokenizer
    - Lower Case Token Filter
    - Stop Token Filter (disabled by default)
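
The composition listed for `fingerprint` can be written out as a custom analyzer, which makes the order of the pieces explicit. This is a sketch under assumptions: the index name `fingerprint_example` and the analyzer name `rebuilt_fingerprint` are placeholders, and the stop filter is left out because it is disabled by default.

```
# Placeholder names: fingerprint_example, rebuilt_fingerprint
PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}

POST fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

Dropping `fingerprint` from the filter list leaves something close to the `standard` analyzer plus ASCII folding, which is one way to see that the final filter does the sorting, de-duplication, and joining into a single token.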

## Custom analyzers

- Pick a tokenizer
  - Usually `standard`
  - For Chinese, the ICU Tokenizer
- Pick some token filters
  - Lower Case Token Filter
  - ASCII folding
  - Synonyms, and so on

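### icu_tokenizer

- Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html
- Uses a dictionary-based approach
- For installation, see the second question in the Q&A section below (installing on k8s)
- Test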

```
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "南京長江大橋"
}
```

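- To create a custom analyzer, it has to be attached to an index you create
  - Test it through `/<index name>/_analyze`
  - Pass `analyzer: <custom analyzer name>` in the request body
  - The index does not need to contain any fields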

### Q&A

- Q: `failed to find tokenizer under name [icu_tokenizer]`
- A: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
  - `sudo bin/elasticsearch-plugin install analysis-icu`
  - The plugin must be installed on every node, and each node must be restarted afterwards
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-cluster.html
- Q: The method above does not work
- A: Elasticsearch installed on k8s needs a different approach
  - https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-bundles-plugins.html
  - That page describes two options: build a custom image, or add an `initContainers` section to the k8s manifest
  - I used the second option
  - The command that actually runs becomes
  - `bin/elasticsearch-plugin install analysis-icu`
  - The page also covers adding a custom synonym file
  - That will be needed later on

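## token filter

- This is arguably one of the core features of Elasticsearch
- There are many built-in filters, including the ones mentioned earlier
  - Lowercase
  - ASCII folding
- Below I look at a few that interest me
- For the rest, see the documentation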

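### Snowball and Stemmer

- Both perform stemming
- Differences
  - `snowball` uses the Snowball stemmers
  - `stemmer` uses the algorithmic stemmers described at
    - https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html#algorithmic-stemmers
  - Broadly speaking, `snowball` gives slightly better results
  - Better results presumably come at some cost in performance, though I have not measured it
- Test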

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "snowball" ],
  "text": "the foxes jumping quickly"
}
```

  - `filter` can be set to `snowball` or `stemmer`
  - Difference in the results
    - snowball: quickly -> quick
    - stemmer: quickly -> quickli
  - One more thing learned here: a filter can be tested directly, without defining any index
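
For comparison, the same index-free test can be run with `stemmer`. This is a sketch: the first request uses the filter with its default settings, and the second configures it inline with `light_english`, which is just one of the supported `language` values chosen for illustration.

```
# stemmer with default settings
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stemmer" ],
  "text": "the foxes jumping quickly"
}

# stemmer configured inline; light_english is an illustrative choice
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stemmer", "language": "light_english" }
  ],
  "text": "the foxes jumping quickly"
}
```

Running the variants side by side on the same text is a quick way to pick a stemming algorithm before committing it to an index definition.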

### Synonyms (synonym)

- A synonym dictionary has to be provided
- The dictionary file can be in one of two formats: solr or wordnet
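
As a minimal sketch of the filter itself, the definition below uses a short inline list in solr format instead of a dictionary file. The index name `synonym_example`, the filter name `my_synonyms`, the analyzer name `my_synonym_analyzer`, and the word lists are all placeholders.

```
# Placeholder names and word lists throughout
PUT synonym_example
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "jump, leap, hop",
              "universe, cosmos"
            ]
          }
        },
        "analyzer": {
          "my_synonym_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "lowercase", "my_synonyms" ]
          }
        }
      }
    }
  }
}

GET synonym_example/_analyze
{
  "analyzer": "my_synonym_analyzer",
  "text": "leap"
}
```

To load a dictionary file instead, `synonyms` would be replaced by `synonyms_path` (a path relative to the Elasticsearch config directory), with `"format": "wordnet"` added when the file is in WordNet format.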

## Custom synonym replacement

- Download an English synonym dictionary in WordNet 3.0 format
  - https://github.com/buildbreakdo/elasticsearch-wordnet-synonyms
  - https://github.com/buildbreakdo/elasticsearch-wordnet-synonyms/raw/master/synonyms.json
- Uploading the dictionary file through a k8s ConfigMap did not work
  - Reference
  - https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/
  - `kubectl create configmap synonyms --from-file=synonym`
  - Problem: the file is too large to upload
  - Request entity too large
- Use initContainers to copy the file over instead
  - Ran into GitHub being unreachable from the network
  - The domestic alternative, gitee, cannot be fetched with plain curl because it requires login
  - Plan: set up an internal nginx backed by a PVC. TODO: continue tomorrow
- Configure Elasticsearch on k8s to use the dictionary

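## Applying the analyzer to an index

==========

Today did not go very smoothly: I got stuck on uploading the synonym file.

On the other hand, that is how I learned to use initContainers.

More tomorrow.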
