elasticSearch - Custom Synonym Analyzer in Practice

Overview

In English, one word is often a "variant" of another, e.g. happy => happiness. Normalizing such variants (inflected or derived forms) down to a common root is called stemming, and happy is the stem of happiness. A mapping like adult => man, woman, by contrast, is about synonyms.

Or take the examples below: a search for any one of these terms should also find the others, with reasonable precision.

裙子,裙
西红柿,番茄
china,中国,中华人民共和国
男生,男士,man
女生,女士,women

This is the scenario that calls for a custom analyzer to handle synonyms.
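To see why the defaults fall short, note that out of the box the standard analyzer splits Chinese text into single characters, so 西红柿 and 番茄 share no tokens at all. A quick check with _analyze (a sketch; the sample text is taken from the list above):

GET _analyze
{
  "analyzer": "standard",
  "text": "西红柿"
}
// expected tokens: 西, 红, 柿 — none of which overlap with 番茄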

Building a Custom Analyzer

A custom analyzer is really just the process of combining these three components:

  • Character Filter
  • Tokenizer
  • Token Filter
    The built-in analyzers are simply preset combinations of the three (see the sketch below).
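The _analyze API lets us combine all three stages ad hoc, which is a handy way to try out a pipeline before committing it to an index setting. A minimal sketch:

POST _analyze
{
  "char_filter": ["html_strip"],   // stage 1: character filter
  "tokenizer": "standard",         // stage 2: tokenizer
  "filter": ["lowercase"],         // stage 3: token filter
  "text": "<b>Hello World</b>"
}
// expected tokens: hello, world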

Character Filters

  • Processes the text before the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured. They affect the position and offset information later produced by the Tokenizer.
  • Some built-in Character Filters:
    • HTML strip - removes HTML tags
    • Mapping - string replacement
    • Pattern replace - regex-based replacement

Demo char_filter

html_strip: strip HTML tags
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
// result
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
mapping: string replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type" : "mapping",
      "mappings" : [ "- => _" ]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
// response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}
pattern_replace: regex replacement
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type" : "pattern_replace",
      "pattern" : "http://(.*)",
      "replacement" : "$1"
    }
  ],
  "text" : "http://www.elastic.co"
}
// response
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

Tokenizer

  • Splits the original text into terms (tokens) according to certain rules
  • Elasticsearch built-in Tokenizers:
    • whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy

Demo tokenizer

path_hierarchy: split by path
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a"
}
// response
{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    }
  ]
}

Token Filter

  • Adds, modifies, or deletes the terms output by the Tokenizer
  • Built-in Token Filters:
    • lowercase | stop | synonym (adds synonyms)

Demo filter

stop and snowball filters with the whitespace tokenizer
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "snowball"], // stop removes words like "on", "the", "a"
  "text": ["The girls in China are playing this game!"]
}
{
  "tokens" : [
    {
      "token" : "The", // capitalized "The" is not removed: stop runs without lowercasing
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "play",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}
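Since the stop filter above sees "The" before any lowercasing, putting lowercase in front of it fixes that. A sketch of the corrected chain:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The girls in China are playing this game!"]
}
// "the" is now lowercased first and removed by stop;
// note "China" becomes "china" as a side effect of lowercase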

Custom analyzer

  • The standard format for a custom analyzer, from the official docs:
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { ... custom character filters ... }, // character filters
      "tokenizer":   { ... custom tokenizers ... },        // tokenizers
      "filter":      { ... custom token filters ... },     // token filters
      "analyzer":    { ... custom analyzers ... }
    }
  }
}
  • A custom analyzer example
# define our own analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings" : [
            ":) => happy",
            ":( => sad"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
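We can verify the pieces work together with _analyze; the emoticon mapping should fire before tokenization. A sketch of the expected behavior (the sample text is my own):

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am so :)"
}
// expected tokens: i, am, so, happy
// ":)" is rewritten to "happy" by the emoticons char_filter,
// the pattern tokenizer splits on spaces, then lowercase applies;
// none of these words are in the default _english_ stop list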

Solving the Synonym Requirement

  • synonym token filter
    We define a filter named my_synonym_filter to handle the synonyms,
    define our own analyzer my_custom_analyzer that uses it, and map the title field to my_custom_analyzer.
    (This example uses the synonym filter type; the hot-update section later uses the related synonym_graph.)
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer",
        "search_analyzer": "my_custom_analyzer"
      },
      "author": {
        "type": "keyword"
      }
    }
  }
}

Testing the Analyzer

POST my_index/_analyze
{
  "field": "title",
  "text": "Elizabeth is the English queen",
  "analyzer": "my_custom_analyzer"
}

Notice that "Elizabeth is the English queen" contains english, and since we configured british,english as synonyms, the token stream contains both english and british (and likewise queen and monarch).

{
  "tokens" : [
    {
      "token" : "elizabeth",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "english",
      "start_offset" : 17,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "british",
      "start_offset" : 17,
      "end_offset" : 24,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "queen",
      "start_offset" : 25,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "monarch",
      "start_offset" : 25,
      "end_offset" : 30,
      "type" : "SYNONYM",
      "position" : 4
    }
  ]
}
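The plain synonym filter is fine for single-token pairs like these, but multi-token entries such as china,中国,中华人民共和国 from the overview need synonym_graph (typically at search time) to keep token positions consistent. A minimal sketch using an inline filter in _analyze (the synonym list here is illustrative):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym_graph",
      "synonyms": [ "usa, united states of america" ]
    }
  ],
  "text": "usa"
}
// "usa" expands into a token graph that also covers the
// multi-token phrase "united states of america"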

Now let's index a document to test the queen,monarch synonym pair.


PUT my_index/_doc/1
{
  "title": "Elizabeth is the English queen"
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": "monarch"
    }
  }
}

Searching for monarch returns the document containing queen.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "Elizabeth is the English queen"
        }
      }
    ]
  }
}

That solves the synonym requirement.

Dynamically Updating Synonyms

Enable hot reloading of synonyms; this only works for filters used in a search_analyzer:
https://www.elastic.co/guide/en/elasticsearch/reference/7.x/indices-reload-analyzers.html

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "word_syn"
          ]
        }
      },
      "filter": {
        "word_syn": {
          "type": "synonym_graph",
          "synonyms_path": "analysis/synonym.txt",
          "updateable": true   // enable hot reloading
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "search_analyzer": "my_custom_analyzer"
      },
      "author": {
        "type": "keyword"
      }
    }
  }
}
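The synonyms file lives under the Elasticsearch config directory (here config/analysis/synonym.txt) and uses the Solr synonym format. A sketch of what it might contain, based on the pairs from the overview:

# analysis/synonym.txt — one rule per line
西红柿,番茄
china,中国,中华人民共和国
# "=>" maps the left-hand terms to the right-hand terms only
british => english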

Reloading the Search Analyzers

POST my_index/_reload_search_analyzers
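On success, the response lists which analyzers were reloaded on which nodes, per index; roughly like the following (the field values here are illustrative):

{
  "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 },
  "reload_details" : [
    {
      "index" : "my_index",
      "reloaded_analyzers" : [ "my_custom_analyzer" ],
      "reloaded_node_ids" : [ "mfdqTXn_T7SGr2Ho2KT8uw" ]
    }
  ]
}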

Series Index

ElasticStack - Installation
ElasticStack - Elasticsearch
ElasticStack - Logstash
elasticSearch - Mapping
elasticSearch - Analyzer Introduction
elasticSearch - Analyzer Practice Notes
elasticSearch - Custom Synonym Analyzer in Practice
docker-elk Cluster Practice
Filebeat and Logstash Practice
Filebeat Pipeline Practice
Elasticsearch 7.x Platinum License Crack Practice
ELK Alerting Research and Practice