elasticSearch - Introduction to Analyzers

Analysis and Analyzer

  • Analysis - text analysis is the process of converting full text into a series of terms (term/token), also known as tokenization
  • Analysis is performed by an Analyzer
    • You can use Elasticsearch's built-in analyzers, or define custom analyzers as needed
  • Besides converting terms when data is written, the same analyzer must also be applied to the query string when matching Query clauses

Components of an Analyzer

  • An analyzer is the component dedicated to text analysis; an Analyzer consists of three parts
    • Character Filters (pre-process the raw text, e.g. strip HTML)
    • Tokenizer (split the text into terms according to rules)
    • Token Filters (post-process the terms: lowercase, remove stopwords, add synonyms)
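The three stages above can be pictured as a simple pipeline. The following is a rough Python sketch of the idea, not the actual Elasticsearch implementation; the regex-based HTML stripping is a crude stand-in for a real character filter:

```python
import re

def char_filter(text):
    # character filter: strip HTML tags from the raw text
    # (assumption: a crude regex stand-in for a real html_strip filter)
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # tokenizer: split the filtered text into terms on word boundaries
    return re.findall(r"\w+", text)

def token_filter(tokens):
    # token filter: lowercase every term and drop a few stopwords
    stopwords = {"the", "a", "is"}
    return [t.lower() for t in tokens if t.lower() not in stopwords]

def analyze(text):
    # an Analyzer chains the three stages in order
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>The Quick Fox</b>"))  # ['quick', 'fox']
```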


Elasticsearch's Built-in Analyzers

  • Standard Analyzer - the default analyzer; splits on word boundaries and lowercases
  • Simple Analyzer - splits on non-letter characters (symbols are filtered out) and lowercases
  • Stop Analyzer - lowercases and removes stopwords (the, a, is)
  • Whitespace Analyzer - splits on whitespace; does not lowercase
  • Keyword Analyzer - no tokenization; the input is emitted as a single term
  • Pattern Analyzer - splits by regular expression, default \W+
  • Language - analyzers for more than 30 common languages
  • Custom Analyzer - user-defined analyzers

Standard Analyzer

  • the default analyzer
  • splits on word boundaries
  • lowercases
    #standard
    GET _analyze
    {
      "analyzer": "standard",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
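The standard analyzer's output on this sentence can be approximated in plain Python: split on word boundaries and lowercase. The real analyzer uses Unicode text segmentation, so the regex below is only an approximation, but it produces the same terms for this input:

```python
import re

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

# approximate the standard analyzer: split on word boundaries, lowercase;
# note that "brown-foxes" becomes two terms and the digit 2 is kept
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['2', 'running', 'quick', 'brown', 'foxes', 'leap', 'over',
#  'lazy', 'dogs', 'in', 'the', 'summer', 'evening']
```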

Simple Analyzer

  • splits on non-letter characters; everything that is not a letter is dropped
  • lowercases
    #simple: non-letter tokens such as the 2 are removed
    GET _analyze
    {
      "analyzer": "simple",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
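In Python terms, the simple analyzer behaves roughly like keeping only runs of letters and lowercasing. This sketch is an approximation, not the Elasticsearch implementation; compare its output with the standard analyzer's to see that the digit 2 disappears:

```python
import re

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

# approximate the simple analyzer: keep only runs of letters, lowercase;
# unlike the standard analyzer, the digit 2 is dropped entirely
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)
# ['running', 'quick', 'brown', 'foxes', 'leap', 'over',
#  'lazy', 'dogs', 'in', 'the', 'summer', 'evening']
```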

Whitespace Analyzer

  • splits on whitespace
    #whitespace
    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
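The whitespace analyzer is the easiest to reproduce locally: it only splits on whitespace and changes nothing else. A Python sketch, for comparison with the previous analyzers:

```python
text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

# the whitespace analyzer only splits on whitespace: case is preserved,
# "brown-foxes" stays one term, and "evening." keeps its trailing dot
tokens = text.split()
print(tokens)
# ['2', 'running', 'Quick', 'brown-foxes', 'leap', 'over',
#  'lazy', 'dogs', 'in', 'the', 'summer', 'evening.']
```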

Stop Analyzer

  • compared with the Simple Analyzer
  • it adds a stop token filter
    • which removes modifier words such as the, a, is, in
#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
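As a rough Python model, the stop analyzer is the simple analyzer followed by a stopword filter. The stopword set below is a small hand-picked subset of the English defaults, just enough for this sentence:

```python
import re

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

# stop analyzer = simple analyzer + a stopword filter
# (assumption: a tiny subset of the default English stopword list)
stopwords = {"the", "a", "is", "in"}
tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]
print(tokens)
# ['running', 'quick', 'brown', 'foxes', 'leap', 'over',
#  'lazy', 'dogs', 'summer', 'evening']
```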


Keyword Analyzer

  • no tokenization; the input is output directly as a single term
    #keyword
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

Pattern Analyzer

  • tokenizes with a regular expression
  • the default is \W+, i.e. it splits on non-word characters
    GET _analyze
    {
      "analyzer": "pattern",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
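With the default \W+ pattern and its lowercasing enabled, the pattern analyzer's output on this input coincides with the standard analyzer's. A Python approximation of that default configuration:

```python
import re

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

# approximate the pattern analyzer's defaults: split on the regex \W+
# and lowercase; empty strings from trailing punctuation are dropped
tokens = [t for t in re.split(r"\W+", text.lower()) if t]
print(tokens)
# ['2', 'running', 'quick', 'brown', 'foxes', 'leap', 'over',
#  'lazy', 'dogs', 'in', 'the', 'summer', 'evening']
```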

Language Analyzer

  • language-specific analyzers for individual languages
    #english
    GET _analyze
    {
      "analyzer": "english",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }

Using the _analyze API

  • Test by specifying an analyzer directly
GET _analyze
{
  "analyzer": "standard",
  "text" : "Mastering Elasticsearch , elasticsearch in Action"
}
// response
{
  "tokens" : [
    {
      "token" : "mastering",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 10,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 26,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 40,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "action",
      "start_offset" : 43,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
  • Test against a specific field of an index
    POST books/_analyze
    {
      "field": "title",
      "text": "Mastering Elasticesearch"
    }
  • Test a custom combination of tokenizer and token filters
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticesearch"
}
// response
{
  "tokens" : [
    {
      "token" : "mastering",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "elasticesearch",
      "start_offset" : 10,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Series Index

ElasticStack - installation
ElasticStack - elasticsearch
ElasticStack - logstash
elasticSearch - mapping
elasticSearch - Introduction to Analyzers
elasticSearch - analyzer practice notes
elasticSearch - custom synonym analyzer practice
docker-elk cluster practice
filebeat and logstash practice
filebeat pipeline practice
Elasticsearch 7.x Platinum-tier crack practice
ELK alerting research and practice