elasticSearch - Custom Synonym Analyzer in Practice

Overview

In English, one word is often a "variant" of another, e.g. happy => happiness. Normalizing such variants (inflected or derived forms) down to a common root is called stemming, and happy is the stem of happiness. A mapping like adult => man, woman, by contrast, is about synonyms.

Or take the examples below: a search for any one of these terms should also find the others, with reasonable precision.

裙子,裙
西红柿,番茄
china,中国,中华人民共和国
男生,男士,man
女生,女士,women

This is the scenario that calls for a custom analyzer to handle synonyms.
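To see why the defaults fall short, note that out of the box the standard analyzer splits Chinese text into single characters, so 西红柿 and 番茄 share no tokens at all. A quick check with _analyze (a sketch; the sample text is taken from the list above):

GET _analyze
{
  "analyzer": "standard",
  "text": "西红柿"
}
// expected tokens: 西, 红, 柿 — none of which overlap with 番茄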

Building a Custom Analyzer

A custom analyzer is really just the process of combining these three components:

  • Character Filter
  • Tokenizer
  • Token Filter
    The built-in analyzers are simply preset combinations of the three (see the sketch below).
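The _analyze API lets us combine all three stages ad hoc, which is a handy way to try out a pipeline before committing it to an index setting. A minimal sketch:

POST _analyze
{
  "char_filter": ["html_strip"],   // stage 1: character filter
  "tokenizer": "standard",         // stage 2: tokenizer
  "filter": ["lowercase"],         // stage 3: token filter
  "text": "<b>Hello World</b>"
}
// expected tokens: hello, world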

Character Filters

  • Processes the text before the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured. They affect the position and offset information later produced by the Tokenizer.
  • Some built-in Character Filters:
    • HTML strip - removes HTML tags
    • Mapping - string replacement
    • Pattern replace - regex-based replacement

Demo char_filter

html_strip: strip HTML tags
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
// result
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
mapping: string replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type" : "mapping",
      "mappings" : [ "- => _" ]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
// response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}
pattern_replace: regex replacement
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type" : "pattern_replace",
      "pattern" : "http://(.*)",
      "replacement" : "$1"
    }
  ],
  "text" : "http://www.elastic.co"
}
// response
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

Tokenizer

  • Splits the original text into terms (tokens) according to certain rules
  • Elasticsearch built-in Tokenizers:
    • whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy

Demo tokenizer

path_hierarchy: split by path
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a"
}
// response
{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    }
  ]
}

Token Filter

  • Adds, modifies, or deletes the terms output by the Tokenizer
  • Built-in Token Filters:
    • lowercase | stop | synonym (adds synonyms)

Demo filter

stop and snowball filters with the whitespace tokenizer
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "snowball"], // stop removes words like "on", "the", "a"
  "text": ["The girls in China are playing this game!"]
}
{
  "tokens" : [
    {
      "token" : "The", // capitalized "The" is not removed: stop runs without lowercasing
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "play",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game!",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "word",
      "position" : 7
    }
  ]
}
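Since the stop filter above sees "The" before any lowercasing, putting lowercase in front of it fixes that. A sketch of the corrected chain:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The girls in China are playing this game!"]
}
// "the" is now lowercased first and removed by stop;
// note "China" becomes "china" as a side effect of lowercase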

Custom analyzer

  • The standard format for a custom analyzer, from the official docs:
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { ... custom character filters ... }, // character filters
      "tokenizer":   { ... custom tokenizers ... },        // tokenizers
      "filter":      { ... custom token filters ... },     // token filters
      "analyzer":    { ... custom analyzers ... }
    }
  }
}
  • A custom analyzer example
# define our own analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings" : [
            ":) => happy",
            ":( => sad"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
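We can verify the pieces work together with _analyze; the emoticon mapping should fire before tokenization. A sketch of the expected behavior (the sample text is my own):

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am so :)"
}
// expected tokens: i, am, so, happy
// ":)" is rewritten to "happy" by the emoticons char_filter,
// the pattern tokenizer splits on spaces, then lowercase applies;
// none of these words are in the default _english_ stop list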

Solving the Synonym Requirement

  • synonym token filter
    We define a filter named my_synonym_filter to handle the synonyms,
    define our own analyzer my_custom_analyzer that uses it, and map the title field to my_custom_analyzer.
    (This example uses the synonym filter type; the hot-update section later uses the related synonym_graph.)
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer",
        "search_analyzer": "my_custom_analyzer"
      },
      "author": {
        "type": "keyword"
      }
    }
  }
}

Testing the Analyzer

POST my_index/_analyze
{
  "field": "title",
  "text": "Elizabeth is the English queen",
  "analyzer": "my_custom_analyzer"
}

Notice that "Elizabeth is the English queen" contains english, and since we configured british,english as synonyms, the token stream contains both english and british (and likewise queen and monarch).

{
  "tokens" : [
    {
      "token" : "elizabeth",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "english",
      "start_offset" : 17,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "british",
      "start_offset" : 17,
      "end_offset" : 24,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "queen",
      "start_offset" : 25,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "monarch",
      "start_offset" : 25,
      "end_offset" : 30,
      "type" : "SYNONYM",
      "position" : 4
    }
  ]
}
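The plain synonym filter is fine for single-token pairs like these, but multi-token entries such as china,中国,中华人民共和国 from the overview need synonym_graph (typically at search time) to keep token positions consistent. A minimal sketch using an inline filter in _analyze (the synonym list here is illustrative):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym_graph",
      "synonyms": [ "usa, united states of america" ]
    }
  ],
  "text": "usa"
}
// "usa" expands into a token graph that also covers the
// multi-token phrase "united states of america"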

Now let's index a document to test the queen,monarch synonym pair.


PUT my_index/_doc/1
{
  "title": "Elizabeth is the English queen"
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": "monarch"
    }
  }
}

Searching for monarch returns the document containing queen.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "Elizabeth is the English queen"
        }
      }
    ]
  }
}

That solves the synonym requirement.

Dynamically Updating Synonyms

Enable hot reloading of synonyms; this only works for filters used in a search_analyzer:
https://www.elastic.co/guide/en/elasticsearch/reference/7.x/indices-reload-analyzers.html

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "word_syn"
          ]
        }
      },
      "filter": {
        "word_syn": {
          "type": "synonym_graph",
          "synonyms_path": "analysis/synonym.txt",
          "updateable": true   // enable hot reloading
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "search_analyzer": "my_custom_analyzer"
      },
      "author": {
        "type": "keyword"
      }
    }
  }
}
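The synonyms file lives under the Elasticsearch config directory (here config/analysis/synonym.txt) and uses the Solr synonym format. A sketch of what it might contain, based on the pairs from the overview:

# analysis/synonym.txt — one rule per line
西红柿,番茄
china,中国,中华人民共和国
# "=>" maps the left-hand terms to the right-hand terms only
british => english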

Reloading the Search Analyzers

POST my_index/_reload_search_analyzers
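On success, the response lists which analyzers were reloaded on which nodes, per index; roughly like the following (the field values here are illustrative):

{
  "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 },
  "reload_details" : [
    {
      "index" : "my_index",
      "reloaded_analyzers" : [ "my_custom_analyzer" ],
      "reloaded_node_ids" : [ "mfdqTXn_T7SGr2Ho2KT8uw" ]
    }
  ]
}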

Series Index

ElasticStack - Installation
ElasticStack - Elasticsearch
ElasticStack - Logstash
elasticSearch - Mapping
elasticSearch - Analyzer Introduction
elasticSearch - Analyzer Practice Notes
elasticSearch - Custom Synonym Analyzer in Practice
docker-elk Cluster Practice
Filebeat and Logstash Practice
Filebeat Pipeline Practice
Elasticsearch 7.x Platinum License Crack Practice
ELK Alerting Research and Practice