ElasticSearch Study Notes – (8) Installing the IK Chinese Analyzer and the Pinyin Analyzer

elasticsearch aide_941 21℃

Copyright notice: this is an original article by the author and may not be reproduced without the author's permission. https://blog.csdn.net/qq_28988969/article/details/79540620

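IK Analyzer

Source repository: https://github.com/medcl/elasticsearch-analysis-ik

Prebuilt packages are listed at: https://github.com/medcl/elasticsearch-analysis-ik/releases
A release package can be used as-is, so downloading from the releases page is recommended.

Pick the analyzer version that matches your Elasticsearch version and download it from the corresponding page.
Find the downloaded archive, right-click it, and extract it into the current folder.
If you downloaded the source code rather than a prebuilt package, open a command prompt (cmd) in the extracted folder and build it with Maven. Maven must already be installed.
Command:

mvn package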

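After the build finishes, a target folder appears in the current directory. Open it, then open the releases folder inside it.
Right-click the zip file found there and extract it into the current folder.
Open the extracted folder and copy all of its files.
In the Elasticsearch installation directory, create a folder named ik under plugins (the name is arbitrary; pick something easy to remember) and paste the copied files into it.
Restart Elasticsearch so that the plugin is loaded.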
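
The copy steps above can also be scripted. The following is a minimal sketch, assuming the plugin is available as a zip file; the function name install_plugin and every path passed to it are hypothetical. Some plugin archives wrap their files in a single top-level folder, so the sketch copies that folder's contents, mirroring the manual "open the extracted folder and copy all of its files" step.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_plugin(release_zip, es_home, plugin_name):
    """Unpack a plugin zip into <es_home>/plugins/<plugin_name> and return that path."""
    plugin_dir = Path(es_home) / "plugins" / plugin_name
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(release_zip) as archive:
            archive.extractall(tmp)
        entries = list(Path(tmp).iterdir())
        # If everything sits inside one wrapper folder, copy that folder's contents.
        if len(entries) == 1 and entries[0].is_dir():
            source = entries[0]
        else:
            source = Path(tmp)
        # dirs_exist_ok requires Python 3.8+; missing parent folders are created.
        shutil.copytree(source, plugin_dir, dirs_exist_ok=True)
    return plugin_dir
```

A call such as install_plugin("ik.zip", "C:/elasticsearch", "ik") would fill C:/elasticsearch/plugins/ik; both the archive name and the installation path are placeholders for your own values.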

Pinyin Analyzer

Source repository: https://github.com/medcl/elasticsearch-analysis-pinyin

Prebuilt packages are listed at: https://github.com/medcl/elasticsearch-analysis-pinyin/releases

Downloading and installing it works exactly like the IK analyzer; follow the steps above.

Testing the analyzers

Result with Elasticsearch's built-in standard analyzer

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "standard",
  "text" : "我是一名java程序员"
}

The result:

{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "一", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "名", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "java", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 4 },
    { "token": "程", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "序", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "员", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 7 }
  ]
}
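
The same request can be issued from a script. Below is a minimal sketch using only the Python standard library; the helper names build_analyze_request, token_texts and analyze are hypothetical. It sends the body with POST, which the _analyze endpoint accepts in addition to GET, and it sets a Content-Type header. The analyze function needs a running node at the given address, while the other two helpers can be tried offline.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) a request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze?pretty=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response_json):
    """Return the token strings of an _analyze response, in response order."""
    return [entry["token"] for entry in json.loads(response_json)["tokens"]]


def analyze(base_url, analyzer, text):
    """Send the request and return the tokens; requires a reachable node."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return token_texts(response.read().decode("utf-8"))
```

With a node running locally, analyze("http://localhost:9200", "standard", "我是一名java程序员") should return the token strings listed in the response above.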

Using ik_max_word

ik_max_word splits the text at the finest granularity, producing as many words as possible.

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "ik_max_word",
  "text" : "我是一名java程序员"
}

The result:

{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 },
    { "token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "一", "start_offset": 2, "end_offset": 3, "type": "TYPE_CNUM", "position": 3 },
    { "token": "名", "start_offset": 3, "end_offset": 4, "type": "COUNT", "position": 4 },
    { "token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 5 },
    { "token": "程序员", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 6 },
    { "token": "程序", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 7 },
    { "token": "员", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR", "position": 8 }
  ]
}

Using ik_smart

ik_smart splits the text at the coarsest granularity; a stretch of text that has already been assigned to one word is not reused by another word.

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "ik_smart",
  "text" : "我是一名java程序员"
}

The result:

{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 },
    { "token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 3 },
    { "token": "程序员", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 4 }
  ]
}
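
The difference between the two IK modes shows up in the token offsets: ik_max_word emits tokens whose character ranges overlap, while ik_smart does not. The sketch below checks this on the offsets from the two responses in this article; the function name has_overlap is hypothetical, and the tuples are (token, start_offset, end_offset) copied from those responses.

```python
def has_overlap(tokens):
    """Return True if any two tokens cover a common stretch of the source text."""
    spans = sorted((start, end) for _, start, end in tokens)
    # After sorting by start, any overlap shows up between neighbouring spans.
    return any(
        next_start < prev_end
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    )


IK_MAX_WORD = [
    ("我", 0, 1), ("是", 1, 2), ("一名", 2, 4), ("一", 2, 3), ("名", 3, 4),
    ("java", 4, 8), ("程序员", 8, 11), ("程序", 8, 10), ("员", 10, 11),
]
IK_SMART = [
    ("我", 0, 1), ("是", 1, 2), ("一名", 2, 4), ("java", 4, 8), ("程序员", 8, 11),
]

print(has_overlap(IK_MAX_WORD))  # True
print(has_overlap(IK_SMART))     # False
```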

Using pinyin

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "pinyin",
  "text" : "我是一名java程序员"
}

The result:

{
  "tokens": [
    { "token": "wo", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "wsymjavacxy", "start_offset": 0, "end_offset": 11, "type": "word", "position": 0 },
    { "token": "shi", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "yi", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "ming", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "ja", "start_offset": 4, "end_offset": 6, "type": "word", "position": 4 },
    { "token": "v", "start_offset": 6, "end_offset": 7, "type": "word", "position": 5 },
    { "token": "a", "start_offset": 7, "end_offset": 8, "type": "word", "position": 6 },
    { "token": "cheng", "start_offset": 8, "end_offset": 9, "type": "word", "position": 7 },
    { "token": "xu", "start_offset": 9, "end_offset": 10, "type": "word", "position": 8 },
    { "token": "yuan", "start_offset": 10, "end_offset": 11, "type": "word", "position": 9 }
  ]
}
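
The token wsymjavacxy in this response is the first letter of each pinyin syllable joined together, with the Latin run java kept whole. The sketch below only reproduces that observed token from the syllables; it is an illustration of the pattern, not the plugin's actual implementation, and the function name first_letter_token and the (text, is_syllable) input shape are assumptions made for the example.

```python
def first_letter_token(segments):
    """Join the first letter of each pinyin syllable; keep other text whole.

    segments is a list of (text, is_syllable) pairs (hypothetical input shape).
    """
    return "".join(text[0] if is_syllable else text for text, is_syllable in segments)


# Syllables taken from the response above; "java" is plain Latin text.
SEGMENTS = [
    ("wo", True), ("shi", True), ("yi", True), ("ming", True),
    ("java", False),
    ("cheng", True), ("xu", True), ("yuan", True),
]

print(first_letter_token(SEGMENTS))  # wsymjavacxy
```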

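Combining IK and pinyin

Create the index and type
For the meaning of each setting, see this blog post: http://blog.csdn.net/napoay/article/details/73100110

In the settings below, ik_pinyin_analyzer is a user-chosen analyzer name; "type": "custom" marks it as a custom analyzer; "tokenizer": "ik_max_word" selects the tokenization strategy; and the filter list post-processes the tokens, converting them to pinyin (my_pinyin) and splitting them on delimiters (word_delimiter).

PUT http://localhost:9200/demo

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": " "
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "subject": {
          "type": "keyword",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_positions_offsets",
              "analyzer": "ik_pinyin_analyzer",
              "boost": 10
            }
          }
        }
      }
    }
  }
}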
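
A custom analyzer may only name filters that exist, so it can help to check the settings before sending them. The following is a minimal sketch of such a check on the analysis settings from the request above; the function name undefined_filters and the BUILT_IN_FILTERS set are assumptions made for this example, and the set lists only the one built-in filter this article uses.

```python
# Analysis settings copied from the index-creation request in this article.
ANALYSIS = {
    "analyzer": {
        "ik_pinyin_analyzer": {
            "type": "custom",
            "tokenizer": "ik_max_word",
            "filter": ["my_pinyin", "word_delimiter"],
        }
    },
    "filter": {
        "my_pinyin": {"type": "pinyin", "first_letter": "prefix", "padding_char": " "}
    },
}

# Assumption: filters named here ship with Elasticsearch and need no definition.
BUILT_IN_FILTERS = {"word_delimiter"}


def undefined_filters(analysis):
    """Return filter names used by analyzers that are neither defined nor built in."""
    known = set(analysis.get("filter", {})) | BUILT_IN_FILTERS
    used = {
        name
        for analyzer in analysis.get("analyzer", {}).values()
        for name in analyzer.get("filter", [])
    }
    return sorted(used - known)


print(undefined_filters(ANALYSIS))  # []
```

An empty list means every filter the analyzer references is accounted for; a misspelled name such as my_pinyn would be reported instead.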

Index a document

POST http://localhost:9200/demo/article

{
  "subject": "我是一名java程序员"
}

Query in Chinese

POST http://localhost:9200/demo/article/_search

{
  "query": {
    "match": {
      "subject.pinyin": "程序员"
    }
  }
}

The result:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 14.584841,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 14.584841,
        "_source": {
          "subject": "我是一名java程序员"
        }
      }
    ]
  }
}

Query in pinyin

POST http://localhost:9200/demo/article/_search

{
  "query": {
    "match": {
      "subject.pinyin": "chengxuyuan"
    }
  }
}

The result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 4.3648314,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 4.3648314,
        "_source": {
          "subject": "我是一名java程序员"
        }
      }
    ]
  }
}

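Note: with this mapping, searches must target the field with the .pinyin suffix (subject.pinyin). Searching the original subject field returns no results, because subject is mapped as keyword and is therefore compared only against the complete, unanalyzed value.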
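
The Chinese and pinyin searches in this article differ only in the query text, so the request body can come from one small helper. This is a minimal sketch; the function name match_query is hypothetical, and the field name is the sub-field defined in the mapping earlier.

```python
import json


def match_query(field, text):
    """Build the body of a match query against a single field."""
    return {"query": {"match": {field: text}}}


# Both searches target the analyzed sub-field, not the keyword field itself.
for text in ("程序员", "chengxuyuan"):
    print(json.dumps(match_query("subject.pinyin", text), ensure_ascii=False))
```

Each printed line is a body that can be sent with POST to http://localhost:9200/demo/article/_search.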