Elasticsearch 5.x + IK Analyzer + Java: A Tutorial on Configuring Synonyms

Original post: http://blog.csdn.net/u012859681/article/details/60147864

Notes

ES version: 5.1.1
IK version: 5.1.2
Development: Java, TransportClient

http://blog.csdn.net/tianzhaixing2013/article/details/51506496

The article linked above is a tutorial on configuring synonyms with ES 2.x + IK; ES 5.1 differs from it in a few places. I learned a lot from that article, and I thank its author. Building on it, I made changes and experimented until synonyms finally worked. There is very little material online about configuring synonyms on ES 5.x, so after implementing it with the Java API I wrote the process down as a reference for newcomers. (The end of this post also covers the same setup through the REST interface.)

1. Create a Java project (omitted)

2. Create the synonym dictionary

First create the synonym file synonyms.txt under elasticsearch-5.1.1/config, encoded as UTF-8. Then write the synonyms into it, for example:

    儿童, 婴儿, 幼儿, 婴幼儿, 初生儿
    文胸 => 文胸, 内衣

Note that the commas must be ASCII commas. I first typed full-width Chinese commas; the setup then did nothing and reported no error, which made me suspect the whole approach was wrong.

The two ways of writing a synonym rule mean different things:

Comma-separated, as in '儿童, 婴儿' (child, infant). Any number of terms can be listed, as long as they are separated by commas. In this form all the terms are equivalent to one another: when the content being indexed contains '儿童', it is analyzed into '儿童' and '婴儿', and each of them is indexed as a term.

Arrow-separated, as in '文胸 => 文胸, 内衣' (bra => bra, underwear). Several terms can appear on either side of the arrow, again separated by commas, but here the commas do not carry the equivalence meaning of the first form. In this form a term on the left of the arrow is expanded into the terms on the right, while the terms on the right are never expanded into the terms on the left. When the content being indexed contains '文胸', it is analyzed into '文胸' and '内衣' and both are indexed; when the content is '内衣', it is analyzed into '内衣' only.

In short, a comma-separated rule indexes every term in the rule, and an arrow rule indexes every term on the right of the arrow.

3. Define your own analyzers

Elasticsearch provides a token filter of type synonym, so to combine it with IK we define custom analyzers built on the IK tokenizers. Create setting.json in the project's resource folder with the following content:

    {
      "index": {
        "analysis": {
          "analyzer": {
            "by_smart": {
              "type": "custom",
              "tokenizer": "ik_smart",
              "filter": ["by_tfr", "by_sfr"],
              "char_filter": ["by_cfr"]
            },
            "by_max_word": {
              "type": "custom",
              "tokenizer": "ik_max_word",
              "filter": ["by_tfr", "by_sfr"],
              "char_filter": ["by_cfr"]
            }
          },
          "filter": {
            "by_tfr": {
              "type": "stop",
              "stopwords": [" "]
            },
            "by_sfr": {
              "type": "synonym",
              "synonyms_path": "synonyms.txt"
            }
          },
          "char_filter": {
            "by_cfr": {
              "type": "mapping",
              "mappings": ["| => |"]
            }
          }
        }
      }
    }

Here by_smart and by_max_word are the custom analyzers. They use ik_smart and ik_max_word respectively as the tokenizer and, together with the synonym-type filter by_sfr, provide the synonym behavior. The by_cfr entry under char_filter replaces one character sequence with another, for example '& => and'. For details see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

4. Define the mapping for the type in the index

Likewise, create mapping.json in the project's resource folder with the following content:

    {
      "properties": {
        "title": {
          "type": "text",
          "index": "analyzed",
          "analyzer": "by_max_word",
          "search_analyzer": "by_smart"
        }
      }
    }

This sets the analyzers of the title field to the custom analyzers defined in step 3: by_max_word at index time and by_smart at search time. (Note: "index": "analyzed" is the 2.x-style value. The 5.x documentation lists only true and false for index, with true as the default, so this line can probably be omitted.)

5. Call the API to apply the analyzer settings and create the index

    ......
    // Reading the two files is omitted here; each variable holds the file's content as a string.
    String settings = <content of setting.json>;
    String mapping = <content of mapping.json>;
    CreateIndexResponse createIndexResponse = client.admin().indices()
            .prepareCreate("indexname")
            .setSettings(settings)            // analysis settings from step 3
            .addMapping("typename", mapping)  // field mapping from step 4
            .get();
    ......

6. Test

Analysis test:

    curl -XGET 'http://localhost:9200/indexname/_analyze?pretty&analyzer=by_smart' -d '{"text":"儿童"}'

Response:

    {
      "tokens" : [
        {
          "token" : "儿童",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "婴儿",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "SYNONYM",
          "position" : 0
        },
        {
          "token" : "幼儿",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "SYNONYM",
          "position" : 0
        },
        {
          "token" : "婴幼儿",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "SYNONYM",
          "position" : 0
        },
        {
          "token" : "初生儿",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "SYNONYM",
          "position" : 0
        }
      ]
    }

7. REST interface (equivalent to steps 3, 4 and 5)

Apply the analyzer settings:

    curl -XPUT 'http://localhost:9200/indexname' -d'
    {
      "index": {
        "analysis": {
          "analyzer": {
            "by_smart": {
              "type": "custom",
              "tokenizer": "ik_smart",
              "filter": ["by_tfr", "by_sfr"],
              "char_filter": ["by_cfr"]
            },
            "by_max_word": {
              "type": "custom",
              "tokenizer": "ik_max_word",
              "filter": ["by_tfr", "by_sfr"],
              "char_filter": ["by_cfr"]
            }
          },
          "filter": {
            "by_tfr": {
              "type": "stop",
              "stopwords": [" "]
            },
            "by_sfr": {
              "type": "synonym",
              "synonyms_path": "synonyms.txt"
            }
          },
          "char_filter": {
            "by_cfr": {
              "type": "mapping",
              "mappings": ["| => |"]
            }
          }
        }
      }
    }'

Set the mapping for the type:

    curl -XPUT 'http://localhost:9200/indexname/_mapping/typename' -d'
    {
      "properties": {
        "title": {
          "type": "text",
          "index": "analyzed",
          "analyzer": "by_max_word",
          "search_analyzer": "by_smart"
        }
      }
    }'

That is everything in this post. If anything is wrong, corrections are welcome; if anything is unclear, let's discuss it.
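As a supplement to section 2, the difference between the two rule formats can be sketched in a few lines of plain Java. This is only a toy model for illustration, not Elasticsearch's actual synonym parser: the class name SynonymRules and the methods parse and expand are hypothetical, and the model ignores tokenization, positions and offsets. It also splits on ASCII commas only, which reproduces the pitfall described above: a rule written with a full-width comma produces no expansion and no error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the two rule formats in synonyms.txt (hypothetical, not Elasticsearch code). */
public class SynonymRules {

    /** Builds a map from each input term to the list of terms it is expanded into. */
    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> expansions = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            if (line.contains("=>")) {
                // Arrow rule: only the terms on the left are expanded, into the terms on the right.
                String[] sides = line.split("=>", 2);
                List<String> targets = splitTerms(sides[1]);
                for (String source : splitTerms(sides[0])) {
                    expansions.put(source, targets);
                }
            } else {
                // Comma rule: all terms are equivalent, so each one expands to the whole group.
                List<String> group = splitTerms(line);
                for (String term : group) {
                    expansions.put(term, group);
                }
            }
        }
        return expansions;
    }

    /** Splits on ASCII commas only, so a full-width comma stays inside a single term. */
    private static List<String> splitTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Returns the terms that would be indexed for a token; a token without a rule maps to itself. */
    public static List<String> expand(Map<String, List<String>> expansions, String token) {
        return expansions.getOrDefault(token, Arrays.asList(token));
    }

    public static void main(String[] args) {
        Map<String, List<String>> rules = parse(Arrays.asList(
                "儿童, 婴儿, 幼儿, 婴幼儿, 初生儿",
                "文胸 => 文胸, 内衣"));
        System.out.println(expand(rules, "儿童")); // the whole equivalent group
        System.out.println(expand(rules, "文胸")); // [文胸, 内衣]
        System.out.println(expand(rules, "内衣")); // [内衣], because the arrow rule is one-way
    }
}
```

Running main with the two rules from synonyms.txt shows that '儿童' expands to the whole group, '文胸' expands to '文胸' and '内衣', and '内衣' stays as it is, which matches the behavior described in section 2.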