Getting Started with Elasticsearch: ES Analyzers and Custom Analyzers

By 热爱养熊养花的白兔, 2023-04-12


A brief introduction to analyzers

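An analyzer is a component of ES. Put simply, it is a tool that breaks a piece of text into terms according to a set of rules and normalizes those terms along the way. ES runs every field of type text through an analyzer and arranges the resulting terms into an inverted index, which is a large part of why ES queries are so fast.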
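To make the idea concrete, here is a minimal sketch of the analyze-then-index flow in plain Python. It is not how ES or Lucene is implemented; the `analyze` and `build_inverted_index` helpers and the two sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on non-word characters, then lowercase each term."""
    return [tok.lower() for tok in re.split(r"\W+", text) if tok]


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Elasticsearch is fast",
    2: "Fast search with an inverted index",
}
index = build_inverted_index(docs)

print(index["fast"])           # [1, 2]
print(index["elasticsearch"])  # [1]
```

A lookup is a single dictionary access on the normalized term, which is the intuition behind the speed of term-based queries: the expensive text processing happens once, at index time.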

ES ships with several built-in analyzers. Their characteristics and purposes are summarized below:

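Analyzer	Purpose
Standard	The ES default analyzer; splits text on word boundaries and lowercases
Simple	Splits on non-letter characters, discards them, and lowercases
Stop	Removes stop words (such as the, a, is) and lowercases
Whitespace	Splits on whitespace
Language	Analyzers for more than 30 common languages
Pattern	Splits using a regular expression; the default is \W+, which matches runs of non-word characters
Keyword	No tokenization; the whole input is emitted as a single token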
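The differences between these analyzers are easier to see side by side. The sketch below approximates five of them with Python's `re` module. These are rough stand-ins, not the Lucene implementations: the letter matching is ASCII-only, the stop list contains only the three words named in the table, and the function names and sample sentence are invented for this example.

```python
import re

# Only the three stop words named in the table; the real default list is longer.
STOP_WORDS = {"the", "a", "is"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Keep runs of (ASCII) letters and lowercase them."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def stop_analyzer(text):
    """Simple analysis followed by stop-word removal."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]


def pattern_analyzer(text, pattern=r"\W+"):
    """Split wherever the pattern matches, then lowercase."""
    return [t.lower() for t in re.split(pattern, text) if t]


def keyword_analyzer(text):
    """Emit the whole input as one token."""
    return [text]


sample = "The QUICK brown-fox is 2 fast"
for analyzer in (whitespace_analyzer, simple_analyzer, stop_analyzer,
                 pattern_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(sample))
```

Note how the digit 2 survives the pattern split (digits are word characters) but not the simple analyzer, which keeps letters only.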

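These analyzers are designed for words and letters, and for that purpose they cover most needs. Chinese is different: characters combine into words, and several characters together often express a single meaning, so the analyzers above fall short. Several analyzers exist for Chinese, such as IK, jieba, and THULAC; IK is the most widely used.
Beyond Chinese characters, pinyin is also used constantly: input methods, Baidu's search box, and similar tools all support pinyin-based suggestions. So once data is stored in ES, how can it be found by pinyin? The pinyin analyzer addresses this; its developer is also the creator of the IK analyzer.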

Comparing the output of different analyzers

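With so many analyzers available, how should one be chosen for a given business requirement? A practical approach is to compare what each analyzer actually produces and pick accordingly. The outputs follow.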

Using "text": "白兔万岁A*" as the example:

standard analyzer: the ES default. Chinese text is split into individual characters, and special characters are dropped.

{
    "tokens": [
        {
            "token": "白",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "兔",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "万",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "岁",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "a",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}

ik analyzer: suited to finding content by whole words. In this output the special character and the English letter are dropped as well.

{
    "tokens": [
        {
            "token": "白兔",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "万岁",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "万",
            "start_offset": 2,
            "end_offset": 3,
            "type": "TYPE_CNUM",
            "position": 2
        },
        {
            "token": "岁",
            "start_offset": 3,
            "end_offset": 4,
            "type": "COUNT",
            "position": 3
        }
    ]
}

pinyin analyzer: suited to finding a field's content by its pinyin. Special characters are likewise dropped.

{
    "tokens": [
        {
            "token": "bai",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "btwsa",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "tu",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 1
        },
        {
            "token": "wan",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 2
        },
        {
            "token": "sui",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 3
        },
        {
            "token": "a",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 4
        }
    ]
}
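Output like the three listings above can be obtained from the `_analyze` REST endpoint. The following is a sketch using only the Python standard library. The base URL `http://localhost:9200` is an assumption, as is the presence of the IK and pinyin plugins on the cluster; the original does not state which IK mode produced its listing, so `ik_max_word` in the usage note is a guess.

```python
import json
import urllib.request

SAMPLE_TEXT = "白兔万岁A*"


def build_analyze_body(analyzer, text=SAMPLE_TEXT):
    """Build the JSON body for a POST to /_analyze."""
    return {"analyzer": analyzer, "text": text}


def run_analyze(analyzer, text=SAMPLE_TEXT, base_url="http://localhost:9200"):
    """Send the request and return the token strings.

    Needs a reachable cluster; base_url is an assumed local default.
    """
    payload = json.dumps(build_analyze_body(analyzer, text)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [tok["token"] for tok in result["tokens"]]


# Building the body needs no cluster.
print(json.dumps(build_analyze_body("standard"), ensure_ascii=False))
```

With a cluster running, `run_analyze("standard")`, `run_analyze("ik_max_word")`, and `run_analyze("pinyin")` would return the token lists to compare.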

Using custom analyzers

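Different analyzers produce different tokens. Suppose we need a fuzzy search feature that queries the same ES field in several ways, like Baidu's search box: by a simple word or a single character, and also by pinyin. A single analyzer can hardly meet this requirement.
In that case, consider declaring additional sub-fields (fields) on the indexed field, each with its own analyzer. For example, one field can be given several analyzers: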

mapping:
{
    "properties":{
        "name":{
            "type":"text",
            "analyzer":"ik_max_word",
            "fields":{
                "PY":{
                    "type":"text",
                    "analyzer":"pinyin"
                }
            }
        }
    }
}
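Once a field carries a sub-field, a query can target both at once. The sketch below builds a `multi_match` query body, assuming `PY` is declared as a sub-field of `name` so that it is addressed as `name.PY`. The helper name `build_name_query` is hypothetical.

```python
import json


def build_name_query(user_input):
    """Search the IK-analyzed field and its pinyin sub-field together."""
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                # "name" uses ik_max_word; "name.PY" uses the pinyin analyzer.
                "fields": ["name", "name.PY"],
            }
        }
    }


# The same builder serves Chinese input and pinyin input.
print(json.dumps(build_name_query("白兔"), ensure_ascii=False))
print(json.dumps(build_name_query("baitu"), ensure_ascii=False))
```

Each field analyzes the query text with its own analyzer, so the caller does not need to detect whether the user typed characters or pinyin.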

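For more freedom with ES analysis there is another option: a custom analyzer. As the name suggests, a custom analyzer combines existing tokenizers and token filters, together with their settings, into an analyzer that fits your needs. For example, if we want both word-based suggestions for a sentence and the convenience of pinyin input that resolves to words, we can define a dedicated analyzer. The following defines one that combines IK with pinyin: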

{
    "analysis":{
        "analyzer":{
            "my_max_analyzer":{
                "tokenizer":"ik_max_word",
                "filter":"py"
            },
            "my_smart_analyzer":{
                "tokenizer":"ik_smart",
                "filter":"py"
            }
        },
        "filter":{
            "py":{
                "type":"pinyin",
                "first_letter":"prefix",
                "keep_separate_first_letter":true,
                "keep_full_pinyin":true,
                "keep_joined_full_pinyin":true,
                "keep_original":true,
                "limit_first_letter_length":16,
                "lowercase":true,
                "remove_duplicated_term":true
            }
        }
    }
}
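An analysis block like this is supplied under `settings` when the index is created, and a custom analyzer must name a tokenizer. The sketch below assembles an index-creation body and checks for analyzers whose tokenizer is missing or empty before anything is sent. It assumes ES 7 or later (typeless mappings). The `ik_smart` tokenizer for `my_smart_analyzer`, the reduced filter options, and both helper names are assumptions made for illustration.

```python
import json

# Trimmed-down analysis settings; option values here are illustrative.
ANALYSIS = {
    "analyzer": {
        "my_max_analyzer": {"tokenizer": "ik_max_word", "filter": "py"},
        "my_smart_analyzer": {"tokenizer": "ik_smart", "filter": "py"},
    },
    "filter": {
        "py": {
            "type": "pinyin",
            "keep_full_pinyin": True,
            "keep_joined_full_pinyin": True,
            "keep_original": True,
        }
    },
}


def analyzers_missing_tokenizer(analysis):
    """Return names of analyzers whose tokenizer is absent or empty."""
    return sorted(
        name
        for name, spec in analysis.get("analyzer", {}).items()
        if not spec.get("tokenizer")
    )


def build_index_body(analysis, field="name", analyzer="my_max_analyzer"):
    """Combine analysis settings with a mapping that uses one of the analyzers."""
    return {
        "settings": {"analysis": analysis},
        "mappings": {
            "properties": {field: {"type": "text", "analyzer": analyzer}}
        },
    }


print(analyzers_missing_tokenizer(ANALYSIS))  # []
print(json.dumps(build_index_body(ANALYSIS), ensure_ascii=False, indent=2))
```

The resulting body would be sent with a PUT to the index URL; running the check first catches an empty tokenizer locally instead of as an error from the cluster.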

With this in place, "白兔万岁A*" is tokenized as follows:

{
    "tokens": [
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "bai",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "白兔",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "baitu",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "bt",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "t",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "tu",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "w",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "wan",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "s",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "sui",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "万岁",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "wansui",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "ws",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "w",
            "start_offset": 2,
            "end_offset": 3,
            "type": "TYPE_CNUM",
            "position": 4
        },
        {
            "token": "wan",
            "start_offset": 2,
            "end_offset": 3,
            "type": "TYPE_CNUM",
            "position": 4
        },
        {
            "token": "万",
            "start_offset": 2,
            "end_offset": 3,
            "type": "TYPE_CNUM",
            "position": 4
        },
        {
            "token": "s",
            "start_offset": 3,
            "end_offset": 4,
            "type": "COUNT",
            "position": 5
        },
        {
            "token": "sui",
            "start_offset": 3,
            "end_offset": 4,
            "type": "COUNT",
            "position": 5
        },
        {
            "token": "岁",
            "start_offset": 3,
            "end_offset": 4,
            "type": "COUNT",
            "position": 5
        }
    ]
}
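A long token list like this is easier to read when grouped by position, since tokens that share a position are alternative forms of the same slot. Here is a small sketch; `group_by_position` is a hypothetical helper, and `sample_tokens` reproduces only the first seven entries of the output above, reduced to the two keys the helper needs.

```python
from collections import defaultdict


def group_by_position(tokens):
    """Collect token strings under their position, in ascending position order."""
    grouped = defaultdict(list)
    for tok in tokens:
        grouped[tok["position"]].append(tok["token"])
    return dict(sorted(grouped.items()))


# First seven tokens from the listing above.
sample_tokens = [
    {"token": "b", "position": 0},
    {"token": "bai", "position": 0},
    {"token": "白兔", "position": 0},
    {"token": "baitu", "position": 0},
    {"token": "bt", "position": 0},
    {"token": "t", "position": 1},
    {"token": "tu", "position": 1},
]

for position, forms in group_by_position(sample_tokens).items():
    print(position, forms)
```

Position 0 then shows the original word next to its first letter, full pinyin, joined pinyin, and initials, which are the forms a query can match on.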
