Elasticsearch's _source, _all, store, and index Properties Explained with Diagrams

Posted by Hanrea on 2017-4-7 16:33:18
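Several key properties in Elasticsearch are easy to confuse. Many people are unsure what the _source field actually stores, how setting store to true or false relates to _source, how store relates to _all, and what the index property does. When should store be set to true? When should the _all field be enabled? This article uses diagrams to explain the _source, _all, store, and index properties in Elasticsearch.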
Figure 1. Breakdown of the _source, _all, store, and index properties in Elasticsearch

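As Figure 1 shows, the second quadrant holds an original document with two fields, title and content, whose values are "我是中国人" ("I am Chinese") and "热爱共产党" ("love the Communist Party"). When we write this document to Elasticsearch, by default Elasticsearch keeps two copies of the content. The first is the original document, which is what the _source field holds. When we search in Elasticsearch, the document content we see in the results is the content of _source, as in Figure 2, a view most readers will recognize.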
Figure 2. Example of the _source field

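The second copy is the inverted index. Its data structure is the postings list, which records the mapping between terms and documents: for example, that the keyword "中国人" appears in the document with ID 1, along with further information such as term frequency. Elasticsearch is built on the Lucene API, and it can do full-text search precisely because it stores an inverted index. Take the inverted index away, and doesn't Elasticsearch look a lot like MongoDB?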
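The postings list described above can be pictured with a small Python sketch. It is only an illustration, not how Lucene is implemented: the tokenizer is a toy longest-match splitter standing in for a real analyzer such as ik, and the names tokenize, build_inverted_index, search, and VOCABULARY are assumptions made for this example.

```python
from collections import defaultdict

def tokenize(text, vocabulary):
    """Toy longest-match tokenizer (a stand-in for a real analyzer such as ik)."""
    terms, i = [], 0
    while i < len(text):
        # Try the longest vocabulary word that starts at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocabulary:
                terms.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary word matched: emit the single character as a term.
            terms.append(text[i])
            i += 1
    return terms

def build_inverted_index(docs, vocabulary):
    """Map each term to {doc_id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text, vocabulary):
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return postings

def search(query, index, vocabulary):
    """Analyze the query into terms and return the IDs of documents containing any of them."""
    doc_ids = set()
    for term in tokenize(query, vocabulary):
        doc_ids.update(index.get(term, {}))
    return sorted(doc_ids)

# Hypothetical vocabulary and documents, chosen to mirror the article's example.
VOCABULARY = {"我", "是", "中国人", "热爱", "共产党"}
docs = {1: "我是中国人", 2: "我热爱共产党"}
index = build_inverted_index(docs, VOCABULARY)

print(dict(index["中国人"]))                 # {1: 1}
print(search("中国人", index, VOCABULARY))   # [1]
```

The index holds only terms and document IDs. Returning the document text itself needs a second copy of the content, which is the role _source plays.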
When a document is indexed into Elasticsearch, by default every field is indexed (fields that dynamic mapping resolves to numeric or boolean types are the exception in that they are not run through text analysis). Whether a field is written to the inverted index is controlled by the field's index property. Before Elasticsearch 5, index took one of three values:

  • analyzed: the field is indexed and tokenized, so it can be searched by individual terms. Conversely, if you need full-text search on a field, set index to analyzed.
  • not_analyzed: the field value is not tokenized and is written to the index verbatim. Conversely, for fields that need exact matching, such as personal names or place names, not_analyzed is the better choice.
  • no: the field is not written to the index and therefore cannot be searched. Conversely, if the business requires that a field not be searchable, set index to no.
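The difference between the three values can be sketched in Python. This is an illustration under simplifying assumptions, not Elasticsearch code: the function name indexed_terms is invented for this example, and the "analyzer" merely lowercases and splits on whitespace.

```python
def indexed_terms(value, index_mode):
    """Return the terms that would be written to the inverted index for one field value."""
    if index_mode == "analyzed":
        # Stand-in analyzer: lowercase, then split on whitespace.
        return value.lower().split()
    if index_mode == "not_analyzed":
        # The whole value becomes one term, unchanged.
        return [value]
    if index_mode == "no":
        # Nothing is indexed, so the field cannot be searched.
        return []
    raise ValueError(f"unknown index mode: {index_mode!r}")

print(indexed_terms("New York", "analyzed"))      # ['new', 'york']
print(indexed_terms("New York", "not_analyzed"))  # ['New York']
print(indexed_terms("New York", "no"))            # []
```

With not_analyzed, only a query for the exact value "New York" can match, which is why it suits names and other exact-match fields.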
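Now the _all field. As the name suggests, _all contains all the information in a document; it is a catch-all "super field". Taking the document in the figure as an example, if _all is enabled, title and content are combined into one super field that holds the content of all the other fields. You can also configure it so that only certain fields are copied into _all, or so that certain fields are excluded.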
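As a rough picture of how such a super field is assembled, here is a minimal Python sketch. The function name build_all_field is hypothetical, and the space-separated join is an assumption for illustration; only the include_in_all flag comes from the Elasticsearch mapping.

```python
def build_all_field(doc, field_mappings):
    """Concatenate the values of every field that is not excluded via include_in_all."""
    parts = []
    for field, value in doc.items():
        # Fields are included unless their mapping sets include_in_all to False.
        if field_mappings.get(field, {}).get("include_in_all", True):
            parts.append(str(value))
    return " ".join(parts)

doc = {"title": "我是中国人", "content": "热爱共产党"}

print(build_all_field(doc, {}))                                       # 我是中国人 热爱共产党
print(build_all_field(doc, {"content": {"include_in_all": False}}))   # 我是中国人
```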
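Back to the first quadrant of Figure 1. The user enters the keyword "中国人"; after analysis, Elasticsearch looks up in the postings list which documents contain the term "中国人". Note the shift: before analysis, "中国人" is the user's query; after analysis, "中国人" in the inverted index is a term. Elasticsearch then uses the document ID (usually a set of document IDs) to return the document content to the user, as the fourth quadrant of Figure 1 shows.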
Keywords from the user's query are usually highlighted in the results, as Figure 3 shows:
Figure 3. Keyword highlighting in a search engine
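Keyword highlighting essentially uses the term offsets recorded in the postings to locate the keyword, then wraps it in front-end highlighting markup. This is where the store property comes in. store specifies whether the original field value is also written to the index as a stored field, and it defaults to false. In plain Lucene, highlighting depends closely on whether the field is stored, because the highlighter has to go back to the original text at the recorded offsets to build the highlighted snippet. In Elasticsearch, _source already keeps a copy of the original document and highlighting can work from that, so storing the original value in the index a second time would be redundant. That is why Elasticsearch sets store to false by default.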
Note: to highlight a field, you must keep at least one of _source and store for it. Test code is given later in this article.
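The rule in this note can be sketched in Python. This is a simplified model, not the Elasticsearch highlighter: the function highlight and its parameters are invented for illustration, a plain string replacement stands in for offset-based highlighting, and the order in which the two copies are consulted is an arbitrary choice of this sketch.

```python
def highlight(field, term, source, stored_fields):
    """Wrap each occurrence of term in <em> tags.

    Returns None when neither a stored value nor _source holds the original text.
    """
    if field in stored_fields:
        # The field was mapped with store enabled.
        text = stored_fields[field]
    elif source is not None and field in source:
        # Fall back to the copy of the document kept in _source.
        text = source[field]
    else:
        # No original text is available, so there is nothing to highlight.
        return None
    return text.replace(term, f"<em>{term}</em>")

# _source disabled, title stored: highlighting still works.
print(highlight("title", "中国人", None, {"title": "我是中国人"}))  # 我是<em>中国人</em>
# _source disabled and nothing stored: no highlight is possible.
print(highlight("title", "中国人", None, {}))                       # None
```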
That answers the questions raised at the start of the article. The commonly used configurations for these fields follow.

I. _source configuration
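_source is stored by default. When is it unnecessary to keep it? Suppose a field holds a large amount of content, the application only needs to search on that field and get back document IDs, and the document content is then fetched from MySQL or HBase. Storing the large field's content in Elasticsearch only inflates the index, and the effect grows with the number of documents: saving a few KB per document adds up to a substantial amount across hundreds of millions of documents.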
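To put a number on that, here is a back-of-the-envelope calculation in Python. The figures (2 KB saved per document, 100 million documents) are assumptions picked for illustration, and the result is raw size before any compression the index applies.

```python
def source_savings_gib(bytes_saved_per_doc, doc_count):
    """Raw space saved, in GiB, when each document shrinks by the given number of bytes."""
    return bytes_saved_per_doc * doc_count / 1024**3

# Assumed workload: 2 KB saved per document across 100 million documents.
print(round(source_savings_gib(2 * 1024, 100_000_000), 1))  # 190.7
```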
To disable the _source field, configure the mapping as follows:

    {
        "yourtype": {
            "_source": {
                "enabled": false
            },
            "properties": {
                ...
            }
        }
    }
To keep the original values of only certain fields in Elasticsearch, use the includes parameter in the mapping:

    {
        "yourtype": {
            "_source": {
                "includes": ["field1", "field2"]
            },
            "properties": {
                ...
            }
        }
    }
Likewise, the excludes parameter leaves out specific fields:

    {
        "yourtype": {
            "_source": {
                "excludes": ["field1", "field2"]
            },
            "properties": {
                ...
            }
        }
    }
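The effect of includes and excludes on what ends up in _source can be modeled with a short Python sketch. It is an approximation for illustration only: filter_source is a hypothetical name, the sketch handles top-level fields only, and it uses shell-style wildcard matching from the standard library to mimic pattern support.

```python
from fnmatch import fnmatchcase

def filter_source(doc, includes=None, excludes=None):
    """Return the top-level fields that would be kept in _source."""
    kept = {}
    for field, value in doc.items():
        # With includes set, a field must match at least one include pattern.
        if includes and not any(fnmatchcase(field, p) for p in includes):
            continue
        # A field matching any exclude pattern is dropped.
        if excludes and any(fnmatchcase(field, p) for p in excludes):
            continue
        kept[field] = value
    return kept

doc = {"field1": "a", "field2": "b", "field3": "c"}

print(filter_source(doc, includes=["field1", "field2"]))  # {'field1': 'a', 'field2': 'b'}
print(filter_source(doc, excludes=["field1", "field2"]))  # {'field3': 'c'}
```

Fields filtered out this way are still indexed and searchable; they are simply missing from the document returned in search results.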
To test this, first create an index:

    PUT test
Set the mapping with _source disabled:

    PUT test/test/_mapping
    {
        "test": {
            "_source": {
                "enabled": false
            }
        }
    }
Index a document:

    POST test/test/1
    {
        "title": "我是中国人",
        "content": "热爱共产党"
    }
Search for the keyword "中国人":

    GET test/_search
    {
        "query": {
            "match": {
                "title": "中国人"
            }
        }
    }

The response:

    {
        "took": 9,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
        },
        "hits": {
            "total": 1,
            "max_score": 0.30685282,
            "hits": [
                {
                    "_index": "test",
                    "_type": "test",
                    "_id": "1",
                    "_score": 0.30685282
                }
            ]
        }
    }
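The response shows that one document matched, but with _source disabled the results no longer include the original document content. (Note: the test was run on Elasticsearch 2.3.3, with the ik analyzer set as the default in the configuration file.)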
II. _all configuration
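In the Elasticsearch 2.x line used for these tests, the _all field is enabled by default (from 6.0 onward it is deprecated and disabled by default). Keeping _all enabled inevitably makes the index larger. It suits searches that do not target a specific field and instead match a keyword against the whole document.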
_all is switched on or off in the same way as _source. To enable it explicitly, the mapping is:

    {
        "yourtype": {
            "_all": {
                "enabled": true
            },
            "properties": {
                ...
            }
        }
    }
You can also specify on each field whether it is included in _all:

    {
        "yourtype": {
            "properties": {
                "field1": {
                    "type": "string",
                    "include_in_all": false
                },
                "field2": {
                    "type": "string",
                    "include_in_all": true
                }
            }
        }
    }
To keep a field's original value, set its store property to true. This makes the index larger still, so use it only when needed. Test code follows.
Create the test index:

    DELETE test

    PUT test
Set the mapping. Here every field is included in _all and the original value of _all is stored:

    PUT test/test/_mapping
    {
        "test": {
            "_all": {
                "enabled": true,
                "store": true
            }
        }
    }
Insert a document:

    POST test/test/1
    {
        "title": "我是中国人",
        "content": "热爱共产党"
    }
Search the _all field with highlighting:

    POST test/_search
    {
        "fields": ["_all"],
        "query": {
            "match": {
                "_all": "中国人"
            }
        },
        "highlight": {
            "fields": {
                "_all": {}
            }
        }
    }

The response:

    {
        "took": 3,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
        },
        "hits": {
            "total": 1,
            "max_score": 0.15342641,
            "hits": [
                {
                    "_index": "test",
                    "_type": "test",
                    "_id": "1",
                    "_score": 0.15342641,
                    "_all": "我是中国人 热爱共产党 ",
                    "highlight": {
                        "_all": [
                            "我是<em>中国人</em> 热爱共产党 "
                        ]
                    }
                }
            ]
        }
    }
In Elasticsearch, query_string and simple_query_string query the _all field by default, for example:

    GET test/_search
    {
        "query": {
            "query_string": {
                "query": "共产党"
            }
        }
    }
III. index and store configuration
The index and store properties are set on individual fields. In the example below, the test index does not keep _source, the title field is indexed but not analyzed and its original value is written to the index, and the content field uses the default settings:

    DELETE test

    PUT test

    PUT test/test/_mapping
    {
        "test": {
            "_source": {
                "enabled": false
            },
            "properties": {
                "title": {
                    "type": "string",
                    "index": "not_analyzed",
                    "store": "true"
                },
                "content": {
                    "type": "string"
                }
            }
        }
    }
Search the title field with highlighting. Because the index was recreated, this assumes the same document has been indexed again with POST test/test/1 as before:

    GET test/_search
    {
        "query": {
            "match": {
                "title": "我是中国人"
            }
        },
        "highlight": {
            "fields": {
                "title": {}
            }
        }
    }

The response:

    {
        "took": 6,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
        },
        "hits": {
            "total": 1,
            "max_score": 0.30685282,
            "hits": [
                {
                    "_index": "test",
                    "_type": "test",
                    "_id": "1",
                    "_score": 0.30685282,
                    "highlight": {
                        "title": [
                            "<em>我是中国人</em>"
                        ]
                    }
                }
            ]
        }
    }
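The response shows that even though the title field is not saved in _source, search highlighting still works, because its original value is stored in the index.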
IV. Summary
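Using diagrams and code tests, this article has walked through _source, _all, store, and index in Elasticsearch; hopefully they are now easy to tell apart. Corrections for any errors or omissions are welcome.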
