ElasticSearch + Kibana - Unique count using pre-computed hashes

Question

Update: added

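I want to run a unique count on my ElasticSearch cluster. The cluster contains about 50 million records.

I have tried the following approaches:

First approach

As mentioned in this section: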

Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
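For context, a unique count over a pre-computed hash is normally expressed as a cardinality aggregation that targets the hash sub-field. The request body below is only a minimal sketch, not the query Kibana actually generated (that one is linked further down); the aggregation name unique_my_prop is a made-up label, and my_prop.hash is the sub-field from the mapping in this question.

```json
{
  "size": 0,
  "aggs": {
    "unique_my_prop": {
      "cardinality": {
        "field": "my_prop.hash"
      }
    }
  }
}
```

Setting "size" to 0 just suppresses the search hits so that only the aggregation result comes back.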

Second approach

As mentioned in this section:

Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.

My property mapping

"my_prop": {
  "index": "not_analyzed",
  "fielddata": {
    "format": "doc_values"
  },
  "doc_values": true,
  "type": "string",
  "fields": {
    "hash": {
      "type": "murmur3"
    }
  }
}

Problem

When I use a unique count on my_prop.hash in Kibana, I get the following error:

Data too large, data for [my_prop.hash] would be larger than limit

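The ElasticSearch heap size is 2g. The same operation also fails on a single index with 4 million records.

My questions

Am I missing something in my configuration?
Should I upsize my machine? That does not seem to be a scalable solution.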

ElasticSearch query

Generated by Kibana: http://pastebin.com/hf1yNLhE

ElasticSearch stack trace

http://pastebin.com/BFTYUsVg

Answer 1

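The error says that you do not have enough memory (more specifically, memory for fielddata) to hold all the values of hash, so you need to take them off the heap and put them on disk, which means using doc_values.

Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (and no, the settings of the main field are not inherited by its sub-fields): "hash": { "type": "murmur3", "index" : "no", "doc_values" : true }.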
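Put together with the mapping from the question, the suggestion above would look roughly like the sketch below. The main field is left exactly as posted; only the hash sub-field changes. The outer braces are added here just to make the fragment valid JSON on its own; in a real mapping this entry would sit under the "properties" of your type.

```json
{
  "my_prop": {
    "index": "not_analyzed",
    "fielddata": {
      "format": "doc_values"
    },
    "doc_values": true,
    "type": "string",
    "fields": {
      "hash": {
        "type": "murmur3",
        "index": "no",
        "doc_values": true
      }
    }
  }
}
```

Because doc_values are written at index time, documents that are already indexed would most likely need to be reindexed before the hash sub-field stops loading fielddata onto the heap.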