Elasticsearch: Using the results of a Metric Aggregation to filter the elements of a bucket and run additional aggregations

Question

Given a dataset like this:

[{
    "type": "A",
    "value": 32
}, {
    "type": "A",
    "value": 34
}, {
    "type": "B",
    "value": 35
}]

I would like to perform the following aggregations:

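First, I want to group by "type" using a terms aggregation.
Next, I want to compute some metrics for each bucket using extended_stats.
Knowing the std_deviation_bounds (upper and lower), I would like to compute the average of the bucket's elements, excluding those that fall outside the range [std_deviation_bounds.lower, std_deviation_bounds.upper].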

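The first and second points of my list are trivial. I would like to know whether the third point is possible, that is, using the result of a sibling metric aggregation to filter out elements of a bucket and recompute the average. If it is, I would like to know which aggregation structure I need to use.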
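
To make the target computation concrete, here is a minimal sketch of the three steps in plain Python, run client-side on the example dataset rather than inside Elasticsearch. The function name `trimmed_average_by_type` is made up for illustration, and the sketch assumes a population standard deviation and a default sigma of 2, which is my understanding of how extended_stats derives std_deviation_bounds.

```python
from collections import defaultdict
from statistics import fmean, pstdev

docs = [
    {"type": "A", "value": 32},
    {"type": "A", "value": 34},
    {"type": "B", "value": 35},
]


def trimmed_average_by_type(docs, sigma=2.0):
    """Group by type, then average only the values inside avg +/- sigma * std."""
    # Step 1: group the values by "type" (what the terms aggregation does).
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["type"]].append(doc["value"])

    result = {}
    for key, values in groups.items():
        # Step 2: per-group statistics (assumption: population std deviation).
        avg = fmean(values)
        std = pstdev(values)
        lower, upper = avg - sigma * std, avg + sigma * std
        # Step 3: drop values outside the bounds and recompute the average.
        kept = [v for v in values if lower <= v <= upper]
        result[key] = fmean(kept) if kept else None
    return result


print(trimmed_average_by_type(docs))  # {'A': 33.0, 'B': 35.0}
```

With only three documents nothing is filtered out; the filter only matters once a group contains values far from its mean.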

The Elasticsearch instance is version 5.0.0.

Answer 1

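Well, OP here.

I still don't know whether Elasticsearch allows formulating an aggregation the way I described in the original question.

What I did to solve this was take a different approach. I'm posting it here in case it helps anyone else.

So,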

POST hostname:9200/index/type/_search

with the body

{
    "query": {
        "match_all": {}
    },
    "size": 0,
    "aggs": {
        "group": {
            "terms": {
                "field": "type"
            },
            "aggs": {
                "histogramAgg": {
                    "histogram": {
                        "field": "value",
                        "interval": 10,
                        "offset": 0,
                        "order": {
                            "_key": "asc"
                        },
                        "keyed": true,
                        "min_doc_count": 0
                    },
                    "aggs": {
                        "statsAgg": {
                            "stats": {
                                "field": "value"
                            }
                        }
                    }
                },
                "extStatsAgg": {
                    "extended_stats": {
                        "field": "value",
                        "sigma": 2
                    }
                }
            }
        }
    }
}
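
Since the histogram interval and sigma are the two values most likely to be tuned, it can be convenient to generate the request body rather than hand-edit it. The following is only a sketch: `build_search_body` and its parameter names are hypothetical helpers, and the resulting dictionary mirrors the JSON body above field for field. Sending it to the cluster is left out.

```python
import json


def build_search_body(group_field="type", value_field="value", interval=10, sigma=2):
    """Return the request body shown above as a dict, with tunable interval and sigma."""
    return {
        "query": {"match_all": {}},
        "size": 0,
        "aggs": {
            "group": {
                "terms": {"field": group_field},
                "aggs": {
                    # Histogram of the values, with stats for every histogram bucket.
                    "histogramAgg": {
                        "histogram": {
                            "field": value_field,
                            "interval": interval,
                            "offset": 0,
                            "order": {"_key": "asc"},
                            "keyed": True,
                            "min_doc_count": 0,
                        },
                        "aggs": {"statsAgg": {"stats": {"field": value_field}}},
                    },
                    # Sibling aggregation that yields std_deviation_bounds per group.
                    "extStatsAgg": {
                        "extended_stats": {"field": value_field, "sigma": sigma}
                    },
                },
            }
        },
    }


print(json.dumps(build_search_body(interval=5), indent=4))
```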

The request above produces a result like this:

{
    "took": 100,
    "timed_out": false,
    "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
    },
    "hits": {
        "total": 100000,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "group": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{
                "key": "A",
                "doc_count": 10000,
                "histogramAgg": {
                    "buckets": {
                        "0.0": {
                            "key": 0.0,
                            "doc_count": 1234,
                            "statsAgg": {
                                "count": 1234,
                                "min": 0.0,
                                "max": 9.0,
                                "avg": 0.004974220783280196,
                                "sum": 7559.0
                            }
                        },
                        "10.0": {
                            "key": 10.0,
                            "doc_count": 4567,
                            "statsAgg": {
                                "count": 4567,
                                "min": 10.0,
                                "max": 19.0,
                                "avg": 15.544345993923,
                                "sum": 331846.0
                            }
                        },
                        [...]
                    }
                },
                "extStatsAgg": {
                    "count": 10000,
                    "min": 0.0,
                    "max": 104.0,
                    "avg": 16.855123857,
                    "sum": 399079395E10,
                    "sum_of_squares": 3.734838645273888E15,
                    "variance": 1.2690056384124432E9,
                    "std_deviation": 35.10540102369,
                    "std_deviation_bounds": {
                        "upper": 87.06592590438,
                        "lower": -54.35567819038
                    }
                }
            },
            [...]
            ]
        }
    }
}

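If you look at the result for the group bucket with key "A", you will see that we now have a histogram with the average and count of each sub-group. You will also see that the extStatsAgg aggregation (a sibling of the histogram aggregation) reports the std_deviation_bounds for each group bucket (type "A", type "B", ...).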

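As you may have noticed, this does not provide the solution I was looking for. I still need to do some calculations in my own code. Pseudocode example: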

for bucket in buckets_groupAggregation

    Long totalCount = 0
    Double accumWeightedAverage = 0.0

    ExtendedStats extendedStats = bucket.extendedStatsAggregation
    Double upperLimit = extendedStats.std_deviation_bounds.upper
    Double lowerLimit = extendedStats.std_deviation_bounds.lower

    Histogram histogram = bucket.histogramAggregation
    for group in histogram

        Stats stats = group.statsAggregation

        // keep only the histogram buckets whose key lies inside the bounds
        if group.key > lowerLimit && group.key < upperLimit
            totalCount += stats.count
            accumWeightedAverage += stats.count * stats.average

    // weighted average of the kept buckets (undefined if none were kept)
    Double average = accumWeightedAverage / totalCount
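
For readers who want something runnable, the same post-processing might look like this in Python, operating on one parsed bucket of the "group" aggregation. This is a sketch, not the code I actually run: `trimmed_average` is a hypothetical name, and the sample bucket below is made-up data that only follows the shape of the response shown earlier. It also skips empty histogram buckets, which I would expect to appear because of "min_doc_count": 0 and to carry a null avg.

```python
def trimmed_average(group_bucket):
    """Weighted average of the histogram buckets whose key lies inside the bounds."""
    bounds = group_bucket["extStatsAgg"]["std_deviation_bounds"]
    lower, upper = bounds["lower"], bounds["upper"]

    total_count = 0
    weighted_sum = 0.0
    # "keyed": true makes the histogram buckets a dict rather than a list.
    for hist_bucket in group_bucket["histogramAgg"]["buckets"].values():
        stats = hist_bucket["statsAgg"]
        if stats["count"] == 0:
            continue  # empty bucket, nothing to add
        if lower < hist_bucket["key"] < upper:
            total_count += stats["count"]
            weighted_sum += stats["count"] * stats["avg"]

    return weighted_sum / total_count if total_count else None


# Made-up bucket, for illustration only.
sample_bucket = {
    "key": "A",
    "histogramAgg": {
        "buckets": {
            "0.0": {"key": 0.0, "doc_count": 2, "statsAgg": {"count": 2, "avg": 4.0}},
            "10.0": {"key": 10.0, "doc_count": 3, "statsAgg": {"count": 3, "avg": 12.0}},
            "20.0": {"key": 20.0, "doc_count": 0, "statsAgg": {"count": 0, "avg": None}},
            "90.0": {"key": 90.0, "doc_count": 1, "statsAgg": {"count": 1, "avg": 95.0}},
        }
    },
    "extStatsAgg": {"std_deviation_bounds": {"lower": -50.0, "upper": 80.0}},
}

print(trimmed_average(sample_bucket))  # the bucket at key 90.0 is excluded
```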

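Note: the size of the histogram interval determines the "accuracy" of the final average. A finer interval gives a more accurate result, at the cost of a longer aggregation time.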
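
The effect of the interval can be reproduced locally without a cluster. The sketch below buckets a small made-up list of values the way a histogram with "offset": 0 would (key = floor(value / interval) * interval), applies the same "key inside the bounds" rule as the pseudocode, and compares the result with the exact trimmed mean. All names and the data are hypothetical.

```python
import math
from collections import defaultdict
from statistics import fmean, pstdev


def bounds(values, sigma=2.0):
    """Lower and upper limits, assuming a population standard deviation."""
    avg, std = fmean(values), pstdev(values)
    return avg - sigma * std, avg + sigma * std


def exact_trimmed_mean(values, sigma=2.0):
    lower, upper = bounds(values, sigma)
    return fmean([v for v in values if lower < v < upper])


def histogram_trimmed_mean(values, interval, sigma=2.0):
    lower, upper = bounds(values, sigma)
    # Emulate histogram bucketing: each value lands in the bucket keyed by its lower edge.
    buckets = defaultdict(list)
    for v in values:
        buckets[math.floor(v / interval) * interval].append(v)
    # Keep whole buckets based on their key only, as in the pseudocode.
    kept = [v for key, vs in buckets.items() if lower < key < upper for v in vs]
    return fmean(kept)


values = list(range(10, 20)) + [58]  # ten ordinary values and one outlier

print(exact_trimmed_mean(values))          # 14.5, the outlier is dropped
print(histogram_trimmed_mean(values, 1))   # 14.5, fine buckets match the exact result
print(histogram_trimmed_mean(values, 30))  # about 18.45, the outlier slips back in
```

With an interval of 30 the outlier 58 falls into the bucket keyed 30, which lies inside the bounds, so the whole bucket is kept and the average drifts upward.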

Hope this helps someone else.
