SQL-like GROUP BY and HAVING
I want to get the number of groups that satisfy a certain condition. In SQL terms, I want to do the following in Elasticsearch.
SELECT COUNT(*) FROM
(
SELECT
senderResellerId,
SUM(requestAmountValue) AS t_amount
FROM
transactions
GROUP BY
senderResellerId
HAVING
t_amount > 10000 ) AS dum;
So far, I can group by senderResellerId with a terms aggregation. But when I apply a filter, it does not work as expected.
Elasticsearch request
{
"aggregations": {
"reseller_sale_sum": {
"aggs": {
"sales": {
"aggregations": {
"reseller_sale": {
"sum": {
"field": "requestAmountValue"
}
}
},
"filter": {
"range": {
"reseller_sale": {
"gte": 10000
}
}
}
}
},
"terms": {
"field": "senderResellerId",
"order": {
"sales>reseller_sale": "desc"
},
"size": 5
}
}
},
"ext": {},
"query": { "match_all": {} },
"size": 0
}
Actual response
{
"took" : 21,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 150824,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"reseller_sale_sum" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 149609,
"buckets" : [
{
"key" : "RES0000000004",
"doc_count" : 8,
"sales" : {
"doc_count" : 0,
"reseller_sale" : {
"value" : 0.0
}
}
},
{
"key" : "RES0000000005",
"doc_count" : 39,
"sales" : {
"doc_count" : 0,
"reseller_sale" : {
"value" : 0.0
}
}
},
{
"key" : "RES0000000006",
"doc_count" : 57,
"sales" : {
"doc_count" : 0,
"reseller_sale" : {
"value" : 0.0
}
}
},
{
"key" : "RES0000000007",
"doc_count" : 134,
"sales" : {
"doc_count" : 0,
"reseller_sale" : {
"value" : 0.0
}
}
}
]
}
}
}
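As you can see from the response above, it returns resellers, but the reseller_sale aggregation is zero in every bucket.
More details here.
Implementing HAVING-like behavior
The filter aggregation in your request matches documents, not aggregated values. A range on reseller_sale is evaluated against a document field of that name, which does not exist, so no documents match and every sum is 0.
You can use one of the pipeline aggregations instead, namely the bucket selector aggregation. The query would look like this: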
POST my_index/tdrs/_search
{
"aggregations": {
"reseller_sale_sum": {
"aggregations": {
"sales": {
"sum": {
"field": "requestAmountValue"
}
},
"max_sales": {
"bucket_selector": {
"buckets_path": {
"var1": "sales"
},
"script": "params.var1 > 10000"
}
}
},
"terms": {
"field": "senderResellerId",
"order": {
"sales": "desc"
},
"size": 5
}
}
},
"size": 0
}
After putting the following documents in the index:
"hits": [
{
"_index": "my_index",
"_type": "tdrs",
"_id": "AV9Yh5F-dSw48Z0DWDys",
"_score": 1,
"_source": {
"requestAmountValue": 7000,
"senderResellerId": "ID_1"
}
},
{
"_index": "my_index",
"_type": "tdrs",
"_id": "AV9Yh684dSw48Z0DWDyt",
"_score": 1,
"_source": {
"requestAmountValue": 5000,
"senderResellerId": "ID_1"
}
},
{
"_index": "my_index",
"_type": "tdrs",
"_id": "AV9Yh8TBdSw48Z0DWDyu",
"_score": 1,
"_source": {
"requestAmountValue": 1000,
"senderResellerId": "ID_2"
}
}
]
The result of the query will be:
"aggregations": {
"reseller_sale_sum": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ID_1",
"doc_count": 2,
"sales": {
"value": 12000
}
}
]
}
}
That is, only those senderResellerId values whose cumulative sales are > 10000.
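To make the semantics concrete, here is a minimal Python sketch that mimics the terms + sum + bucket_selector chain on the three sample documents above. It is only an in-memory illustration, not how Elasticsearch executes the query; the function name having_sum_greater_than is made up for this example.

```python
from collections import defaultdict

# The three sample documents from the answer (only the _source fields).
docs = [
    {"senderResellerId": "ID_1", "requestAmountValue": 7000},
    {"senderResellerId": "ID_1", "requestAmountValue": 5000},
    {"senderResellerId": "ID_2", "requestAmountValue": 1000},
]


def having_sum_greater_than(documents, threshold):
    """Hypothetical helper: GROUP BY senderResellerId HAVING SUM(...) > threshold."""
    sales = defaultdict(int)
    # Equivalent of the terms aggregation with a sum sub-aggregation.
    for doc in documents:
        sales[doc["senderResellerId"]] += doc["requestAmountValue"]
    # Equivalent of bucket_selector: drop buckets that fail the condition.
    return {key: total for key, total in sales.items() if total > threshold}


selected = having_sum_greater_than(docs, 10000)
print(selected)  # {'ID_1': 12000}
```

The condition runs on the per-bucket sum, after grouping, which is exactly what the filter aggregation in the question could not do.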
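Counting buckets
To implement an equivalent of SELECT COUNT(*) FROM (... HAVING), one can use a combination of the bucket script aggregation with the sum bucket aggregation. Though there seems to be no direct way to count how many buckets the bucket_selector actually selected, we can define a bucket_script that produces 0 or 1 depending on a condition, and a sum_bucket that produces its sum: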
POST my_index/tdrs/_search
{
"aggregations": {
"reseller_sale_sum": {
"aggregations": {
"sales": {
"sum": {
"field": "requestAmountValue"
}
},
"max_sales": {
"bucket_script": {
"buckets_path": {
"var1": "sales"
},
"script": "if (params.var1 > 10000) { 1 } else { 0 }"
}
}
},
"terms": {
"field": "senderResellerId",
"order": {
"sales": "desc"
}
}
},
"max_sales_stats": {
"sum_bucket": {
"buckets_path": "reseller_sale_sum>max_sales"
}
}
},
"size": 0
}
The output will be:
"aggregations": {
"reseller_sale_sum": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
...
]
},
"max_sales_stats": {
"value": 1
}
}
The desired bucket count is located in max_sales_stats.value.
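The 0/1 trick can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the per-reseller sums have already been computed; count_buckets_over is a hypothetical name, not an Elasticsearch API.

```python
def count_buckets_over(sales_by_key, threshold):
    """Hypothetical helper: count buckets whose sum exceeds the threshold."""
    # bucket_script step: each bucket gets 1 if it passes the condition, else 0.
    flags = {key: 1 if total > threshold else 0
             for key, total in sales_by_key.items()}
    # sum_bucket step: adding up the flags gives the number of passing buckets.
    return sum(flags.values())


# Per-bucket sums for the sample documents used in this answer.
sales_by_key = {"ID_1": 12000, "ID_2": 1000}
count = count_buckets_over(sales_by_key, 10000)
print(count)  # 1
```

Unlike bucket_selector, this keeps every bucket in the output and only tags it, which is why the sibling sum_bucket can see all of them.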
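Important considerations
I have to point out two things:
- The feature is experimental (as of ES 5.6 it is still experimental, though it was added in 2.0.0-beta1)
- Pipeline aggregations are applied to the result of previous aggregations:
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.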
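This means that the bucket_selector aggregation will be applied after, and on the result of, the terms aggregation on senderResellerId. For example, if there are more senderResellerId values than the size defined on the terms aggregation, you will not get all the ids in the collection with sum(sales) > 10000, but only those that appear in the output of the terms aggregation. Consider using sorting and/or setting a sufficient size parameter.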
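This also applies to the second case, COUNT() (... HAVING), which will only count those buckets that are actually present in the output of the aggregation.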
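A small Python sketch of the effect, under simplifying assumptions: a single shard, exact sums, and twelve made-up resellers that all pass the threshold. The function count_with_terms_size is hypothetical; the value 10 is the default size of a terms aggregation, which is what the counting query above falls back to because it sets no size.

```python
def count_with_terms_size(sales_by_key, threshold, size):
    """Hypothetical helper: count passing buckets after terms truncation."""
    # terms aggregation ordered by "sales" desc, keeping only `size` buckets.
    top = sorted(sales_by_key.items(), key=lambda kv: kv[1], reverse=True)[:size]
    # The pipeline steps only ever see the buckets that survived truncation.
    return sum(1 for _, total in top if total > threshold)


# Twelve illustrative resellers, each with total sales above the threshold.
sales_by_key = {f"ID_{i}": 20000 for i in range(12)}

truncated = count_with_terms_size(sales_by_key, 10000, size=10)
full = count_with_terms_size(sales_by_key, 10000, size=len(sales_by_key))
print(truncated, full)  # 10 12
```

The count is capped at size, so the reported number is a lower bound unless size is at least the number of distinct senderResellerId values.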
In case this query is too heavy or the number of buckets is too big, consider denormalizing your data or storing this sum directly in the document, so you can use a plain range query to achieve your goal.
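As a rough sketch of the denormalized variant, assume one summary document per reseller is maintained at write time. The field name totalAmount and both helper functions are hypothetical, and the range check is simulated in memory rather than sent to Elasticsearch.

```python
def build_reseller_totals(transactions):
    """Hypothetical helper: precompute one summary document per reseller."""
    totals = {}
    for tx in transactions:
        key = tx["senderResellerId"]
        totals[key] = totals.get(key, 0) + tx["requestAmountValue"]
    # `totalAmount` is an assumed field name for the stored sum.
    return [{"senderResellerId": key, "totalAmount": total}
            for key, total in totals.items()]


def range_gt(documents, field, value):
    """In-memory stand-in for a range query with a `gt` bound."""
    return [doc for doc in documents if doc[field] > value]


transactions = [
    {"senderResellerId": "ID_1", "requestAmountValue": 7000},
    {"senderResellerId": "ID_1", "requestAmountValue": 5000},
    {"senderResellerId": "ID_2", "requestAmountValue": 1000},
]

summaries = build_reseller_totals(transactions)
matching = range_gt(summaries, "totalAmount", 10000)
print(len(matching))  # 1
```

With the sum stored per reseller, the count is simply the hit total of the range query and no longer depends on the terms size.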