How to perform a terms aggregation based on the URL domain name using the NEST ElasticClient
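I want to run an aggregation on a URI field but return only the domain part of the URL rather than the full URL. For example, for the value https://whosebug.com/questions/ask?guided=true
I would get whosebug.com
Given an existing dataset like the following: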
"hits" : [
  {
    "_index" : "people",
    "_type" : "_doc",
    "_id" : "L9WewGoBZqCeOmbRIMlV",
    "_score" : 1.0,
    "_source" : {
      "firstName" : "George",
      "lastName" : "Ouma",
      "pageUri" : "http://www.espnfc.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
      "date" : "2019-05-16T12:29:08.1308177Z"
    }
  },
  {
    "_index" : "people",
    "_type" : "_doc",
    "_id" : "MNWewGoBZqCeOmbRIsma",
    "_score" : 1.0,
    "_source" : {
      "firstName" : "George",
      "lastName" : "Ouma",
      "pageUri" : "http://www.wikipedia.org/wiki/Category:Terrorism_in_Mexico",
      "date" : "2019-05-16T12:29:08.1308803Z"
    }
  },
  {
    "_index" : "people",
    "_type" : "_doc",
    "_id" : "2V-ewGoBiHg_1GebJKIr",
    "_score" : 1.0,
    "_source" : {
      "firstName" : "George",
      "lastName" : "Ouma",
      "pageUri" : "http://www.wikipedia.com/story/683732/england-football-team-escaped-terrorist-attack-at-1998-world-cup",
      "date" : "2019-05-16T12:29:08.1308811Z"
    }
  }
]
My buckets should look like this:
"buckets" : [
  {
    "key" : "www.espnfc.com",
    "doc_count" : 1
  },
  {
    "key" : "www.wikipedia.com",
    "doc_count" : 2
  }
]
I have the following snippet that runs the aggregation, but it aggregates on the full URL rather than on the domain name:
var searchResponse = client.Search<Person>(s => s
    .Size(0)
    .Query(q => q
        .MatchAll()
    )
    .Aggregations(a => a
        .Terms("visited_pages", ta => ta
            .Field(f => f.PageUri.Suffix("keyword"))
        )
    )
);
var aggregations = searchResponse.Aggregations.Terms("visited_pages");
Any help would be appreciated :)
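I would suggest splitting that data out into a separate field at ingest time (something like "topleveldomain"); otherwise Elasticsearch has to do a lot of work for every document in order to run the aggregation.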
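As a rough illustration of the ingest-time approach, the host could be computed in the application before the document is indexed. This is only a sketch: `HostExtractor` and `GetHost` are hypothetical names, and it assumes the `Person` type would gain an extra keyword-mapped property (for example `PageHost`) that the terms aggregation then targets directly.

```csharp
using System;

public static class HostExtractor
{
    // Returns the host part of an absolute URL (for example "www.espnfc.com"),
    // or null when the value cannot be parsed as an absolute URI.
    public static string GetHost(string pageUri)
    {
        return Uri.TryCreate(pageUri, UriKind.Absolute, out var uri)
            ? uri.Host
            : null;
    }
}
```

The result of `HostExtractor.GetHost(person.PageUri)` would be stored on the document at index time, so the aggregation becomes a plain terms aggregation on that field with no script involved.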
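I used the terms aggregation with a script shown below.
Note that I came up with the string logic by looking at your data: it keeps everything up to three characters past the last dot, so it assumes the last dot in the URL belongs to a three-letter TLD such as .com or .org, and it leaves the scheme (http://) in the key. Test it and adjust the logic to what you are looking for.
The best approach is to have a separate field, say hostname, that already holds the value you are looking for, and to aggregate on that field.
However, if you are stuck, I think the aggregation below can help!
Aggregation query: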
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "my_unique_urls": {
      "terms": {
        "script" : {
          "inline": """
            String st = doc['pageUri.keyword'].value;
            if(st==null){
              return "";
            } else {
              return st.substring(0, st.lastIndexOf(".")+4);
            }
          """,
          "lang": "painless"
        }
      }
    }
  }
}
Here is the response I got:
Query response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_unique_urls": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "http://www.espnfc.com",
          "doc_count": 1
        },
        {
          "key": "http://www.wikipedia.org",
          "doc_count": 1
        },
        {
          "key": "https://en.wikipedia.org",
          "doc_count": 1
        }
      ]
    }
  }
}
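Since the question asks for NEST specifically, the same scripted terms aggregation might be expressed through the fluent API roughly as follows. Treat this as a sketch rather than a verified implementation: it assumes NEST 6.x or 7.x (where the script descriptor exposes `Source` and `Lang`), a node at http://localhost:9200, and a `Person` class shaped like the documents in the question. The Painless logic is the same string handling as in the query above, written with single-quoted string literals so it fits inside a C# verbatim string.

```csharp
using System;
using Nest;

// Assumed shape of the POCO used in the question.
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string PageUri { get; set; }
    public DateTime Date { get; set; }
}

public static class Program
{
    public static void Main()
    {
        // Assumed connection details; adjust the URI and index name as needed.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DefaultIndex("people");
        var client = new ElasticClient(settings);

        var searchResponse = client.Search<Person>(s => s
            .Size(0)
            .Aggregations(a => a
                .Terms("my_unique_urls", ta => ta
                    // Bucket on a script result instead of a field value.
                    .Script(sc => sc
                        .Source(@"
                            String st = doc['pageUri.keyword'].value;
                            if (st == null) {
                                return '';
                            } else {
                                return st.substring(0, st.lastIndexOf('.') + 4);
                            }")
                        .Lang("painless")
                    )
                )
            )
        );

        // Print each bucket key with its document count.
        foreach (var bucket in searchResponse.Aggregations.Terms("my_unique_urls").Buckets)
        {
            Console.WriteLine($"{bucket.Key}: {bucket.DocCount}");
        }
    }
}
```

As with the raw query, the bucket keys produced this way still include the scheme, so the script would need further adjustment to return only www.espnfc.com.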
Hope this helps!