Rank over partition from PostgreSQL in Elasticsearch
We are facing the problem of migrating a large dataset from Postgres (backup or otherwise) to Elasticsearch.
We have a schema similar to this:
+---------------+--------------+------------+-----------+
| user_id | created_at | latitude | longitude |
+---------------+--------------+------------+-----------+
| 5 | 23.1.2015 | 12.49 | 20.39 |
+---------------+--------------+------------+-----------+
| 2 | 23.1.2015 | 12.42 | 20.32 |
+---------------+--------------+------------+-----------+
| 2 | 24.1.2015 | 12.41 | 20.31 |
+---------------+--------------+------------+-----------+
| 5 | 25.1.2015 | 12.45 | 20.32 |
+---------------+--------------+------------+-----------+
| 1 | 23.1.2015 | 12.43 | 20.34 |
+---------------+--------------+------------+-----------+
| 1 | 24.1.2015 | 12.42 | 20.31 |
+---------------+--------------+------------+-----------+
We are able to find the newest location by created_at thanks to the rank function in SQL:
... WITH locations AS (
    SELECT user_id, latitude, longitude,
           rank() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS r
    FROM locations)
SELECT user_id, latitude, longitude FROM locations WHERE r = 1
The result is only the most recently created location for each user:
+---------------+--------------+------------+-----------+
| user_id | created_at | latitude | longitude |
+---------------+--------------+------------+-----------+
| 2 | 24.1.2015 | 12.41 | 20.31 |
+---------------+--------------+------------+-----------+
| 5 | 25.1.2015 | 12.45 | 20.32 |
+---------------+--------------+------------+-----------+
| 1 | 24.1.2015 | 12.42 | 20.31 |
+---------------+--------------+------------+-----------+
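To make the intended result concrete, here is a minimal Python sketch that reproduces the "newest row per user_id" selection on the sample rows above. The names ROWS and newest_per_user are hypothetical and used only for illustration. Unlike rank(), which returns every row tied for first place, this sketch keeps only the first row it sees when two rows share the same created_at.

```python
from datetime import date

# Sample rows from the table above: (user_id, created_at, latitude, longitude).
ROWS = [
    (5, date(2015, 1, 23), 12.49, 20.39),
    (2, date(2015, 1, 23), 12.42, 20.32),
    (2, date(2015, 1, 24), 12.41, 20.31),
    (5, date(2015, 1, 25), 12.45, 20.32),
    (1, date(2015, 1, 23), 12.43, 20.34),
    (1, date(2015, 1, 24), 12.42, 20.31),
]


def newest_per_user(rows):
    """Return {user_id: row}, keeping the row with the greatest created_at."""
    newest = {}
    for row in rows:
        user_id, created_at = row[0], row[1]
        # Replace the stored row only when this one is strictly newer.
        if user_id not in newest or created_at > newest[user_id][1]:
            newest[user_id] = row
    return newest


LATEST = newest_per_user(ROWS)
```

Any Elasticsearch query that answers the question should return these same three rows for the sample data.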
After importing the data into Elasticsearch, our document model looks like this:
{
    "location" : { "lat" : 12.45, "lon" : 46.84 },
    "user_id" : 5,
    "created_at" : "2015-01-24T07:55:20.606+00:00"
}
etc...
I am looking for an equivalent of this SQL query as an Elasticsearch query. I think it must be possible, but I have not yet found out how.
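It's simple: if you want the newest record for a given id, you only need to require that no newer record with the same id exists. (This assumes that, for a given id, no two records share the same created_at value.)
SELECT * FROM locations ll
WHERE NOT EXISTS (
    SELECT * FROM locations nx
    WHERE nx.user_id = ll.user_id
    AND nx.created_at > ll.created_at
);
EDITED (it seems the OP wants the newest observation, not the oldest)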
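The NOT EXISTS anti-join can be tried without a Postgres server. The sketch below runs the same query against an in-memory SQLite database from the Python standard library. It assumes SQLite as a stand-in for Postgres and stores created_at as ISO date strings so that string comparison matches date order; the function name is hypothetical.

```python
import sqlite3


def newest_rows_via_not_exists():
    """Load the sample rows into SQLite and keep rows with no newer sibling."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE locations ("
        "user_id INTEGER, created_at TEXT, latitude REAL, longitude REAL)"
    )
    conn.executemany(
        "INSERT INTO locations VALUES (?, ?, ?, ?)",
        [
            (5, "2015-01-23", 12.49, 20.39),
            (2, "2015-01-23", 12.42, 20.32),
            (2, "2015-01-24", 12.41, 20.31),
            (5, "2015-01-25", 12.45, 20.32),
            (1, "2015-01-23", 12.43, 20.34),
            (1, "2015-01-24", 12.42, 20.31),
        ],
    )
    rows = conn.execute(
        """
        SELECT user_id, created_at, latitude, longitude
        FROM locations ll
        WHERE NOT EXISTS (
            SELECT 1 FROM locations nx
            WHERE nx.user_id = ll.user_id
              AND nx.created_at > ll.created_at
        )
        ORDER BY user_id
        """
    ).fetchall()
    conn.close()
    return rows


NEWEST_ROWS = newest_rows_via_not_exists()
```

If two rows for the same user shared the newest created_at, both would be returned, which is the same behaviour as rank() = 1.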
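Use top_hits. Sort by created_at in descending order so that the single hit kept per user is the newest one:
{
    "aggs": {
        "user_id": {
            "terms": {"field": "user_id"},
            "aggs": {
                "top_location": {
                    "top_hits": {
                        "size": 1,
                        "sort": { "created_at": "desc" },
                        "_source": []
                    }
                }
            }
        }
    }
}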
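As a rough illustration of how this aggregation might be used from client code, the Python sketch below builds the request body as a plain dict and pulls the newest document per user out of a response. The function names and SAMPLE_RESPONSE are hypothetical; the sample response is hand-written to mirror the usual shape of a terms plus top_hits result, not captured from a real cluster. Note that a terms aggregation returns only 10 buckets by default, so the bucket size is exposed here as a parameter.

```python
def build_top_hits_query(terms_size=10):
    """Request body: one bucket per user_id, newest document in each bucket."""
    return {
        "size": 0,  # skip the regular hits, only the aggregation is needed
        "aggs": {
            "user_id": {
                "terms": {"field": "user_id", "size": terms_size},
                "aggs": {
                    "top_location": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"created_at": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


def latest_by_user(response):
    """Map each bucket key (user_id) to the _source of its top hit."""
    latest = {}
    for bucket in response["aggregations"]["user_id"]["buckets"]:
        hits = bucket["top_location"]["hits"]["hits"]
        if hits:
            latest[bucket["key"]] = hits[0]["_source"]
    return latest


# Hand-written response fragment for illustration only.
SAMPLE_RESPONSE = {
    "aggregations": {
        "user_id": {
            "buckets": [
                {
                    "key": 5,
                    "doc_count": 2,
                    "top_location": {
                        "hits": {
                            "hits": [
                                {
                                    "_source": {
                                        "user_id": 5,
                                        "created_at": "2015-01-25T00:00:00+00:00",
                                        "location": {"lat": 12.45, "lon": 20.32},
                                    }
                                }
                            ]
                        }
                    },
                }
            ]
        }
    }
}

QUERY_BODY = build_top_hits_query(terms_size=1000)
LATEST_FROM_AGG = latest_by_user(SAMPLE_RESPONSE)
```

The dict returned by build_top_hits_query can be serialized with json.dumps and sent as the body of a search request.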
You can achieve this by combining field collapsing with inner_hits.
{
    "collapse": {
        "field": "user_id",
        "inner_hits": {
            "name": "order by created_at",
            "size": 1,
            "sort": [
                {
                    "created_at": "desc"
                }
            ]
        }
    }
}
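A possible variation, sketched in Python below, moves the sort to the top level of the request. Collapsing keeps only the top-sorted document per collapse key, so with a descending sort on created_at each returned hit should already be the newest document for its user, and inner_hits becomes optional. The function names and the hand-written SAMPLE_HITS response are hypothetical, and the sketch assumes user_id is also present in _source.

```python
def build_collapse_query():
    """Request body: newest document first, one hit per user_id."""
    return {
        "query": {"match_all": {}},
        "collapse": {"field": "user_id"},
        "sort": [{"created_at": {"order": "desc"}}],
    }


def newest_from_collapsed(response):
    """Map user_id to _source for every collapsed hit in the response."""
    return {
        hit["_source"]["user_id"]: hit["_source"]
        for hit in response["hits"]["hits"]
    }


# Hand-written response fragment for illustration only.
SAMPLE_HITS = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "user_id": 5,
                    "created_at": "2015-01-25T00:00:00+00:00",
                    "location": {"lat": 12.45, "lon": 20.32},
                }
            },
            {
                "_source": {
                    "user_id": 2,
                    "created_at": "2015-01-24T00:00:00+00:00",
                    "location": {"lat": 12.41, "lon": 20.31},
                }
            },
        ]
    }
}

COLLAPSE_BODY = build_collapse_query()
NEWEST_COLLAPSED = newest_from_collapsed(SAMPLE_HITS)
```

Because collapsed hits are ordinary search hits, they can be paginated with from and size, which the top_hits aggregation does not offer directly.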
Detailed article: https://blog.francium.tech/sql-window-function-partition-by-in-elasticsearch-c2e3941495b6