How to find duplicate records based on an id and a datetime field in MongoDB?

Question

I have a MongoDB collection with millions of records. Sample records are shown below:

[
  {
    _id: ObjectId("609977b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:12:02Z"
  },
  {
    _id: ObjectId("739177b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:15:02Z"
  },
  {
    _id: ObjectId("805577b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:18:02Z"
  }
]

Multiple records can have the same activityId. In that case I want only the record with the largest updateDateTime.

I tried the following. It works fine on smaller collections but times out on large ones.

[
  {
    $lookup: {
      from: "MY_TABLE",
      let: {
        existing_date: "$updateDateTime",
        existing_sensorActivityId: "$activityId"
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$activityId", "$$existing_sensorActivityId"] },
                { $gt: ["$updateDateTime", "$$existing_date"] }
              ]
            }
          }
        }
      ],
      as: "matched_records"
    }
  },
  { $match: { "matched_records.0": { $exists: true } } },
  { $project: { _id: 1 } }
]

This gives me the _ids of all records that share an activityId with another record but have a smaller updateDateTime.

The slowness occurs at this step -> "matched_records.0": {$exists:true}

Is there a way to speed up this step, or is there any other approach to solving this problem?

Answer 1

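Instead of finding the duplicate documents and deleting them, you can find the unique documents and write the result to a new collection using $out.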

How do you find the unique documents?

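$sort by updateDateTime in descending order
$group by activityId and take the first root record
$replaceRoot to replace the root with that record
$out to write the query result to a new collection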

[
  { $sort: { updateDateTime: -1 } },
  {
    $group: {
      _id: "$activityId",
      record: { $first: "$$ROOT" }
    }
  },
  { $replaceRoot: { newRoot: "$record" } },
  { $out: "newCollectionName" } // set new collection name
]
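
To make the effect of the stages above concrete, here is a minimal sketch in plain JavaScript that imitates $sort, $group with $first, and $replaceRoot in memory on the three sample documents from the question. It is not MongoDB code and does not reproduce $out. The helper name latestPerActivity is hypothetical, and _id is written as a plain string instead of an ObjectId so the snippet runs on its own. It also assumes, as in the sample, that updateDateTime is stored as ISO-8601 UTC strings in a single format, so comparing them as strings orders them chronologically.

```javascript
// Plain JavaScript, not MongoDB: an in-memory imitation of the pipeline stages.
// latestPerActivity is a hypothetical helper name used only for this sketch.
function latestPerActivity(docs) {
  // $sort: { updateDateTime: -1 }
  // Assumes ISO-8601 UTC strings in one format, which sort chronologically as strings.
  const sorted = [...docs].sort((a, b) => {
    if (a.updateDateTime < b.updateDateTime) return 1;
    if (a.updateDateTime > b.updateDateTime) return -1;
    return 0;
  });

  // $group: { _id: "$activityId", record: { $first: "$$ROOT" } }
  // Keep only the first (newest) document seen for each activityId.
  const firstByActivity = new Map();
  for (const doc of sorted) {
    if (!firstByActivity.has(doc.activityId)) {
      firstByActivity.set(doc.activityId, doc);
    }
  }

  // $replaceRoot: { newRoot: "$record" }
  // Return the kept documents themselves rather than { _id, record } wrappers.
  return [...firstByActivity.values()];
}

// The three sample documents from the question (_id shown as plain strings).
const sample = [
  { _id: "609977b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:12:02Z" },
  { _id: "739177b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:15:02Z" },
  { _id: "805577b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:18:02Z" }
];

const result = latestPerActivity(sample);
console.log(result);
// Only the document with updateDateTime "2021-03-24T20:18:02Z" remains.
```

On a collection with millions of documents, the real $sort stage will probably need either a supporting index or the allowDiskUse: true option on the aggregate call, because an in-memory sort in MongoDB is limited to 100 MB. Whether an index helps in a particular deployment is worth checking with explain.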



group-by

mongodb

nosql

aggregation-framework