How to aggregate through nested dictionaries, sum their values, and rank them accordingly?

Question

I have a MongoDB database containing document-level word frequencies, as shown below. There are about 175k documents in the same format, about 2.5GB in total.

{
    "_id": xxx,
    "title": "zzz",
    "vectors": {
        "word1": 28,
        "word2": 22,
        "word3": 12,
        "word4": 7,
        "word5": 4
    }
}

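Now I want to go through all the documents, sum the frequencies of each word across them, and get an overall ranking of the words by total frequency, in the same shape as the vectors field: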

{
    "vectors": {
        "word1": 223458,
        "word2": 98562,
        "word3": 76433,
        "word4": 4570,
        "word5": 2599
    }
}

$unwind does not seem to work here, because I have a nested dictionary. I am new to MongoDB and could not find a specific answer. Any ideas?

Answer 1

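You have to use $objectToArray to turn the sub-object's keys into values, and then $unwind the newly converted array ($unwind only works on array fields, which is why it did not work for you).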

Finally, group by $vectors.k, which is where each key of the sub-object now lives as a value:

db.collection.aggregate([
  {
    "$project": {
      "vectors": {
        "$objectToArray": "$vectors"
      }
    },
  },
  {
    "$unwind": "$vectors"
  },
  {
    "$group": {
      "_id": "$vectors.k",
      "count": {
        "$sum": "$vectors.v"
      },
    },
  },
])
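
To make the three stages easier to follow, here is a minimal pure-Python sketch of what they compute. It does not talk to MongoDB at all; the sample documents and the helper name object_to_array are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical sample documents in the same shape as the question's data.
docs = [
    {"_id": 1, "title": "a", "vectors": {"word1": 28, "word2": 22}},
    {"_id": 2, "title": "b", "vectors": {"word1": 2, "word3": 12}},
]


def object_to_array(obj):
    """Mimic $objectToArray: {"word1": 28} -> [{"k": "word1", "v": 28}]."""
    return [{"k": key, "v": value} for key, value in obj.items()]


totals = Counter()
for doc in docs:
    pairs = object_to_array(doc["vectors"])  # $project with $objectToArray
    for pair in pairs:                       # $unwind: one record per k/v pair
        totals[pair["k"]] += pair["v"]       # $group on k, $sum of v

# Rank the words by total frequency, highest first.
ranked = totals.most_common()
print(ranked)  # [('word1', 30), ('word2', 22), ('word3', 12)]
```

On a 2.5GB collection the server-side pipeline is the better fit, since it avoids pulling every document into the client; the sketch only mirrors its logic.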

Mongo Playground Sample Execution
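
Since the question is tagged pymongo and asks for a ranking, the same pipeline could be run from Python roughly as sketched below. The $sort stage is an addition that is not part of the pipeline above, and the function names, connection URI, database name, and collection name are all placeholders.

```python
def build_pipeline():
    """Return the aggregation stages as plain Python dicts."""
    return [
        # Turn the vectors sub-object into an array of {"k": ..., "v": ...}.
        {"$project": {"vectors": {"$objectToArray": "$vectors"}}},
        # Emit one document per array element.
        {"$unwind": "$vectors"},
        # Sum the frequencies per word.
        {"$group": {"_id": "$vectors.k", "count": {"$sum": "$vectors.v"}}},
        # Added for the ranking: highest total first.
        {"$sort": {"count": -1}},
    ]


def rank_words(collection):
    """Run the pipeline on a pymongo collection and return the ranked rows."""
    # allowDiskUse is optional; it lets stages spill to disk if they
    # exceed the server's in-memory limit.
    return list(collection.aggregate(build_pipeline(), allowDiskUse=True))


# Usage (assumed URI and names, requires a running MongoDB server):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# for row in rank_words(client["mydb"]["mycollection"]):
#     print(row["_id"], row["count"])
```

Each returned row has the word in _id and its total in count. If the single-document shape from the question is needed, the rows can be folded into a dict on the Python side.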

python

mongodb

pymongo

aggregation-framework