MongoDB collection to pandas DataFrame

My MongoDB documents have the following structure, and some of the factors are NaN:

_id: ObjectId("5feddb959297bb2625db1450")
factors: Array
   0: Object
     factorId: "C24"
     Index: 0
     weight: 1
   1: Object
     factorId: "C25"
     Index: 1
     weight: 1
   2: Object
     factorId: "C26"
     Index: 2
     weight: 1
name: "Growth Led Momentum"

I want to convert it to a pandas DataFrame using pymongo and pandas, as shown below.

| name                | factorId | Index | weight |
|---------------------|----------|-------|--------|
| Growth Led Momentum | C24      | 0     | 0      |
| Growth Led Momentum | C25      | 1     | 0      |
| Growth Led Momentum | C26      | 2     | 0      |

Thanks

Update

I broke out the ol' Python to give this a crack, and the following code works flawlessly!

from pymongo import MongoClient
import pandas as pd

uri = "mongodb://<your_mongo_uri>:27017"
database_name = "<your_database_name>"
collection_name = "<your_collection_name>"

mongo_client = MongoClient(uri)
database = mongo_client[database_name]
collection = database[collection_name]

# I used this code to insert a doc into a test collection
# before querying (just in case you wanted to know lol)
"""
data = {
    "_id": 1,
    "name": "Growth Lead Momentum",
    "factors": [
        {
            "factorId": "C24",
            "index": 0,
            "weight": 1
        },
        {
            "factorId": "D74",
            "index": 7,
            "weight": 9
        }
    ]
}

insert_result = collection.insert_one(data)
print(insert_result)
"""

# This is the query that
# answers your question

results = collection.aggregate([
  {
    "$unwind": "$factors"
  },
  {
    "$project": {
      "_id": 1, # Change to 0 if you wish to ignore "_id" field.
      "name": 1,
      "factorId": "$factors.factorId",
      "index": "$factors.index",
      "weight": "$factors.weight"
    }
  }
])

# This is how we turn the results into a DataFrame.
# We can simply pass `list(results)` into `DataFrame(..)`,
# due to how our query works.

results_as_dataframe = pd.DataFrame(list(results))
print(results_as_dataframe)

Output:

   _id                  name factorId  index  weight
0    1  Growth Lead Momentum      C24      0       1
1    1  Growth Lead Momentum      D74      7       9
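
The question mentions that some factors are NaN, and by default `$unwind` drops documents whose array is null, missing, or empty. Below is a minimal sketch of one way to keep those documents, using the `preserveNullAndEmptyArrays` option of `$unwind`. The helper names `build_factor_pipeline` and `factors_to_dataframe` are hypothetical, not part of pymongo. Field paths are case-sensitive, so if your documents store `Index` (as in the question) rather than `index` (as in my test document), the projection would need adjusting.

```python
import pandas as pd


def build_factor_pipeline(include_id=False, keep_docs_without_factors=True):
    """Build an aggregation pipeline that emits one result per factor.

    Hypothetical helper for illustration; not part of pymongo.
    """
    return [
        {
            "$unwind": {
                "path": "$factors",
                # Keep documents whose "factors" is null, missing, or empty
                # instead of dropping them from the results.
                "preserveNullAndEmptyArrays": keep_docs_without_factors,
            }
        },
        {
            "$project": {
                "_id": 1 if include_id else 0,
                "name": 1,
                "factorId": "$factors.factorId",
                # Case-sensitive: use "$factors.Index" if that is the stored name.
                "index": "$factors.index",
                "weight": "$factors.weight",
            }
        },
    ]


def factors_to_dataframe(collection, **pipeline_options):
    """Run the pipeline on a pymongo collection and return a DataFrame."""
    rows = list(collection.aggregate(build_factor_pipeline(**pipeline_options)))
    # Fields missing from a row (e.g. a document without factors) become NaN.
    return pd.DataFrame(rows)
```

Calling `factors_to_dataframe(collection)` with the `collection` object from the code above should give one row per factor, plus one row with NaN factor columns for each document that has no factors.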

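Original answer

You can use an aggregation pipeline to unwind factors and then project the fields you want.

Something like this should do the trick.

Live demo here.

Database structure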

[
  {
    "_id": 1,
    "name": "Growth Lead Momentum",
    "factors": [
      {
        factorId: "C24",
        index: 0,
        weight: 1
      },
      {
        factorId: "D74",
        index: 7,
        weight: 9
      }
    ]
  }
]

Query

db.collection.aggregate([
  {
    $unwind: "$factors"
  },
  {
    $project: {
      _id: 1,
      name: 1,
      factorId: "$factors.factorId",
      index: "$factors.index",
      weight: "$factors.weight"
    }
  }
])

Results

(.csv friendly)

[
  {
    "_id": 1,
    "factorId": "C24",
    "index": 0,
    "name": "Growth Lead Momentum",
    "weight": 1
  },
  {
    "_id": 1,
    "factorId": "D74",
    "index": 7,
    "name": "Growth Lead Momentum",
    "weight": 9
  }
]
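
Since the results are flat, they map directly onto CSV rows. A small sketch, assuming the two result documents listed above have been loaded into a Python list (the variable names `rows` and `csv_text` are just for illustration):

```python
import pandas as pd

# Rows copied from the "Results" listing above.
rows = [
    {"_id": 1, "factorId": "C24", "index": 0, "name": "Growth Lead Momentum", "weight": 1},
    {"_id": 1, "factorId": "D74", "index": 7, "name": "Growth Lead Momentum", "weight": 9},
]

df = pd.DataFrame(rows)

# With no path argument, to_csv returns the CSV text as a string;
# index=False leaves out the DataFrame's row labels.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument to `to_csv` writes the same content to disk instead of returning it.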

Great answer by Matt. If you want to use pandas instead:

Use this after retrieving the documents from the db:
df = pd.json_normalize(data)
df = df['factors'].explode().apply(lambda x: [val for _, val in x.items()]).explode().apply(pd.Series).join(df).drop(columns=['factors'])

Output:

  factorId  Index  weight                 name
0      C24      0       1  Growth Led Momentum
0      C25      1       1  Growth Led Momentum
0      C26      2       1  Growth Led Momentum
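
The one-liner above depends on the exact shape of `data`, which is not shown. If `factors` is a plain array of subdocuments, as pymongo would return the document from the question, a possibly simpler route is `pd.json_normalize` with `record_path` and `meta`. This is a sketch under that assumption. The second document is hypothetical and stands in for a document whose factors are missing; such documents are filtered out before normalizing.

```python
import pandas as pd

# Documents shaped like the one in the question, as pymongo would return them.
# The second document is hypothetical: it stands in for missing/NaN factors.
docs = [
    {
        "name": "Growth Led Momentum",
        "factors": [
            {"factorId": "C24", "Index": 0, "weight": 1},
            {"factorId": "C25", "Index": 1, "weight": 1},
            {"factorId": "C26", "Index": 2, "weight": 1},
        ],
    },
    {"name": "No Factors Yet", "factors": None},
]

# Keep only documents whose "factors" is actually a list of subdocuments.
with_factors = [d for d in docs if isinstance(d.get("factors"), list)]

# One row per entry in "factors"; "name" is repeated on each row as metadata.
df = pd.json_normalize(with_factors, record_path="factors", meta=["name"])
print(df)
```

With a live collection, `docs` could be `list(collection.find({}, {"_id": 0}))`. Unlike the `explode` approach, this gives each row its own index label rather than repeating the parent document's label.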