MongoDB Pivot / Crosstab / Transpose with arbitrarily named and numbered columns
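I have a Mongo database that contains an imported flat-file CSV. In SQL, this file would undoubtedly be normalized: it contains one row per period, and the periods carry repeated information. I've created a query that uses the 'push' operator to aggregate (part of) the repeated information into a single sub-object within the row. This mimics normalization. What I'd like to do is restructure the output object so that the keys and values of the sub-object dictionary appear at the top level. In SQL this is called a pivot table query or crosstab query. In Excel it's called transpose. Regardless of the name, what I'm looking for is the ability to take key/value pairs and use them as 'columns' in Mongo.
Since Mongo and other NoSQL databases are designed for denormalized implementations, I'm surprised this is so hard.
I'm trying to take the following JSON object in Mongo: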
[{ "_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
"Sub_data": [{"sub_id" : 1}, {"sub_value": 2}] },
{ "_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
"Sub_data": [{"sub_id" : 2}, {"sub_value": 5}] },
{ "_id": {"Date": "1/2/2018", "Type": "Green", "client_id": 1},
"Sub_data": [{"sub_id" : 2}, {"sub_value": 4}] },
{ "_id": {"Date": "1/1/2018", "Type": "Orange", "client_id": 1},
"Sub_data": [{"sub_id" : 6}, {"sub_value": 7}] }]
and get the following result:
[{ "_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
"1" : 2, "2":5},
{ "_id": {"Date": "1/2/2018", "Type": "Green", "client_id": 1},
"2" : 4},
{ "_id": {"Date": "1/1/2018", "Type": "Orange", "client_id": 1},
"6" : 7}]
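Note that I want this result to have an arbitrary number of columns. I've looked at a few solutions that SEEM to solve the problem (Array to object, AddFields, Something like a pivot using static columns) and I've read multiple versions of this 'do it afterwards' code. Is post-processing the only way?
Note: This is an attempt to mimic the functionality of SQL Server (and Excel, etc.) described in this Stack Overflow question and this TechNet article.
Putting it together, my entire pipeline using the second option from the first answer looks like this: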
db.rate_cards.aggregate(
{
"$group": {
"_id": {
"date": "$date",
"start_date": "$start_date",
"end_date": "$end_date"
},
"code_data": {
"$push": {
"code_str": {"$substr" : ["$code",0,-1]},
"cpm": "$cpm"
}
}
}
},
{
"$group":{
"_id":"$_id",
"data":{
"$mergeObjects":{
"$arrayToObject":[[
{
"k":{"$let":{"vars":{"sub_id_elem":{"$arrayElemAt":["$code_data",0]}},"in":"$$sub_id_elem.code_str"}},
"v":{"$let":{"vars":{"sub_value_elem":{"$arrayElemAt":["$code_data",1]}},"in":"$$sub_value_elem.cpm"}}
}
]]
}
}
}
},
{"$replaceRoot":{"newRoot":{"$mergeObjects":["$_id",{"$arrayToObject":"$data"}]}}}
)
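Note that this is somewhat more complicated, and more performance-intensive, than I had hoped. It appears to declare a local variable, use an in-clause, and so on. When I try to run the (working) implementations of both answers, NoSQLBooster chokes while trying to expand row 600.
Below is a slightly edited version of the original dataset. Note that some extra fields not used in the original query have been omitted: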
{
"_id" : ObjectId("5a578d5c57d33b197004beed"),
"date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
"dp" : "M-Su 12m-6a",
"dsc" : "Daypart",
"net" : "val1",
"place" : "loc1",
"code" : 12,
"cost" : 16.8
},
{
"_id" : ObjectId("5a578d5c57d33b197004beee"),
"date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
"dp" : "M-Su 12m-6a",
"dsc" : "Daypart",
"net" : "val1",
"place" : "loc3",
"code" : 24,
"cost" : 55.6
},
{
"_id" : ObjectId("5a578d5c57d33b197004beef"),
"date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
"dp" : "M-Su 12n-6p",
"dsc" : "Daypart",
"net" : "val2",
"place" : "loc2",
"code" : 23,
"cost" : 65.5
},
{
"_id" : ObjectId("5a578d5c57d33b197004bef0"),
"date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"start_date" : ISODate("2017-09-25T03:00:00.000+03:00"),
"end_date" : ISODate("2017-10-01T03:00:00.000+03:00"),
"dp" : "M-Su 6p-12m",
"dsc" : "Daypart",
"net" : "val2",
"place" : "loc2",
"code" : 23,
"cost" : 101
}
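OK, based on the information provided in the post and the comments, I created the following dataset.
Note: I made a few changes, which are also noted in the comments.
Changed _id to my_id in the database, because the _id field name is reserved and uniquely indexed.
Changed "sub_id" to store its values as strings, because $arrayToObject requires string keys.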
db.test.insert(
[
{ "my_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
"Sub_data": [{"sub_id" : "1"}, {"sub_value": 2}] },
{ "my_id": {"Date": "1/1/2018", "Type": "Green", "client_id": 1},
"Sub_data": [{"sub_id" : "2"}, {"sub_value": 5}] },
{ "my_id": {"Date": "1/2/2018", "Type": "Green", "client_id": 1},
"Sub_data": [{"sub_id" : "2"}, {"sub_value": 4}] },
{ "my_id": {"Date": "1/1/2018", "Type": "Orange", "client_id": 1},
"Sub_data": [{"sub_id" : "6"}, {"sub_value": 7}] }
])
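You need to use $group and $arrayToObject to output the expected format.
$group with $push collects all the values in Sub_data, mapping the first element to the key (k) and the second element to the value (v); $arrayToObject then turns those key/value pairs into an object.
$mergeObjects merges _id with the rest of the values, and $replaceRoot promotes the merged document to the top level.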
db.test.aggregate([
{"$group":{
"_id":"$my_id",
"data":{
"$push":{
"k":{"$let":{"vars":{"sub_id_elem":{"$arrayElemAt":["$Sub_data",0]}},"in":"$$sub_id_elem.sub_id"}},
"v":{"$let":{"vars":{"sub_value_elem":{"$arrayElemAt":["$Sub_data",1]}},"in":"$$sub_value_elem.sub_value"}}
}
}
}},
{"$replaceRoot":{"newRoot":{"$mergeObjects":["$_id",{"$arrayToObject":"$data"}]}}}
])
Output:
{"Date":"1/1/2018", "Type":"Orange", "client_id":1, "6":7}
{"Date":"1/2/2018", "Type":"Green", "client_id":1, "2":4}
{"Date":"1/1/2018", "Type":"Green", "client_id":1, "1":2, "2":5}
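To make the roles of the stages above more concrete, here is a rough plain-JavaScript imitation of the same logic, run outside MongoDB on the test documents. It is only a sketch: the helper names arrayToObject and pivot are made up for illustration and are not MongoDB APIs, and grouping by JSON.stringify(my_id) assumes the key fields always appear in the same order. It is also roughly what the 'do it afterwards' post-processing mentioned in the question would look like on the client.

```javascript
// Plain-JavaScript imitation of the pipeline stages (hypothetical helper names).
const docs = [
  { my_id: { Date: "1/1/2018", Type: "Green", client_id: 1 }, Sub_data: [{ sub_id: "1" }, { sub_value: 2 }] },
  { my_id: { Date: "1/1/2018", Type: "Green", client_id: 1 }, Sub_data: [{ sub_id: "2" }, { sub_value: 5 }] },
  { my_id: { Date: "1/2/2018", Type: "Green", client_id: 1 }, Sub_data: [{ sub_id: "2" }, { sub_value: 4 }] },
  { my_id: { Date: "1/1/2018", Type: "Orange", client_id: 1 }, Sub_data: [{ sub_id: "6" }, { sub_value: 7 }] },
];

// Imitates $arrayToObject for an array of {k, v} pairs.
function arrayToObject(pairs) {
  return Object.fromEntries(pairs.map(({ k, v }) => [k, v]));
}

// Imitates $group + $push, followed by $replaceRoot + $mergeObjects.
function pivot(documents) {
  const groups = new Map();
  for (const doc of documents) {
    // Stand-in for grouping on the compound key (assumes stable field order).
    const groupKey = JSON.stringify(doc.my_id);
    if (!groups.has(groupKey)) {
      groups.set(groupKey, { _id: doc.my_id, data: [] });
    }
    // First array element supplies the key, second supplies the value.
    groups.get(groupKey).data.push({
      k: doc.Sub_data[0].sub_id,
      v: doc.Sub_data[1].sub_value,
    });
  }
  // Merge the grouping fields with the pivoted columns into one flat object.
  return [...groups.values()].map(({ _id, data }) => ({
    ..._id,
    ...arrayToObject(data),
  }));
}

const pivoted = pivot(docs);
console.log(pivoted);
```

One difference from the aggregation output: JavaScript lists integer-like keys such as "1" and "2" before the other properties when printing an object, so the field order on screen may not match what the shell shows.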
或者,您可以使用 $mergeObjects
作为累加器在分组时合并对象。
db.test.aggregate([
{"$group":{
"_id":"$my_id","data":{
"$mergeObjects":{
"$arrayToObject":[[
{
"k":{"$let":{"vars":{"sub_id_elem":{"$arrayElemAt":["$Sub_data",0]}},"in":"$$sub_id_elem.sub_id"}},
"v":{"$let":{"vars":{"sub_value_elem":{"$arrayElemAt":["$Sub_data",1]}},"in":"$$sub_value_elem.sub_value"}}
}
]]
}
}
}},
{"$replaceRoot":{"newRoot":{"$mergeObjects":["$_id","$data"]}}}
])
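For the rate_cards documents shown in the question, the same accumulator idea can probably be applied directly to the raw rows with a single $group, without an intermediate $push or $let. The following is a sketch under assumptions, not a tested pipeline: it uses cost as the value column (the question's pipeline reads cpm, which does not appear in the sample documents), and it reuses the $substr trick from the question to turn the numeric code into a string key. The variable name rateCardPivot is made up for illustration.

```javascript
// Sketch: one $group that pivots `code` -> `cost` per (date, start_date, end_date).
// Assumption: `cost` is the value column; `cpm` is not present in the sample data.
const rateCardPivot = [
  {
    $group: {
      _id: { date: "$date", start_date: "$start_date", end_date: "$end_date" },
      data: {
        $mergeObjects: {
          $arrayToObject: [[
            // $substr converts the numeric code into a string key, as in the question.
            { k: { $substr: ["$code", 0, -1] }, v: "$cost" },
          ]],
        },
      },
    },
  },
  // Lift the grouping fields and the pivoted columns to the top level.
  { $replaceRoot: { newRoot: { $mergeObjects: ["$_id", "$data"] } } },
];

// In the mongo shell (not executed here): db.rate_cards.aggregate(rateCardPivot)
console.log(JSON.stringify(rateCardPivot, null, 2));
```

Be aware that when two rows in the same group share a code (code 23 appears twice in the sample), $mergeObjects keeps only one value for that key, so you may need to add more fields to the grouping key or aggregate the values first.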