Matching a field to external array values with Aggregation
So far I have a query that looks like this:
db.variants.aggregate({'$unwind':'$samples'},{$project:{"_id":0,"samples":1,"chr":1,"pos":1,"ref":1,"alt":1}});
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 34, 1 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1|0", "GQ" : 15, "DP" : 8, "HQ" : [ 5, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1/1", "GQ" : 43, "DP" : 5, "HQ" : [ 0, 2 ], "GTC" : 2, "sample_id" : "559de1b2aa43f"}
{ "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 51, 51 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}
{ "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "1|0", "GQ" : 48, "DP" : 8, "HQ" : [ 51, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}
But outside the query I have sample groups defined like this:
SID1=['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9']
SID2=['559de1b2aa43f']
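But I'm stuck when it comes to how to do the actual grouping, because my sample_id arrays are not actually in the documents, so the normal $group operators won't work. I want to create groups called SID1 and SID2 that hold the sum of 'samples.GTC' where samples.sample_id is in the SID1 array (and likewise for SID2). My expected output is: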
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", SID1:1, SID2:2}
{ "chr" : "22", "pos" : 14371, "ref" : "A", "alt" : "G", SID1:1, SID2:0}
I'm guessing it should be something close to this, but obviously not quite:
db.variants.aggregate(
{'$unwind':'$samples'},
{$project:
{"_id":0,"chr":1,"pos":1,"ref":1,"alt":1,"samples":1}
},
{ "$group":{
"_id":{"chr":"$chr","pos":"$pos","ref":"$ref","alt":"$alt"},
"SID1" : {$cond:{if:{"$sample_id":{$in:SID1}},then:{$sum:"$GTC"},else:0}},
"SID2" : {$cond:{if:{"$sample_id":{$in:SID2}},then:{$sum:"$GTC"},else:0}},
}
});
You are on the right track looking at $cond for this, but the syntax is a bit off and you need a few other things to help as well:
var SID1 = ['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

db.variants.aggregate([
    { "$unwind": "$samples" },
    { "$group": {
        "_id": {
            "chr": "$chr",
            "pos": "$pos",
            "ref": "$ref",
            "alt": "$alt"
        },
        "SID1": {
            "$sum": {
                "$cond": [
                    { "$setIsSubset": [
                        { "$map": {
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID1
                    ]},
                    "$samples.GTC",
                    0
                ]
            }
        },
        "SID2": {
            "$sum": {
                "$cond": [
                    { "$setIsSubset": [
                        { "$map": {
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID2
                    ]},
                    "$samples.GTC",
                    0
                ]
            }
        }
    }}
])
Which gives the result:
{
    "_id" : {
        "chr" : "20",
        "pos" : 14371,
        "ref" : "A",
        "alt" : "G"
    },
    "SID1" : 1,
    "SID2" : 0
}
{
    "_id" : {
        "chr" : "22",
        "pos" : 14373,
        "ref" : "C",
        "alt" : "T"
    },
    "SID1" : 1,
    "SID2" : 2
}
So $cond goes "inside" $sum, since $sum is the "accumulator" and an accumulator is what every field under $group other than _id has to be.
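There is nothing wrong with using the variable names directly when defining the pipeline, since the values are just "interpolated" and treated as literals. But of course, since these are "arrays", they need to be compared as arrays. More to the point, they are actually "sets".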
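To make the "interpolation" point concrete, here is a minimal sketch that builds the same pipeline as a plain JavaScript object. The helper name `sumGtcFor` and the `pipeline` variable are made up for this illustration and are not part of MongoDB; the operators inside are the ones from the listing above.

```javascript
// Hypothetical helper (not a MongoDB API): returns the accumulator
// expression used for one group in the $group stage shown earlier.
function sumGtcFor(sampleIds) {
    return {
        "$sum": {
            "$cond": [
                { "$setIsSubset": [
                    { "$map": {
                        "input": { "$literal": ["A"] },
                        "as": "el",
                        "in": "$samples.sample_id"
                    }},
                    sampleIds   // plain JavaScript array, embedded by value
                ]},
                "$samples.GTC",
                0
            ]
        }
    };
}

var SID1 = ['559de1b2aa43f47656b2a3fa', '559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

var pipeline = [
    { "$unwind": "$samples" },
    { "$group": {
        "_id": { "chr": "$chr", "pos": "$pos", "ref": "$ref", "alt": "$alt" },
        "SID1": sumGtcFor(SID1),
        "SID2": sumGtcFor(SID2)
    }}
];

// The serialized pipeline carries the array values, not the variable names.
console.log(JSON.stringify(pipeline[1]["$group"]["SID2"]));
```

Passing `pipeline` to `db.variants.aggregate()` should then behave the same as the hand-written version, assuming the collection from the question.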
The $setIsSubset operator can "logically" compare two "sets" to see whether every element of the first is contained in the second. This provides the logical true/false that $cond needs.
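As a rough illustration only, the comparison can be mimicked in plain JavaScript. The function below is a stand-in written for this answer, not MongoDB code, and it assumes the ids are compared as exact strings.

```javascript
// Plain JavaScript stand-in for the $setIsSubset comparison:
// true when every element of `first` also appears in `second`.
function setIsSubset(first, second) {
    var lookup = new Set(second);
    return first.every(function (el) { return lookup.has(el); });
}

var SID1 = ['559de1b2aa43f47656b2a3fa', '559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

console.log(setIsSubset(['559de1b2aa43f47656b2a3fa'], SID1)); // true
console.log(setIsSubset(['559de1b2aa43f47656b2a3fa'], SID2)); // false
console.log(setIsSubset(['559de1b2aa43f'], SID1));            // false, a prefix is not a match
```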
But the "samples.sample_id" field is not an array. We can simply "make it into one" by using the $map operator, feeding it a $literal array declared with a single element and transposing the field value into it.
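The $map operator does the same thing as the function of the same name in many programming languages: it works on the array supplied as "input". Each array element is bound to the variable declared in "as" and processed by the expression in "in". It returns an array of the same length as the input, with each element replaced by the result of that expression. As another example: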
{ "$map": {
    "input": { "$literal": [1,2,3,4] },     // input array
    "as": "el",                             // variable represents element
    "in": {
        "$multiply": [ "$$el", "$$el" ]     // square of element
    }
}}
Returns:
[1,4,9,16] // All array elements "squared"
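For comparison, the same thing in plain JavaScript with `Array.prototype.map`, followed by the single-element trick used in the pipeline. The `doc` object here is a made-up stand-in for one unwound document.

```javascript
// Same squaring example with the JavaScript function of the same name
var squared = [1, 2, 3, 4].map(function (el) { return el * el; });
console.log(squared);   // 1, 4, 9, 16

// The single-element trick: the input element "A" is ignored, and the
// field value takes its place, which wraps that value in an array.
var doc = { samples: { sample_id: "559de1b2aa43f47656b2a3fa" } };
var wrapped = ["A"].map(function (el) { return doc.samples.sample_id; });
console.log(wrapped);   // one element: "559de1b2aa43f47656b2a3fa"
```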
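The $literal operator has actually been around since the aggregation framework was introduced in MongoDB 2.2, but as the undocumented operator $const. While, as mentioned earlier, there is nothing wrong with "injecting" external variables into an aggregation pipeline, the one thing you cannot do is "return" such a value as a property of the document. As an expression argument this is fine most of the time, but for instance you cannot do this: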
{ "$project": {
"myfield": ["bill","ted","fred"]
}}
That results in an error, so you do this instead:
{ "$project": {
"myfield": { "$literal": ["bill","ted","fred"] }
}}
Which allows the field to be set to the array of values you intended.
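So in combination with $map in the listing, it is just a way of representing a single-element array that does not exist in the pipeline, in order to "transpose" its value with the current field.
Turning this:
"559de1b2aa43f47656b2a3fa"
With the code: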
{ "$map": {
"input": { "$literal": ["A"] },
"as": "el",
"in": "$samples.sample_id" // into this ["559de1b2aa43f47656b2a3fa"]
}}
Which makes the $setIsSubset operation look like this internally:
{ "$setIsSubset": [
    ["559de1b2aa43f47656b2a3fa"],
    ["559de1b2aa43f47656b2a3fa","559de1b2aa43f47656b2a3f9"]
]} // true
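The end result is that each variable is compared to see whether the contained value matches one of its elements, and the appropriate "field value" is sent to $sum for accumulation.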
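If it helps to see the whole thing without a server, here is a plain JavaScript sketch of what the $group stage works out, run over the already unwound sample data. The function `groupAndSum` is a made-up name for this illustration and only imitates the logic; it is not how MongoDB implements it.

```javascript
var SID1 = ['559de1b2aa43f47656b2a3fa', '559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

// Unwound sample data from the question, trimmed to the fields used here
var docs = [
    { chr: "22", pos: 14373, ref: "C", alt: "T", samples: { GTC: 0, sample_id: "559de1b2aa43f47656b2a3fa" } },
    { chr: "22", pos: 14373, ref: "C", alt: "T", samples: { GTC: 1, sample_id: "559de1b2aa43f47656b2a3f9" } },
    { chr: "22", pos: 14373, ref: "C", alt: "T", samples: { GTC: 2, sample_id: "559de1b2aa43f" } },
    { chr: "20", pos: 14371, ref: "A", alt: "G", samples: { GTC: 0, sample_id: "559de1b2aa43f47656b2a3fa" } },
    { chr: "20", pos: 14371, ref: "A", alt: "G", samples: { GTC: 1, sample_id: "559de1b2aa43f47656b2a3f9" } }
];

// Hypothetical helper: group by chr/pos/ref/alt and, for each named id list,
// add samples.GTC when the sample_id is in that list, otherwise add 0.
function groupAndSum(docs, groups) {
    var results = {};
    docs.forEach(function (doc) {
        var key = JSON.stringify([doc.chr, doc.pos, doc.ref, doc.alt]);
        if (!results[key]) {
            results[key] = { _id: { chr: doc.chr, pos: doc.pos, ref: doc.ref, alt: doc.alt } };
            Object.keys(groups).forEach(function (name) { results[key][name] = 0; });
        }
        Object.keys(groups).forEach(function (name) {
            var matches = groups[name].indexOf(doc.samples.sample_id) !== -1;  // the $cond test
            results[key][name] += matches ? doc.samples.GTC : 0;               // the $sum
        });
    });
    return Object.keys(results).map(function (key) { return results[key]; });
}

var out = groupAndSum(docs, { SID1: SID1, SID2: SID2 });
console.log(JSON.stringify(out, null, 2));
```

The totals come out as SID1 = 1 and SID2 = 2 for chr 22, and SID1 = 1 and SID2 = 0 for chr 20, matching the aggregation result above.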
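Also, drop the $project, since the field selection is generally taken care of by the $group stage, and leaving it in adds processing overhead because every document in the pipeline has to be cycled through first. So it doesn't really optimize anything, it costs you instead.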
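By the way, the sample data shown as your pipeline output so far is missing the closing "brace" on each line. I used the data below (already unwound, so of course without the $unwind in the pipeline):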
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 34, 1 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"} },
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1|0", "GQ" : 15, "DP" : 8, "HQ" : [ 5, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}},
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1/1", "GQ" : 43, "DP" : 5, "HQ" : [ 0, 2 ], "GTC" : 2, "sample_id" : "559de1b2aa43f"}},
{ "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 51, 51 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}},
{ "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "1|0", "GQ" : 48, "DP" : 8, "HQ" : [ 51, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}}