Match on key from two queries in a single query

Question

I have time series data in MongoDB as follows:

{ 
    "_id" : ObjectId("558912b845cea070a982d894"),
    "code" : "ZL0KOP",
    "time" : NumberLong("1420128024000"),
    "direction" : "10", 
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d895"), 
    "code" : "AQ0ZSQ", 
    "time" : NumberLong("1420128025000"), 
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d896"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420128003000"),
    "direction" : "10", 
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d897"), 
    "code" : "ZL0KOP",
    "time" : NumberLong("1420041724000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d89e"),
    "code" : "YBUHCW",
    "time" : NumberLong("1420041732000"),
    "direction" : "10",
    "siteId" : "0002"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d8a1"),
    "code" : "U48AIW",
    "time" : NumberLong("1420041729000"),
    "direction" : "10",
    "siteId" : "0002"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d8a0"), 
    "code" : "OJ3A06",
    "time" : NumberLong("1420300927000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d89d"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420300885000"),
    "direction" : "10",
    "siteId" : "0003"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d8a2"),
    "code" : "ZLV05H",
    "time" : NumberLong("1420300922000"),
    "direction" : "10",
    "siteId" : "0001"
}
{
    "_id" : ObjectId("558912b845cea070a982d8a3"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420300928000"),
    "direction" : "10", 
    "siteId" : "0000"
}

I need to find the codes that match two or more conditions. For example:

condition1: 1420128000000 < time < 1420128030000, siteId == 0000
condition2: 1420300880000 < time < 1420300890000, siteId == 0003

Result for the first condition:

{ 
    "_id" : ObjectId("558912b845cea070a982d894"),
    "code" : "ZL0KOP",
    "time" : NumberLong("1420128024000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d895"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420128025000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d896"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420128003000"),
    "direction" : "10",
    "siteId" : "0000"
}

Result for the second condition:

{ 
    "_id" : ObjectId("558912b845cea070a982d89d"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420300885000"),
    "direction" : "10",
    "siteId" : "0003"
}

The only code that meets all of the conditions above should be:

{"code" : "AQ0ZSQ", "count":2}

"count" means that the code "AQ0ZSQ" appeared in both conditions.

The only solution I can think of is to use two queries. For example, using Python:

result1 = list(db.codes.objects({'time': {'$gt': 1420128000000, '$lt': 1420128030000}, 'siteId': "0000"}).only("code"))
result2 = list(db.codes.objects({'time': {'$gt': 1420300880000, '$lt': 1420300890000}, 'siteId': "0003"}).only("code"))

Then I find the codes shared by both results.

The problem is that there are millions of documents in the collection, and both queries can easily exceed the 16 MB limit.

So is it possible to do this in a single query? Or should I change the document structure?

Answer 1

What you are asking for here requires the aggregation framework, so that the intersection between the results is worked out on the server.

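The first part of the logic is that you need an $or query for the two conditions, followed by some additional projection and filtering on those results: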

db.collection.aggregate([
    // Fetch all possible documents for consideration
    { "$match": {
        "$or": [
            { 
                "time": { "$gt": 1420128000000, "$lt": 1420128030000 },
                "siteId": "0000"
            },
            {
                "time": { "$gt": 1420300880000, "$lt": 1420300890000 },
                "siteId": "0003"
            }
        ]
    }},

    // Logically compare the conditions against results and add a score
    { "$project": {
        "code": "$code",
        "score": { "$add": [
            { "$cond": [ 
                { "$and":[
                    { "$gt": [ "$time", 1420128000000 ] },
                    { "$lt": [ "$time", 1420128030000 ] },
                    { "$eq": [ "$siteId", "0000" ] }
                ]},
                1,
                0
            ]},
            { "$cond": [ 
                { "$and":[
                    { "$gt": [ "$time", 1420300880000 ] },
                    { "$lt": [ "$time", 1420300890000 ] },
                    { "$eq": [ "$siteId", "0003" ] }
                ]},
                1,
                0
            ]}
        ]}
    }},

    // Now Group the results by "code"
    { "$group": {
        "_id": "$code",
        "score": { "$sum": "$score" }
    }},

    // Now filter to keep only results with score 2
    { "$match": { "score": 2 } }
])

So let's break it down and see how it works.

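First, you need a $match query to fetch "all" of the documents that could possibly take part in the "intersection". A candidate document only has to satisfy either set of conditions, which is what the $or expression allows here. You need all of them in order to work out the "intersection".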

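In the second stage, the $project pipeline stage, a boolean test of your conditions is performed against each set. Notice the usage of $and here, as well as the other boolean operators of the aggregation framework, which take a slightly different form from their query counterparts.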

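In the aggregation framework form (outside of $match, which uses the normal query operators), these operators take an array of arguments, typically the "two" values to compare, rather than being assigned to the "right" of a field name.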
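To make the difference between the two forms concrete, here is a minimal sketch written as plain Python dictionaries. The `evaluate` helper is hypothetical and only mimics a tiny subset of the expression operators for illustration; it is not part of MongoDB or any driver, and it ignores MongoDB's rules for comparing values of different types.

```python
# Query form (used inside $match): the field name is the key and the
# operators are assigned to the "right" of it.
query_form = {"time": {"$gt": 1420128000000, "$lt": 1420128030000}}

# Expression form (used in $project, $group, ...): each operator takes an
# array of arguments, and "$time" refers to the value of the "time" field.
expression_form = {"$and": [
    {"$gt": ["$time", 1420128000000]},
    {"$lt": ["$time", 1420128030000]},
]}


def evaluate(expr, doc):
    """Toy evaluator for a few expression operators (hypothetical helper)."""
    if isinstance(expr, str) and expr.startswith("$"):
        return doc.get(expr[1:])      # field reference such as "$time"
    if not isinstance(expr, dict):
        return expr                   # literal value
    (op, args), = expr.items()        # an expression has exactly one operator key
    values = [evaluate(arg, doc) for arg in args]
    if op == "$gt":
        return values[0] > values[1]
    if op == "$lt":
        return values[0] < values[1]
    if op == "$eq":
        return values[0] == values[1]
    if op == "$and":
        return all(values)
    raise ValueError("unsupported operator: " + op)


doc = {"code": "ZL0KOP", "time": 1420128024000, "siteId": "0000"}
in_range = evaluate(expression_form, doc)
print(in_range)  # True
```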

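Since these conditions are logical or "boolean", we want to return the result as a "numeric" value rather than true/false. That is what $cond does here: where the condition is true for the document inspected, a score of 1 is emitted, otherwise 0.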

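Finally, in this $project expression both of your conditions are wrapped with $add to form the "score" result. So if none of the conditions is true (not possible after the $match) the score would be 0, if "one" is true it is 1, and if "both" are true it is 2.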

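Note that with the specific conditions asked for here, a single document can never score more than 1, since no document can fall in both time ranges or have "two" "siteId" values.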
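The per-document scoring can be sketched in plain Python as follows. This is only an illustration of what the $cond and $add expressions compute for one document; the `score` function is a hypothetical stand-in, not something the server exposes.

```python
def score(doc):
    """Mimic the $cond/$add scoring: 1 for each condition the document meets."""
    cond1 = 1420128000000 < doc["time"] < 1420128030000 and doc["siteId"] == "0000"
    cond2 = 1420300880000 < doc["time"] < 1420300890000 and doc["siteId"] == "0003"
    return (1 if cond1 else 0) + (1 if cond2 else 0)


docs = [
    {"code": "ZL0KOP", "time": 1420128024000, "siteId": "0000"},  # first condition
    {"code": "AQ0ZSQ", "time": 1420300885000, "siteId": "0003"},  # second condition
    {"code": "OJ3A06", "time": 1420300927000, "siteId": "0000"},  # neither
]
scores = [score(d) for d in docs]
print(scores)  # [1, 1, 0]
```

Each document scores at most 1 because the two time ranges do not overlap and a document has a single "siteId".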

Now the important part is to $group by the "code" value and $sum the score values, giving a total score per "code".

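This leaves the final $match stage of the pipeline to keep only those documents whose "score" equals the number of conditions you asked for. In this case, 2.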

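There is a failing here, however: where a "code" matches either condition more than once (as it does in the sample data), the "score" will be incorrect. "AQ0ZSQ" matches the first condition twice and the second once, so its summed score is 3 rather than 2, and the final $match drops it.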
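The failing can be reproduced without a server by simulating the score-and-sum pipeline in plain Python on a subset of the sample documents. This is a rough sketch of the pipeline's logic, not of how MongoDB executes it; the helper names are made up for the example.

```python
from collections import Counter

# (code, time, siteId) for the four matching sample documents plus one that matches nothing
DOCS = [
    ("ZL0KOP", 1420128024000, "0000"),
    ("AQ0ZSQ", 1420128025000, "0000"),
    ("AQ0ZSQ", 1420128003000, "0000"),
    ("AQ0ZSQ", 1420300885000, "0003"),
    ("OJ3A06", 1420300927000, "0000"),
]


def in_first(time, site):
    return 1420128000000 < time < 1420128030000 and site == "0000"


def in_second(time, site):
    return 1420300880000 < time < 1420300890000 and site == "0003"


totals = Counter()
for code, time, site in DOCS:
    doc_score = int(in_first(time, site)) + int(in_second(time, site))
    if doc_score:                  # the $match/$or stage drops documents meeting neither
        totals[code] += doc_score  # $group by code with $sum of the scores

result = {code: s for code, s in totals.items() if s == 2}  # final $match
print(dict(totals))  # {'ZL0KOP': 1, 'AQ0ZSQ': 3}
print(result)        # {}
```

The sum counts matching documents rather than matching conditions, so a code that matched one condition twice and the other not at all would also reach 2 and be kept by mistake.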

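So having introduced the principles of using logical operators in aggregation, you can fix that fault by essentially "tagging" each result with the logical "set" of conditions it belongs to. Then you only need to consider which "code" values appeared in "both" sets: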

db.collection.aggregate([
    { "$match": {
        "$or": [
            { 
                "time": { "$gt": 1420128000000, "$lt": 1420128030000 },
                "siteId": "0000"
            },
            {
                "time": { "$gt": 1420300880000, "$lt": 1420300890000 },
                "siteId": "0003"
            }
        ]
    }},

    // If it's the first logical condition it's "A" otherwise it can
    // only be the other, therefore "B". Extend for more sets as needed.
    { "$group": {
        "_id": {
            "code": "$code",
            "type": { "$cond": [ 
                { "$and":[
                    { "$gt": [ "$time", 1420128000000 ] },
                    { "$lt": [ "$time", 1420128030000 ] },
                    { "$eq": [ "$siteId", "0000" ] }
                ]},
                "A",
                "B"
            ]}
        }
    }},

    // Simply add up the results for each "type"
    { "$group": {
        "_id": "$_id.code",
        "score": { "$sum": 1 }
    }},

    // Now filter to keep only results with score 2
    { "$match": { "score": 2 } }
])
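As a sanity check on the logic, the same "tag, de-duplicate, then count" steps can be simulated in plain Python against the sample data from the question. This is a sketch under the assumption that the two $group stages behave like a set of (code, type) pairs followed by a count per code; the `tag` helper is hypothetical.

```python
from collections import Counter

# (code, time, siteId) for the ten sample documents in the question
DOCS = [
    ("ZL0KOP", 1420128024000, "0000"),
    ("AQ0ZSQ", 1420128025000, "0000"),
    ("AQ0ZSQ", 1420128003000, "0000"),
    ("ZL0KOP", 1420041724000, "0000"),
    ("YBUHCW", 1420041732000, "0002"),
    ("U48AIW", 1420041729000, "0002"),
    ("OJ3A06", 1420300927000, "0000"),
    ("AQ0ZSQ", 1420300885000, "0003"),
    ("ZLV05H", 1420300922000, "0001"),
    ("AQ0ZSQ", 1420300928000, "0000"),
]


def tag(time, site):
    """Return "A" or "B" for the condition met, or None (dropped by $match)."""
    if 1420128000000 < time < 1420128030000 and site == "0000":
        return "A"
    if 1420300880000 < time < 1420300890000 and site == "0003":
        return "B"
    return None


# First $group: one entry per distinct (code, type) pair
pairs = set()
for code, time, site in DOCS:
    condition = tag(time, site)
    if condition is not None:
        pairs.add((code, condition))

# Second $group: count how many distinct types each code appeared in
counts = Counter(code for code, _ in pairs)

# Final $match: keep codes seen in both sets
result = {code: n for code, n in counts.items() if n == 2}
print(result)  # {'AQ0ZSQ': 2}
```

Because duplicates collapse in the first grouping, "AQ0ZSQ" counts once for "A" and once for "B", which gives the expected result of 2.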

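If this is your first time using the aggregation framework, it may be a bit hard to take in. Take the time to look up the operators used here in the documentation, and also look at the Aggregation Pipeline Operators reference in general.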

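Beyond simple data selection, this is the tool you should be reaching for most often when using MongoDB, so you would do well to understand all of the operations it makes possible.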

