在 Mongodb 中同时出现
Cooccurrence in Mongodb
我熟悉Mysql等关系型数据库。最近我正在做一个项目,使用 MongoDb 来储存 100000 份文件。这些文档的结构是这样的:
{
"_id" : "1",
"abstract" : "In this book, we present ....",
"doi" : "xxxx/xxxxxx",
"authors" : [
"Davis, a",
"louis, X",
"CUI, Li",
"FANG, Y"
]
}
我想提取所有可能的作者组合(仅成对)的共现矩阵或共现值。预期的输出是:
{ [auth1, auth2: 5] [auth1, auth3: 1] [auth2, auth8: 9]...}
这意味着 auth1 和 auth2 coollaborate 5 次(在 5 本书中)auth2 和 auth8 合作 9 次....
在关系数据库中,可能的解决方案是:
例如在 table auth-book 中:
INSERT INTO auth-book (id_book, id_auth) VALUES
(1, 'auth1'),
(1, 'auth2'),
(1, 'auth3'),
(2, 'auth1'),
(2, 'auth5'),
(2, 'auth87'),
(2, 'auth2')...
计算作者的共现或合作的查询是:
SELECT a.id_auth a, b.id_auth b, COUNT(*) cnt
FROM auth-book a JOIN auth-booke b ON b.id_book= a.id_book AND .id_auth > a.id_auth
GROUP BY a.id_auth, b.id_auth
输出:[auth1 auth2 => 2][auth1 auth3 => 1][auth2 auth3 => 1]......等等
我不知道如何在 mongodb
中实现这样的查询
可能不是您要找的东西,但这 "works." 挑战在于创建配对列表,这是 SQL 中的自我加入所做的。将配对从查询中移出也意味着以后可以更容易地从成对更改为三元组或任何内容,因为 find()
使用 $all
函数来确保匹配任意数量的项目。
c=db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
]);
doc = c.next(); // Get single doc output from above
ulist = doc['uu'];
c.close();
// Create unique pairs:
pairs = [];
for(var index = 0; index < ulist.length; index++) {
var first = ulist[index];
for(var p2 = index+1; p2 < ulist.length; p2++) {
pairs.push([first, ulist[p2]]);
}
}
// And now go get 'em:
pairs.forEach(function(qq) {
n=db.foo.find({"authors": {$all: qq }}).count();
print(qq[0] + "-" + qq[1] + ": " + n);
});
这有点可怕,但似乎有效。基本上,我们 $lookup
反对我们自己,并利用这样一个事实,即在字符串数组中查找字符串是一个隐式的 $in
类操作。
db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
// Now unwind and go lookup into the authors array in the same collection. If $uu
// appears anywhere in the authors array, it is a match.
,{$unwind: "$uu"}
,{$lookup: { from: "foo", localField: "uu", foreignField: "authors", as: "X"}}
// OK, lots of material in the pipe here. Brace for a double unwind -- but
// in this data domain it is probably OK because not every book has the same
// set of authors, which will attenuate the result $lookup above.
,{$unwind: "$X"}
,{$unwind: "$X.authors"}
// The data is starting to be clear here. But we see two things:
// self-matching (author1->author1) and pairs of inverses
// (author1->author2 / author2->author1). We don't need self matches and we
// only need "one side" of the inverses. So... let's do a strcasecmp
// on them. 0 means the same so that's out. So pick 1 or -1 to get one side
// of the pair; doesn't really matter! This is why the OP has b.id_auth > a.id_auth
// in the SQL equivalent: to eliminate self match and one side of inverses.
,{$addFields: {q: {$strcasecmp: ["$uu", "$X.authors"] }} }
,{$match: {q: 1}}
// And now group!
,{$group: {_id: {a: "$uu", b: "$X.authors"}, n: {$sum:1}}}
]);
我熟悉Mysql等关系型数据库。最近我正在做一个项目,使用 MongoDb 来储存 100000 份文件。这些文档的结构是这样的:
{
"_id" : "1",
"abstract" : "In this book, we present ....",
"doi" : "xxxx/xxxxxx",
"authors" : [
"Davis, a",
"louis, X",
"CUI, Li",
"FANG, Y"
]
}
我想提取所有可能的作者组合(仅成对)的共现矩阵或共现值。预期的输出是: { [auth1, auth2: 5] [auth1, auth3: 1] [auth2, auth8: 9]...} 这意味着 auth1 和 auth2 coollaborate 5 次(在 5 本书中)auth2 和 auth8 合作 9 次.... 在关系数据库中,可能的解决方案是: 例如在 table auth-book 中:
INSERT INTO auth-book (id_book, id_auth) VALUES
(1, 'auth1'),
(1, 'auth2'),
(1, 'auth3'),
(2, 'auth1'),
(2, 'auth5'),
(2, 'auth87'),
(2, 'auth2')...
计算作者的共现或合作的查询是:
SELECT a.id_auth a, b.id_auth b, COUNT(*) cnt
FROM auth-book a JOIN auth-booke b ON b.id_book= a.id_book AND .id_auth > a.id_auth
GROUP BY a.id_auth, b.id_auth
输出:[auth1 auth2 => 2][auth1 auth3 => 1][auth2 auth3 => 1]......等等
我不知道如何在 mongodb
中实现这样的查询可能不是您要找的东西,但这 "works." 挑战在于创建配对列表,这是 SQL 中的自我加入所做的。将配对从查询中移出也意味着以后可以更容易地从成对更改为三元组或任何内容,因为 find()
使用 $all
函数来确保匹配任意数量的项目。
c=db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
]);
doc = c.next(); // Get single doc output from above
ulist = doc['uu'];
c.close();
// Create unique pairs:
pairs = [];
for(var index = 0; index < ulist.length; index++) {
var first = ulist[index];
for(var p2 = index+1; p2 < ulist.length; p2++) {
pairs.push([first, ulist[p2]]);
}
}
// And now go get 'em:
pairs.forEach(function(qq) {
n=db.foo.find({"authors": {$all: qq }}).count();
print(qq[0] + "-" + qq[1] + ": " + n);
});
这有点可怕,但似乎有效。基本上,我们 $lookup
反对我们自己,并利用这样一个事实,即在字符串数组中查找字符串是一个隐式的 $in
类操作。
db.foo.aggregate([
// Create single deduped list of authors:
{$unwind: "$authors"}
,{$group: {_id:null, uu: {$addToSet: "$authors"} }}
// Now unwind and go lookup into the authors array in the same collection. If $uu
// appears anywhere in the authors array, it is a match.
,{$unwind: "$uu"}
,{$lookup: { from: "foo", localField: "uu", foreignField: "authors", as: "X"}}
// OK, lots of material in the pipe here. Brace for a double unwind -- but
// in this data domain it is probably OK because not every book has the same
// set of authors, which will attenuate the result $lookup above.
,{$unwind: "$X"}
,{$unwind: "$X.authors"}
// The data is starting to be clear here. But we see two things:
// self-matching (author1->author1) and pairs of inverses
// (author1->author2 / author2->author1). We don't need self matches and we
// only need "one side" of the inverses. So... let's do a strcasecmp
// on them. 0 means the same so that's out. So pick 1 or -1 to get one side
// of the pair; doesn't really matter! This is why the OP has b.id_auth > a.id_auth
// in the SQL equivalent: to eliminate self match and one side of inverses.
,{$addFields: {q: {$strcasecmp: ["$uu", "$X.authors"] }} }
,{$match: {q: 1}}
// And now group!
,{$group: {_id: {a: "$uu", b: "$X.authors"}, n: {$sum:1}}}
]);