MongoDB: ignore large documents (BSON > 16 MB) during collection.aggregate()

Question

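I am scanning a MongoDB collection that contains large documents whose BSON size exceeds 16 MB. Essentially, depending on a random-sampling flag, I call one of the following two: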

documents = collection.aggregate(
                [{"$sample": {"size": sample_size}}], allowDiskUse=True)

or

documents = collection.aggregate(
                [{"$limit": sample_size}], allowDiskUse=True)

Here sample_size is a parameter.

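The problem is that this command hangs for several minutes on the large BSON documents, after which MongoDB aborts execution, so my scan of the whole collection never completes.

Is there a way to tell MongoDB to skip/ignore documents whose size exceeds a threshold?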

For those who think that MongoDB cannot store values larger than 16 MB, here is the error message by a metadata collector (LinkedIn DataHub):

OperationFailure: BSONObj size: 17375986 (0x10922F2) is invalid. 
Size must be between 0 and 16793600(16MB) First element: _id: "Topic XYZ",
full error: {'operationTime': Timestamp(1634531126, 2), 'ok': 0.0, 'errmsg': 'BSONObj size: 17375986 (0x10922F2) is invalid. Size must be between 0 and 16793600(16MB)

Answer 1

The maximum document size is 16 MB (see the MongoDB documentation on limits).
(The exception is the GridFS specification.)

In your collection, every document is already smaller than 16 MB; MongoDB does not allow us to store larger documents.

If you want to filter, say for documents < 10 MB,
you can use the "$bsonSize" operator to get each document's size and filter out the large ones.
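
The filter described above could be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: the helper names build_size_filtered_pipeline and scan, the max_bytes parameter, and the 10 MB default are assumptions made for illustration. It also assumes MongoDB 4.4 or later, where "$bsonSize" is available, and it is not confirmed by the question whether the server can evaluate "$bsonSize" on a document that already exceeds the 16 MB limit.

```python
def build_size_filtered_pipeline(sample_size, max_bytes=10 * 1024 * 1024,
                                 random_sample=True):
    """Build a pipeline that drops documents of max_bytes or more.

    Hypothetical helper: the name and the 10 MB default are illustrative.
    """
    # "$$ROOT" refers to the whole document currently being processed,
    # so "$bsonSize" yields that document's size in bytes.
    size_filter = {
        "$match": {"$expr": {"$lt": [{"$bsonSize": "$$ROOT"}, max_bytes]}}
    }
    # Mirror the two calls from the question: random sample or plain limit.
    if random_sample:
        pick = {"$sample": {"size": sample_size}}
    else:
        pick = {"$limit": sample_size}
    # Filter first so the sampling stage only sees documents under the threshold.
    return [size_filter, pick]


def scan(collection, sample_size, random_sample=True):
    """Run the filtered pipeline on a pymongo collection (hypothetical helper)."""
    pipeline = build_size_filtered_pipeline(sample_size,
                                            random_sample=random_sample)
    return collection.aggregate(pipeline, allowDiskUse=True)
```

With a pymongo collection object, documents = scan(collection, sample_size) would replace either of the two aggregate calls in the question. Putting the "$match" stage first means the sample or limit is taken only from documents below the threshold.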


mongodb

pymongo