pymongo cursor 'touch' to avoid timeout

Question

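I need to fetch a large number (e.g. 100 million) of documents from a mongo (v3.2.10) collection (using Pymongo 3.3.0) and iterate over them. The iteration takes several days, and I often hit an exception caused by the cursor timing out.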

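In my case, I need to sleep for unpredictable lengths of time while iterating. For example, I might need to:

- fetch 10 documents
- sleep for 1 second
- fetch 1000 documents
- sleep for 4 hours
- fetch 1 document, and so on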

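I know I can:

- disable the timeout completely, but I'd like to avoid that if possible, because it's useful to have cursors cleaned up for me if my code stops running entirely
- reduce my cursor's batch_size, but that won't help if, as in the example above, I need to sleep for 4 hours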

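It seems a nice solution would be a way to 'touch' the cursor to keep it alive. For example, I would split a long sleep into shorter intervals and touch the cursor between each interval.

I didn't see a way to do this via pymongo, but I'm wondering whether anyone knows for sure if it's possible.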

Answer 1

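It's definitely not possible. What you want is feature SERVER-6036, which is unimplemented.

For such a long-running task I recommend querying on an indexed field. E.g. if your documents all have a timestamp "ts":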

documents = list(collection.find().sort('ts').limit(1000))
for doc in documents:
    pass  # ... process doc ...

# Stop once a query returns nothing new (also covers an empty collection).
while documents:
    ids = set(doc['_id'] for doc in documents)
    # $gte rather than $gt, so documents sharing the last 'ts' are not skipped.
    cursor = collection.find({'ts': {'$gte': documents[-1]['ts']}})
    # Avoid overlaps: drop documents already handled in the previous batch.
    documents = [doc for doc in cursor.sort('ts').limit(1000)
                 if doc['_id'] not in ids]
    for doc in documents:
        pass  # ... process doc ...

This code iterates each cursor completely, so it doesn't time out, then processes 1000 documents, then repeats for the next 1000.
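A variant of the same idea, offered as a rough sketch rather than the answer's exact method: page on the pair (ts, _id) instead of remembering the previous batch's ids, so ties on "ts" are resolved by "_id". The helper names next_page_filter and iter_in_pages are hypothetical, and the sketch assumes every document has a "ts" field and that an index covering (ts, _id) exists to keep the queries cheap.

```python
def next_page_filter(last_doc=None):
    """Build a filter for documents strictly after last_doc in (ts, _id) order.

    Hypothetical helper: assumes every document has 'ts' and '_id'.
    """
    if last_doc is None:
        return {}  # first page: no lower bound
    return {'$or': [
        {'ts': {'$gt': last_doc['ts']}},
        # same timestamp: fall back to _id as the tie-breaker
        {'ts': last_doc['ts'], '_id': {'$gt': last_doc['_id']}},
    ]}


def iter_in_pages(collection, page_size=1000):
    """Yield every document, holding no server-side cursor between pages."""
    last_doc = None
    while True:
        cursor = collection.find(next_page_filter(last_doc))
        # list() exhausts the cursor right away, so nothing is left to time out.
        page = list(cursor.sort([('ts', 1), ('_id', 1)]).limit(page_size))
        if not page:
            return
        for doc in page:
            yield doc
        last_doc = page[-1]
```

The caller can then sleep for as long as it likes between documents, e.g. `for doc in iter_in_pages(collection): ...`, because each page is already in memory by the time it is yielded.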

Second idea: configure your server with a very long cursor timeout:

mongod --setParameter cursorTimeoutMillis=21600000  # 6 hrs
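If restarting mongod is inconvenient, the same parameter can, as far as I know, also be changed on a running server with the setParameter admin command. A minimal sketch, assuming the connected user is allowed to run setParameter; the helper name set_cursor_timeout is hypothetical:

```python
def set_cursor_timeout(client, hours):
    """Set the server-wide idle cursor timeout via the setParameter command.

    Hypothetical helper: 'client' is expected to be a pymongo MongoClient
    whose user may run setParameter against the admin database.
    """
    millis = int(hours * 60 * 60 * 1000)
    # Sends {setParameter: 1, cursorTimeoutMillis: <millis>} to the server.
    return client.admin.command('setParameter', 1, cursorTimeoutMillis=millis)
```

For example, `set_cursor_timeout(MongoClient(), hours=6)` would request the same 6-hour timeout as the command line above. Note that the setting applies to every cursor on that server, not just yours.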

Third idea: you can be more certain, although not completely certain, that you'll close a non-timeout cursor by using it in a with statement:

cursor = collection.find(..., no_cursor_timeout=True)
with cursor:
    # PyMongo will try to kill cursor on server
    # if you leave this block.
    for doc in cursor:
        pass  # do stuff....

Answer 2

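For me, even no_cursor_timeout=True didn't work, so I wrote a function that saves the data from the cursor in a temporary file and then returns the documents to the caller from that file.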

from tempfile import NamedTemporaryFile
import pickle
import os

def safely_read_from_cursor(cursor):
    count = 0  # stays 0 if the cursor yields no documents
    # save data in a local file; the cursor is closed on leaving the block
    with NamedTemporaryFile(suffix='.pickle', prefix='data_', delete=False) as data_file, cursor:
        for count, doc in enumerate(cursor, 1):
            pickle.dump(doc, data_file)

    try:
        # open file again and iterate over data
        with open(data_file.name, mode="rb") as saved_file:
            for _ in range(count):
                yield pickle.load(saved_file)
    finally:
        # remove temporary file, even if the caller stops iterating early
        os.remove(data_file.name)

mongodb

pymongo