
Tips for increasing performance for bson.json_util dump function

I have a service that reads from mongo and needs to dump all records with the same metadata_id into a local temporary file. Is there a way to optimize/speed up the bson.json_util dump part? The query part, where everything is loaded into a cursor, always takes under 30 seconds for hundreds of MBs, but the dump part takes around 1 hour.

Archiving ~0.2 TB of data takes 3 days.

import gzip
import logging

from bson import ObjectId
from bson.json_util import dumps


def dump_snapshot_to_local_file(mongo, database, collection, metadata_id, file_path, dry_run=False):
    """
       Creates a gz archive for all documents with the same metadata_id
    """

    cursor = mongo_find(mongo.client[database][collection], match={"metadata_id": ObjectId(metadata_id)})
    path = file_path + '/' + database + '/' + collection + '/'
    create_directory(path)
    path = path + metadata_id + '.json.gz'

    ok = False
    try:
        with gzip.open(path, 'wb') as file:
            logging.info("Saving to temp location %s", path)
            file.write(b'{"documents":[')
            for document in cursor:
                # comma-separate every document after the first one
                if ok:
                    file.write(b',')
                ok = True
                file.write(dumps(document).encode())
            file.write(b']}')
    except IOError as e:
        logging.error("Failed exporting data with metadata_id %s to gz. Error: %s", metadata_id, e.strerror)
        return False

    if not is_gz_file(path):
        logging.error("Failed to create gzip file for data with metadata_id %s", metadata_id)
        return False

    logging.info("Data with metadata_id %s was successfully saved at temp location", metadata_id)
    return True

Is there a better way to do this?

Any tips would be greatly appreciated.

Since I wasn't using any JSONOptions functionality, and the service was spending most of its time on the json_util dump, dropping it and dumping straight to BSON, with no JSON conversion at all, proved to save 35 minutes out of the original 40-minute load (on 1.8 million documents, ~3.5 GB):

    try:
        with gzip.open(path, 'wb') as file:
            logging.info("Saving snapshot to temp location %s", path)
        for document in cursor:
            # write each document as raw BSON bytes (requires `import bson`),
            # skipping the json_util/JSON conversion entirely
            file.write(bson.BSON.encode(document))
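
For completeness, here is a minimal sketch of how such an archive could be read back later, assuming pymongo's bson.decode_file_iter (the helper name is just a placeholder, not part of the original service):

import gzip

import bson


def iter_snapshot_documents(path):
    """Yields documents back from a gzipped stream of concatenated BSON docs."""
    with gzip.open(path, 'rb') as file:
        # decode_file_iter parses one BSON document at a time until EOF,
        # so the whole archive never has to fit in memory at once
        for document in bson.decode_file_iter(file):
            yield document

Since the file now contains raw BSON rather than JSON, an extension such as .bson.gz describes the content better than the original .json.gz.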