Tips for increasing performance for bson.json_util dump function
I have a service that reads from Mongo and needs to dump all records sharing the same metadata_id into a local temp file. Is there a way to optimize/speed up the bson.json_util dump part?
The query part, where everything is loaded into a cursor, consistently takes under 30 seconds for a few hundred MB, but the dump part takes about an hour.
Archiving ~0.2 TB of data takes 3 days.
import gzip
import logging

from bson import ObjectId
from bson.json_util import dumps

# mongo_find, create_directory and is_gz_file are helpers defined elsewhere in the service.

def dump_snapshot_to_local_file(mongo, database, collection, metadata_id, file_path, dry_run=False):
    """
    Creates a gz archive for all documents with the same metadata_id
    """
    cursor = mongo_find(mongo.client[database][collection], match={"metadata_id": ObjectId(metadata_id)})
    path = file_path + '/' + database + '/' + collection + '/'
    create_directory(path)
    path = path + metadata_id + '.json.gz'
    ok = False
    try:
        with gzip.open(path, 'wb') as file:
            logging.info("Saving to temp location %s", path)
            file.write(b'{"documents":[')
            for document in cursor:
                if ok:
                    file.write(b',')
                ok = True
                # json_util.dumps converts each BSON document to an Extended JSON string
                file.write(dumps(document).encode())
            file.write(b']}')
    except IOError as e:
        logging.error("Failed exporting data with metadata_id %s to gz. Error: %s", metadata_id, e.strerror)
        return False
    if not is_gz_file(path):
        logging.error("Failed to create gzip file for data with metadata_id %s", metadata_id)
        return False
    logging.info("Data with metadata_id %s was successfully saved at temp location", metadata_id)
    return True
Is there a better way to do this?
Any tips would be greatly appreciated.
Since I am not using any JSONOptions functionality, and the service was spending most of its time on the json_util dump, dropping it and dumping straight to BSON with no JSON conversion turned out to save 35 minutes out of the original 40-minute run (on 1.8 million documents, ~3.5 GB):
try:
    with gzip.open(path, 'wb') as file:
        logging.info("Saving snapshot to temp location %s", path)
        for document in cursor:
            # Write each document as raw BSON bytes, skipping the Extended JSON conversion
            file.write(bson.BSON.encode(document))
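For reference, a minimal sketch of how such a raw BSON archive could be read back, using bson.decode_file_iter from PyMongo's bson package; the dump_path below is a hypothetical placeholder, not part of the original service.

import gzip

import bson  # PyMongo's bson package

# Hypothetical path to an archive produced by the snippet above.
dump_path = '/tmp/snapshot.bson.gz'

with gzip.open(dump_path, 'rb') as fh:
    # decode_file_iter lazily decodes one BSON document at a time,
    # so the whole archive never needs to fit in memory.
    for document in bson.decode_file_iter(fh):
        print(document['_id'])

Since concatenated BSON documents are self-delimiting, no surrounding JSON array, commas or closing bracket are needed, which is part of why the write path becomes so much cheaper.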