How to read parquet files from an AWS S3 bucket and save them as JSON in Jupyter

Question

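I am working in Python in a Jupyter notebook. I am trying to read all the parquet files in a folder of an AWS S3 bucket and save them as JSON files in a folder in my Jupyter directory. I have the following code, but I believe it only reads them, and I want to save them as JSON. Thanks!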

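import boto3

# Assumption: `response` in the original snippet was a boto3 S3 resource.
s3 = boto3.resource('s3')
bucketname = 'my-bucket'
bucket = s3.Bucket(bucketname)
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()  # raw bytes held in memory; nothing is written to disk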

Answer 1

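If I understand your question correctly, you want to download the files to your file system rather than load them into memory. Here is a sample snippet that does that.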

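import boto3

s3 = boto3.resource('s3')  # assumption: the original `response` was a boto3 S3 resource
bucketname = 'my-bucket'
bucket = s3.Bucket(bucketname)
for obj in bucket.objects.all():
    # obj is an ObjectSummary; Object() returns the full Object, which can download itself
    obj.Object().download_file('<specify-the-local-filename>')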

You can find the documentation here.
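For the local filename, one option is to derive it from each object's key. The following is only a minimal sketch, assuming `bucket` is a boto3 Bucket resource like the one above. The helper names `local_path_for` and `download_parquet_files`, and the `prefix` and `dest_dir` arguments, are hypothetical names chosen for illustration.

```python
import os


def local_path_for(key, dest_dir):
    """Map an S3 key such as 'data/part-0.parquet' to a path inside dest_dir."""
    return os.path.join(dest_dir, os.path.basename(key))


def download_parquet_files(bucket, prefix, dest_dir):
    """Download every .parquet object under prefix and return the local paths.

    `bucket` is assumed to be a boto3 Bucket resource.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for obj in bucket.objects.filter(Prefix=prefix):
        if not obj.key.endswith(".parquet"):
            continue  # skip folder placeholder keys and other file types
        target = local_path_for(obj.key, dest_dir)
        bucket.download_file(obj.key, target)
        downloaded.append(target)
    return downloaded
```

Note that this flattens the key hierarchy, so two keys with the same base name in different S3 folders would overwrite each other locally.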

Answer 2

The parquet pip module will do this: https://pypi.org/project/parquet/. They also have an example, copied here for quick reference:

import parquet
import json

## assuming parquet file with two rows and three columns:
## foo bar baz
## 1   2   3
## 4   5   6

with open("test.parquet", "rb") as fo:  # parquet is a binary format, so open in binary mode
   # prints:
   # {"foo": 1, "bar": 2}
   # {"foo": 4, "bar": 5}
   for row in parquet.DictReader(fo, columns=['foo', 'bar']):
       print(json.dumps(row))
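The example above only prints each row. To save the rows to a file instead, something along these lines might work. It is a rough sketch: `write_json_lines` is a hypothetical helper name, the output is one JSON object per line (JSON Lines) rather than a single JSON array, and `default=str` is an assumption made here so that values `json` cannot serialize natively (dates, for example) are written as strings.

```python
import json


def write_json_lines(rows, out_path):
    """Write an iterable of dict rows to out_path, one JSON object per line.

    Returns the number of rows written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for row in rows:
            # default=str turns values json cannot encode natively into strings
            out.write(json.dumps(row, default=str) + "\n")
            count += 1
    return count


# Usage with the reader from the example above (not run here):
#
#   import parquet
#   with open("test.parquet", "rb") as fo:
#       write_json_lines(parquet.DictReader(fo, columns=['foo', 'bar']), "test.json")
```

Because the helper accepts any iterable of dicts, it can be looped over the parquet files once they have been downloaded from the bucket.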

