How to read Avro files from S3 in Python?
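I have a bunch of Avro files that I would like to read one by one from S3. I can read a file as bytes without any problem, but I am not sure how to iterate over its contents after that. Current code: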
import avro.schema
import boto.s3
import boto.s3.bucket
from avro.io import DatumReader

conn = boto.s3.connect_to_region("us-east-1")
my_bucket = boto.s3.bucket.Bucket(conn, "my_bucket")
my_key = my_bucket.get_key("folder/file.avro")
raw_bytes = my_key.read()
test_schema = '''
{
    "namespace": "com.company",
    "type": "record",
    "name": "MimeMessage_v2",
    "fields": [
        {
            "name": "record_timestamp",
            "type": "long"
        },
        {
            "name": "contents",
            "type": "bytes"
        }
    ],
    "message_id": 2
}
'''
schema = avro.schema.Parse(test_schema)
# this is the problematic section
dreader = DatumReader(schema, schema)
v = dreader.read(raw_bytes)
I would like to know how to correctly read the variable that holds the bytes of the Avro file.
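Here is one approach that worked for me in Python 3. An Avro file is a container with its own header, which includes the schema it was written with, and DatumReader.read expects a decoder rather than raw bytes. So wrap the bytes in a file-like object and let DataFileReader handle the container: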
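import io
import avro.io
from avro.datafile import DataFileReader

# Wrap the bytes in a file-like object that DataFileReader can read from
avro_bytes = io.BytesIO(raw_bytes)
reader = DataFileReader(avro_bytes, avro.io.DatumReader())
for record in reader:
    print(record)
reader.close()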
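Since the question mentions reading many files one by one, the same idea can be extended to a whole prefix. The following is only a rough sketch, assuming the newer boto3 client instead of the legacy boto package used in the question. The function names iter_avro_records and iter_bucket_records, and the decode parameter, are made up for illustration and are not part of any library.

```python
import io


def iter_avro_records(raw_bytes):
    """Yield each record stored in one Avro container file held in memory."""
    # Imported lazily so iter_bucket_records can be exercised without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(io.BytesIO(raw_bytes), DatumReader())
    try:
        for record in reader:
            yield record
    finally:
        reader.close()


def iter_bucket_records(s3_client, bucket, prefix, decode=iter_avro_records):
    """Yield (key, record) pairs for every .avro object under a prefix.

    s3_client is assumed to behave like boto3.client("s3"); decode is a
    hypothetical hook that turns the bytes of one file into records.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".avro"):
                continue
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            for record in decode(body):
                yield key, record
```

With boto3 and avro installed, iterating over iter_bucket_records(boto3.client("s3"), "my_bucket", "folder/") should yield every record together with the key it came from. Each file is read fully into memory before decoding, which is fine for small files but may not suit very large ones.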