Read ORC file from S3 to Pandas
I'm trying to read an ORC file from S3 into a Pandas dataframe. My version of pandas has no pd.read_orc(...).
I tried this:
import boto3

session = boto3.Session()
s3_client = session.client('s3')
s3_key = "my_object_key"
data = s3_client.get_object(
    Bucket='my_bucket',
    Key=s3_key
)
orc_bytes = data['Body'].read()
This reads the object as bytes.
Now I try this:
orc_data = pyorc.Reader(orc_bytes)
But it fails with:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-deaabe8232ce> in <module>
----> 1 data = pyorc.Reader(orc_data)
/anaconda3/envs/linear_opt_3.7/lib/python3.7/site-packages/pyorc/reader.py in __init__(self, fileo, batch_size, column_indices, column_names, struct_repr, converters)
65 conv = converters
66 super().__init__(
---> 67 fileo, batch_size, column_indices, column_names, struct_repr, conv
68 )
69
TypeError: Parameter must be a file-like object, but `<class 'bytes'>` was provided
Ultimately I'd like to get it into .csv format or something else I can read into pandas. Is there a better way to do this?
Try wrapping the S3 data in io.BytesIO:
import io
orc_bytes = io.BytesIO(data['Body'].read())
orc_data = pyorc.Reader(orc_bytes)
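To illustrate why the wrapper helps, here is a minimal sketch using only the standard library. The `raw` payload and the `is_file_like` helper are hypothetical, and the exact set of methods pyorc checks for is an assumption; the point is only that a bytes object has no file methods while io.BytesIO does.

```python
import io

# Hypothetical stand-in for the bytes returned by data['Body'].read()
raw = b"ORC example payload"


def is_file_like(obj):
    """Return True if obj has the read/seek/tell methods a file reader typically uses."""
    return all(hasattr(obj, name) for name in ("read", "seek", "tell"))


# Wrap the in-memory bytes so they can be read and seeked like an open file
wrapped = io.BytesIO(raw)

print(is_file_like(raw))      # False: plain bytes have no file methods
print(is_file_like(wrapped))  # True: BytesIO behaves like a binary file

# Random access works on the wrapper, which a columnar format reader relies on
wrapped.seek(4)
print(wrapped.read(7))        # b'example'
```

Note that this still holds the whole object in memory, which is fine for small files but may not suit very large ones.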
Here is a function that solves the problem:
import io

import boto3
import pandas as pd
import pyorc

session = boto3.Session()
s3_client = session.client('s3')

def load_s3_orc_to_local_df(key, bucket):
    # Download the whole object and wrap the bytes in a file-like buffer
    data = s3_client.get_object(Bucket=bucket, Key=key)
    orc_bytes = io.BytesIO(data['Body'].read())
    reader = pyorc.Reader(orc_bytes)
    # Column names come from the top-level fields of the ORC schema
    schema = reader.schema
    columns = [item for item in schema.fields]
    # Iterating the reader yields one row at a time
    rows = [row for row in reader]
    df = pd.DataFrame(data=rows, columns=columns)
    return df
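Since the question also asks about .csv output, the same column names and rows could be written out as CSV without going through pandas. This is only a sketch: `rows_to_csv` is a hypothetical helper, the sample data is made up, and it assumes each row is a tuple of plain values (nested or binary ORC types would need extra handling).

```python
import csv
import io


def rows_to_csv(columns, rows):
    """Serialize a header and an iterable of row tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)   # header line
    writer.writerows(rows)     # one CSV line per row tuple
    return buffer.getvalue()


# Made-up sample standing in for the schema field names and the reader's rows
columns = ["id", "name"]
rows = [(1, "alice"), (2, "bob")]

csv_text = rows_to_csv(columns, rows)
print(csv_text)
```

If a DataFrame is already available, calling its to_csv method is the shorter route.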