How to read parquet file from Azure Python function blob input binding?
I have a Python function with a blob input binding. The blob in question contains a parquet file. Ultimately I want to read the bound blob into a pandas dataframe, but I am not sure of the correct way to do this.
I have verified that the binding is set up correctly, and I have been able to read a plain text file successfully. I am also confident that the parquet file itself is intact, because I have been able to read it using the example provided here: https://arrow.apache.org/docs/python/parquet.html#reading-a-parquet-file-from-azure-blob-storage
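A minimal sketch of that standalone verification, assuming the v12 azure-storage-blob SDK rather than the older SDK shown in the linked pyarrow docs; the connection string, container, and blob names below are placeholders:

from io import BytesIO

import pyarrow.parquet as pq
from azure.storage.blob import BlobServiceClient

# Placeholder values -- substitute your own storage account details.
CONN_STR = "<storage-connection-string>"

service = BlobServiceClient.from_connection_string(CONN_STR)
blob_client = service.get_blob_client(container="<container>", blob="file.parquet")

# Download the blob into memory and read the parquet content from a buffer.
blob_bytes = blob_client.download_blob().readall()
table = pq.read_table(BytesIO(blob_bytes))
print(table.num_rows)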
The following code shows what I am trying to do:
import logging
import io

import azure.functions as func
import pyarrow.parquet as pq

def main(req: func.HttpRequest, inputblob: func.InputStream) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    # Create a bytestream to hold the blob content
    byte_stream = io.BytesIO()
    byte_stream.write(inputblob.read())
    df = pq.read_table(source=byte_stream).to_pandas()
I get the following error message:
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Here is my function.json file:
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "authLevel": "function",
      "type": "httpTrigger",
      "direction": "in",
      "name": "req",
      "methods": [
        "get",
        "post"
      ]
    },
    {
      "type": "http",
      "direction": "out",
      "name": "$return"
    },
    {
      "name": "inputblob",
      "type": "blob",
      "path": "<container>/file.parquet",
      "connection": "AzureWebJobsStorage",
      "direction": "in"
    }
  ]
}
My host.json file:
{
  "version": "2.0",
  "functionTimeout": "00:10:00",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[1.*, 2.0.0)"
  }
}
I was dealing with the same problem, and this solution worked for me.
__init__.py file:
from io import BytesIO

import azure.functions as func
import pandas as pd

def main(blobTrigger: func.InputStream):
    # Read the blob as bytes and wrap it in an in-memory buffer
    blob_bytes = blobTrigger.read()
    blob_to_read = BytesIO(blob_bytes)
    # Let pandas (with the pyarrow engine) parse the parquet content
    df = pd.read_parquet(blob_to_read, engine='pyarrow')
    print("Length of the parquet file: " + str(len(df.index)))