Azure Function Python write to Azure DataLake Gen2

I want to write a file to my Azure DataLake Gen2 using an Azure Function and Python.

Unfortunately, I am running into the following authentication problem:

Exception: ClientAuthenticationError: (InvalidAuthenticationInfo) Server failed to authenticate the request. Please refer to the information in the www-authenticate header.

'WWW-Authenticate': 'REDACTED'

Both my account and the Function app should have the necessary roles assigned to access my DataLake.

Here is my function:

import datetime
import logging

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
import azure.functions as func

def main(mytimer: func.TimerRequest) -> None:
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()

    if mytimer.past_due:
        logging.info('The timer is past due!')

    credential = DefaultAzureCredential()
    service_client = DataLakeServiceClient(account_url="https://<datalake_name>.dfs.core.windows.net", credential=credential)

    file_system_client = service_client.get_file_system_client(file_system="temp")
    directory_client = file_system_client.get_directory_client("test")
    file_client = directory_client.create_file("uploaded-file.txt")
    
    file_contents = 'some data'
    file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
    file_client.flush_data(len(file_contents))


    logging.info('Python timer trigger function ran at %s', utc_timestamp)

What am I missing?

Thanks and best regards,

Peter

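The problem seems to come from DefaultAzureCredential.

The identity that DefaultAzureCredential uses depends on the environment. When an access token is needed, it requests one using the following identities in turn, stopping as soon as one of them provides a token: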

1. A service principal configured by environment variables. 
2. An Azure managed identity. 
3. On Windows only: a user who has signed in with a Microsoft application, such as Visual Studio.
4. The user currently signed in to Visual Studio Code.
5. The identity currently logged in to the Azure CLI.
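
To make the "try in turn, stop at the first token" behaviour concrete, here is a toy model in plain Python. It is not the azure-identity implementation: the `Provider` type, the `first_available_token` helper and the token strings are all made up for illustration. It only shows why the identity that ends up being used can differ between your local machine and the Function app.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-in for a credential: a name plus a function that
# returns a token string, or None when that identity is not available.
Provider = Tuple[str, Callable[[], Optional[str]]]


def first_available_token(providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and stop at the first one that yields a token."""
    for name, get_token in providers:
        token = get_token()
        if token is not None:
            return name, token
    raise RuntimeError("no identity could provide a token")


# Illustrative chain: no environment variables set, managed identity present.
chain = [
    ("environment", lambda: None),
    ("managed_identity", lambda: "token-from-managed-identity"),
    ("azure_cli", lambda: "token-from-cli"),
]

name, token = first_available_token(chain)
print(name)
```

In this made-up chain the CLI identity is never consulted, because an earlier entry already returned a token. The identity that wins is the one that needs the role on the DataLake.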

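In fact, you can create the DataLake service object without using the default credential at all. You can do it like this (connecting directly with a connection string):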

import logging
import datetime

from azure.storage.filedatalake import DataLakeServiceClient
import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    connect_str = "DefaultEndpointsProtocol=https;AccountName=0730bowmanwindow;AccountKey=xxxxxx;EndpointSuffix=core.windows.net"
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()

    service_client = DataLakeServiceClient.from_connection_string(connect_str)

    file_system_client = service_client.get_file_system_client(file_system="test")
    directory_client = file_system_client.get_directory_client("test")
    file_client = directory_client.create_file("uploaded-file.txt")
    
    file_contents = 'some data'
    file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
    file_client.flush_data(len(file_contents))

    return func.HttpResponse(
            "Test.",
            status_code=200
    )

Also, to make sure the data can be written, check whether your DataLake has any access restrictions.

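The function suggested by Bowman Zhu contains an error. According to the Azure documentation, the "length" parameter expects the length in bytes. The suggested function, however, uses the length in characters. Some characters can consist of multiple bytes. In that case, the function will not write all bytes of file_contents to the file, which results in data loss!

Therefore,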

file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
file_client.flush_data(len(file_contents))

must be changed to:

length = len(file_contents.encode())
file_client.append_data(data=file_contents, offset=0, length=length)
file_client.flush_data(offset=length)
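
The difference between the two lengths can be checked without any Azure call. The sketch below assumes UTF-8 encoding; `utf8_payload` is a hypothetical helper name, not part of the Azure SDK. It encodes the text once so that the data and the length passed on always match.

```python
from typing import Tuple


def utf8_payload(text: str) -> Tuple[bytes, int]:
    """Encode text once and return the bytes together with their byte length."""
    data = text.encode("utf-8")
    return data, len(data)


# "Gruesse" spelled with u-umlaut and sharp s: 5 characters,
# two of which take 2 bytes each in UTF-8.
sample = "Gr\u00fc\u00dfe"
data, length = utf8_payload(sample)

print(len(sample), length)
```

With such a helper, `data` and `length` would go to append_data and `length` to flush_data, in the same way as in the corrected lines above.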