OverflowError when reading from S3 - signed integer is greater than maximum

I am reading a large file (>5GB) from S3 into Lambda with the following code:

import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    
    response = s3.get_object(
        Bucket="my-bucket",
        Key="my-key"
    )
    
    text_bytes = response['Body'].read()

    ...
    
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

However, I get the following error:

"errorMessage": "signed integer is greater than maximum"
"errorType": "OverflowError"
"stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 13, in lambda_handler\n    text_bytes = response['Body'].read()\n"
    "  File \"/var/runtime/botocore/response.py\", line 77, in read\n    chunk = self._raw_stream.read(amt)\n"
    "  File \"/var/runtime/urllib3/response.py\", line 515, in read\n    data = self._fp.read() if not fp_closed else b\"\"\n"
    "  File \"/var/lang/lib/python3.8/http/client.py\", line 472, in read\n    s = self._safe_read(self.length)\n"
    "  File \"/var/lang/lib/python3.8/http/client.py\", line 613, in _safe_read\n    data = self.fp.read(amt)\n"
    "  File \"/var/lang/lib/python3.8/socket.py\", line 669, in readinto\n    return self._sock.recv_into(b)\n"
    "  File \"/var/lang/lib/python3.8/ssl.py\", line 1241, in recv_into\n    return self.read(nbytes, buffer)\n"
    "  File \"/var/lang/lib/python3.8/ssl.py\", line 1099, in read\n    return self._sslobj.read(len, buffer)\n"
  ]

I'm using Python 3.8, and I found this issue affecting Python 3.8/3.9, which might be the cause: https://bugs.python.org/issue42853

Is there any workaround?

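As mentioned in the bug you linked to, the core issue in Python 3.8 is a bug when reading too much data in a single call over SSL: the requested length overflows a signed 32-bit integer (roughly 2 GB), which is what the error message refers to. You can use a variant of the workaround suggested in the bug report and read the file in chunks.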

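import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    response = s3.get_object(
        Bucket="-example-bucket-",
        Key="path/to/key.dat"
    )
    # Preallocate a buffer for the whole object, then fill it chunk by chunk
    # so that no single read() call asks for more than 64 MiB.
    buf = bytearray(response['ContentLength'])
    view = memoryview(buf)
    pos = 0
    while True:
        chunk = response['Body'].read(67108864)  # 64 MiB per read
        if len(chunk) == 0:
            break  # end of stream
        view[pos:pos+len(chunk)] = chunk
        pos += len(chunk)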
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
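
The same loop can be pulled out into a small helper that works on any object with a `read(n)` method, which makes it easy to try without S3. This is only a sketch: `read_in_chunks` and its parameters are hypothetical names, not part of boto3, and `io.BytesIO` stands in for `response['Body']` here.

```python
import io

def read_in_chunks(stream, total_size, chunk_size=64 * 1024 * 1024):
    """Read total_size bytes from stream, at most chunk_size per read() call.

    Hypothetical helper for illustration; not part of boto3 or botocore.
    """
    buf = bytearray(total_size)
    view = memoryview(buf)
    pos = 0
    while pos < total_size:
        # Never request more than chunk_size bytes in one call.
        chunk = stream.read(min(chunk_size, total_size - pos))
        if not chunk:
            break  # stream ended before total_size bytes arrived
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    if pos != total_size:
        raise IOError(f"expected {total_size} bytes, got {pos}")
    return buf

# Stand-in data; with boto3 you would pass response['Body'] and
# response['ContentLength'] instead.
data = bytes(range(256)) * 40
result = read_in_chunks(io.BytesIO(data), len(data), chunk_size=1000)
```

Checking the byte count at the end is a design choice I added: it turns a truncated download into an explicit error instead of a silently zero-padded buffer.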

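However, even in the best case, you will spend a minute or more of each Lambda run just reading from S3. It would be much better if you could store the file in EFS and read it from there in the Lambda, or use another solution such as ECS to avoid reading from a remote data source.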
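
If you go the EFS route, the read becomes an ordinary local file read from the mount path. A minimal sketch, assuming a hypothetical mount path such as `/mnt/data/key.dat` (Lambda requires EFS local mount paths to start with `/mnt/`); `read_local_file` is an illustrative name, and the demo uses a temporary file so it can run anywhere.

```python
import os
import tempfile

def read_local_file(path, chunk_size=64 * 1024 * 1024):
    """Read a local file into a preallocated buffer, chunk_size bytes at a time.

    Hypothetical helper; in Lambda, path would point into the EFS mount,
    e.g. "/mnt/data/key.dat" (an assumed path, not from the question).
    """
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)
    pos = 0
    with open(path, "rb") as f:
        while pos < size:
            # readinto() fills the slice in place and returns the byte count.
            n = f.readinto(view[pos:pos + chunk_size])
            if not n:
                break  # file was shorter than expected
            pos += n
    return buf

# Demo with a temporary file standing in for the EFS-mounted file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc" * 1000)
    demo_path = tmp.name
contents = read_local_file(demo_path, chunk_size=1024)
os.remove(demo_path)
```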