How do I read a CSV file from S3 row by row in an AWS Glue job

Question

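Hi everyone, I am new to AWS.

I am trying to retrieve a 5 GB CSV file stored in an S3 bucket, perform ETL on it, and load it into a DynamoDB table using AWS Glue. My Glue job is a pure Python shell job; it does not use Spark.

My problem occurs when I try to retrieve the file: I get a file-not-found exception. Here is my code: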

import boto3
import logging
import csv
import s3fs

from boto3 import client
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

csv_file_path = 's3://my_s3_bucket/mycsv_file.csv'

A few lines further down, inside my class:

with open(self.csv_file_path, "r") as input:
    csv_reader = csv.reader(input, delimiter='^', quoting=csv.QUOTE_NONE)

    for row in csv_reader:
        ...  # row processing omitted in the original post

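The open call in the with statement cannot find the file, even though it is there. I really don't want to use pandas; we have had problems using pandas in Glue. Since this is a 5 GB file, I can't hold it in memory, which is why I am trying to open it and read it row by row.

I would really appreciate help with this.

I also have the correct IAM permissions set up for Glue and everything.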

Answer 1

I figured it out.

You have to use boto3's S3 client, because the built-in open() does not understand s3:// paths:

import boto3
import csv

s3 = boto3.client('s3')
file = s3.get_object(Bucket='bucket_name', Key='file_name')
# Note: read() pulls the entire object into memory before it is split into lines
lines = file['Body'].read().decode('utf-8').splitlines(True)
csv_reader = csv.reader(lines, delimiter=',', quoting=csv.QUOTE_NONE)

Then write a for loop over the CSV reader.
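Since the question mentions that the 5 GB file cannot be held in memory, here is a minimal sketch of a streaming variant that decodes the response body incrementally instead of calling read() on all of it. The helper names iter_csv_rows and iter_s3_csv_rows are hypothetical, the bucket and key are placeholders, and the '^' delimiter is taken from the question's code. It assumes the body returned by get_object behaves like a binary file object with a read(size) method.

```python
import codecs
import csv
import io


def iter_csv_rows(body, delimiter="^", encoding="utf-8"):
    """Yield CSV rows from a binary file-like object, decoding it incrementally."""
    # codecs.getreader wraps the byte stream in a text reader that pulls
    # data in chunks, so the whole object is never held in memory at once.
    text_stream = codecs.getreader(encoding)(body)
    reader = csv.reader(text_stream, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def iter_s3_csv_rows(bucket, key, delimiter="^"):
    """Hypothetical wrapper: stream rows of an S3 object (bucket/key are placeholders)."""
    import boto3  # imported here so iter_csv_rows can be tried without AWS access

    s3 = boto3.client("s3")
    response = s3.get_object(Bucket=bucket, Key=key)
    yield from iter_csv_rows(response["Body"], delimiter=delimiter)


# Local check with an in-memory stand-in for the S3 body
sample = io.BytesIO(b"id^name\n1^alice\n2^bob\n")
rows = list(iter_csv_rows(sample))
```

In the Glue job, the loop would then look like `for row in iter_s3_csv_rows('bucket_name', 'file_name'):`, with each row handled as it arrives rather than after the whole file has been downloaded.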
