What is the best way to read a csv and text file from S3 on AWS Glue without having to read it as a dynamic dataframe?

Question

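I am trying to read a csv file that is in an S3 bucket. I want to do some manipulations on it, then finally convert it to a dynamic dataframe and write it back to S3.

This is what I have tried so far:

Pure Python: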

     import csv

     Val1 = ""
     Val2 = ""
     cols = []
     width = []
     with open('s3://demo-ETL/read/data.csv') as csvfile:
         readCSV = csv.reader(csvfile, delimiter=',')
         for row in readCSV:
             print(row)
             if Val1 == "" and Val2 == "":
                 Val1 = row[0]
                 Val2 = row[0]
                 cols.append(row[1])
                 width.append(int(row[4]))
             else:
                 ...  # continues...

Here I get an error saying it cannot find the file in the directory at all.

Boto3:

     import boto3

     s3 = boto3.client('s3')
     data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
     contents = data['Body'].read()
     print(contents)
     for row in contents:
         if Val1 == "" and Val2 == "":
             Val1 = row[0]
             Val2 = row[0]
             cols.append(row[1])
             width.append(int(row[4]))
         else:
             ...  # continues...

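Here it says the index is out of range, which is odd because I have 4 comma-separated values in the csv file. When I look at the output of print(contents), I see that it puts each character in a list, rather than each comma-separated value.

Is there a better way to read the csv from S3?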

Answer 1

get_object returns a response whose Body value is of type StreamingBody. Per the docs, if you want to go line by line, you probably want to use iter_lines.

For example:

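import boto3

s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
file_lines = data['Body'].iter_lines()
# iter_lines() returns an iterator of bytes, so loop over it and decode each line
for line in file_lines:
    print(line.decode('utf-8'))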

This is probably closer to what you want.
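
To get each comma-separated value rather than each character, the decoded lines can be handed to csv.reader. The following is only a minimal sketch under some assumptions: the helper names rows_from_byte_lines, rows_from_bytes and read_csv_rows_from_s3 are made up for illustration, the file is assumed to be UTF-8, and because iter_lines() strips line endings, the line-based helper would not handle quoted fields that contain embedded newlines.

```python
import csv
import io


def rows_from_byte_lines(byte_lines, encoding="utf-8"):
    """Parse an iterable of byte lines (e.g. from StreamingBody.iter_lines()) into CSV rows."""
    text_lines = (line.decode(encoding) for line in byte_lines)
    return list(csv.reader(text_lines, delimiter=","))


def rows_from_bytes(contents, encoding="utf-8"):
    """Parse the full bytes payload (e.g. from StreamingBody.read()) into CSV rows."""
    text = contents.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=","))


def read_csv_rows_from_s3(bucket, key):
    """Fetch an object from S3 and return its rows (hypothetical helper)."""
    # Imported here so the two parsing helpers above work without boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    return rows_from_byte_lines(body.iter_lines())
```

With the bucket and key from the question, the call would look like read_csv_rows_from_s3('demo-ETL', 'read/data.csv'). Looping directly over the result of read() walks through it one byte or character at a time, which would explain why row[4] was out of range.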

Answer 2

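I ended up solving this by reading it as a pandas dataframe. I first created an object with boto3, then read the whole object into a pandas dataframe, and then converted that to a list.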

       import boto3
       import pandas as pd

       s3 = boto3.resource('s3')
       bucket = s3.Bucket('demo-ETL')
       obj = bucket.Object(key='read/data.csv')
       dataFrame = pd.read_csv(obj.get()['Body'])
       l = dataFrame.values.tolist()
       for i in l:
           print(i)

Answer 3

You can read the file with Spark like this:

df = spark.read.\
           format("csv").\
           option("header", "true").\
           load("s3://bucket-name/file-name.csv")

You can find more options here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
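
Since the question ends with converting to a dynamic dataframe and writing back to S3, here is a rough sketch of that last step starting from a Spark DataFrame. It assumes the code runs inside a Glue job, where the awsglue module is available. The function name, the frame name "converted", and both path arguments are placeholders chosen for illustration, not anything from the original post.

```python
def csv_to_dynamic_frame_and_back(input_path, output_path):
    """Read a CSV with Spark, convert it to a DynamicFrame, and write it to S3 as CSV."""
    # Imported inside the function because awsglue only exists in the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Plain Spark read, same options as in the answer above.
    df = spark.read.format("csv").option("header", "true").load(input_path)

    # Any DataFrame manipulations would go here before the conversion.

    # Convert the Spark DataFrame to a Glue DynamicFrame ("converted" is an arbitrary name).
    dyf = DynamicFrame.fromDF(df, glue_context, "converted")

    # Write the DynamicFrame back to S3 in CSV format.
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": output_path},
        format="csv",
    )
    return dyf
```

This keeps all the manipulation in ordinary Spark and only touches the DynamicFrame API at the very end, which seems to be what the question is after.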

amazon-s3

amazon-web-services

boto3

aws-glue