Reading JSON files from S3 into Glue PySpark with glueContext.read.json gives wrong result

Question

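Does anyone know why glueContext.read.json gives me a wrong result? The following two methods give me very different results: after exploding, df2 has far fewer records than df1. Has anyone had the same experience? Thanks!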

df1 = glueContext.create_dynamic_frame_from_options("s3", format="json", connection_options = {"paths": ["s3://.../"]})


df2 = glueContext.read.json("s3://.../",multiLine=True)

Answer 1

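Generally, glueContext.create_dynamic_frame_from_options is used to read files in groups from a source location (large files), so by default it considers all partitions of the files. The syntax is: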

df = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://s3path/"], 'recurse':True, 'groupFiles': 'inPartition', 'groupSize': '1048576'}, format="json")

Here groupSize is customizable (the value is in bytes, so '1048576' is 1 MiB); change it to suit your needs.
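To get a feel for what a byte-based group size means, here is a minimal plain-Python sketch that packs files into groups of at most groupSize bytes. It is only an illustration: group_files_by_size and the file names and sizes are made up for this example, and the greedy packing is an assumption, not Glue's actual grouping algorithm.

```python
def group_files_by_size(file_sizes, group_size):
    """Greedily pack (name, size_in_bytes) pairs into groups of at most group_size bytes.

    A single file larger than group_size ends up in a group of its own.
    Hypothetical helper for illustration; not part of the Glue API.
    """
    groups, current, current_bytes = [], [], 0
    for name, size in file_sizes:
        # Close the current group if adding this file would exceed the target size.
        if current and current_bytes + size > group_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Made-up file sizes, grouped with the 1 MiB target used in the answer.
files = [("a.json", 400_000), ("b.json", 500_000), ("c.json", 300_000), ("d.json", 2_000_000)]
groups = group_files_by_size(files, 1_048_576)
print(groups)
```

A smaller groupSize gives more, smaller groups; a larger one gives fewer, bigger groups.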

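Meanwhile, glueContext.read.json is generally used to read a specific file at a location.

So in your case, glueContext.read.json may be missing some partitions of the data while reading. That is why the two DataFrames differ in size and row count.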
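The effect of a reader that skips part of the input can be shown with a local-filesystem analogy. The sketch below is plain Python, not Glue or Spark code: list_json_files and the folder layout are hypothetical, and it does not model how either reader actually lists S3 objects. It only shows how a non-recursive listing sees fewer files than a recursive one, which is the kind of gap the 'recurse': True option is meant to close.

```python
import os
import tempfile


def list_json_files(root, recurse):
    """Return sorted relative paths of .json files under root.

    Subfolders are visited only when recurse is True.
    Hypothetical helper for illustration only.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                found.append(os.path.relpath(os.path.join(dirpath, name), root))
        if not recurse:
            dirnames.clear()  # stop os.walk from descending into subfolders
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    # Made-up layout: one file at the top level, one in a nested folder.
    os.makedirs(os.path.join(root, "nested"))
    for rel in ["top.json", os.path.join("nested", "part-0.json")]:
        with open(os.path.join(root, rel), "w") as f:
            f.write("{}")

    top_only = list_json_files(root, recurse=False)
    everything = list_json_files(root, recurse=True)

print(len(top_only), len(everything))
```

If the two readers in the question disagree on row counts, comparing which input files each one actually read is a reasonable first check.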


json

amazon-web-services

pyspark

aws-glue