Why is Spark not auto-detecting new fields in my Parquet files?
In the excerpt from the Databricks blog below, it is claimed that, as of Spark 1.3, if new fields are added to a Parquet schema over time, they will be automatically detected and handled (I assume by filling in NULL for the time periods before the field was introduced into the Parquet files).
This isn't working for me. For example, if I read in all months of data with this command:
df=spark.read.parquet('/mnt/waldo/mixpanel/formatted/parquet/')
and then try to query one of the fields newly added as of August, it is not found.
But if I read in just that month's data:
df=spark.read.parquet('/mnt/waldo/mixpanel/formatted/parquet/eventmonth=2018-08-01')
then that field is there and available to query.
Any idea what I'm doing wrong? Thanks!
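For reference, this is roughly how the difference shows up (a minimal sketch; printSchema() just prints the schema Spark inferred for each read):
df_all = spark.read.parquet('/mnt/waldo/mixpanel/formatted/parquet/')
df_all.printSchema()   # the newly added field does not appear here
df_aug = spark.read.parquet('/mnt/waldo/mixpanel/formatted/parquet/eventmonth=2018-08-01')
df_aug.printSchema()   # the newly added field appears here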
In the Apache Spark 1.3 release we added two major features to this source. First, organizations that store lots of data in parquet often find themselves evolving the schema over time by adding or removing columns. With this release we add a new feature that will scan the metadata for all files, merging the schemas to come up with a unified representation of the data. This functionality allows developers to read data where the schema has changed overtime, without the need to perform expensive manual conversions.
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
When reading Parquet files, you have to explicitly ask for schemas to be merged when you need it; otherwise, as a speed optimization, Spark only reads the schema of the first partition it encounters and assumes all the other partitions match it.
Use:
df=spark.read.option("mergeSchema","true").parquet('/mnt/waldo/mixpanel/formatted/parquet/')
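Alternatively (a sketch of the same fix, if you would rather enable it session-wide instead of per read), the spark.sql.parquet.mergeSchema setting turns schema merging on for all Parquet reads in the session; the per-read mergeSchema option still takes precedence when both are set:
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
df = spark.read.parquet('/mnt/waldo/mixpanel/formatted/parquet/')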