How to save a Spark DataFrame to Parquet without using the INT96 format for timestamp columns?
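I have a Spark DataFrame that I want to save as Parquet and then load using the parquet-avro library.
My DataFrame has a timestamp column, which is written as an INT96 timestamp column in Parquet. However, parquet-avro does not support the INT96 format and throws an exception.
Is there a way to avoid this? Is it possible to change the format Spark uses when writing timestamps to Parquet to something Avro supports?
I currently use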
data_frame.write.parquet("path")
Reading the Spark code, I found the spark.sql.parquet.outputTimestampType property:
spark.sql.parquet.outputTimestampType :
Sets which Parquet timestamp type to use when Spark writes data to Parquet files.
INT96 is a non-standard but commonly used timestamp type in Parquet.
TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch.
TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.
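To make the difference between the three types concrete, here is a minimal pure-Python sketch of how one timestamp could be encoded under each of them. The helper names are hypothetical, and the INT96 layout used here (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, little-endian) is an assumption based on how this legacy type is commonly described, not something stated in the property description above.

```python
import struct
from datetime import datetime, timezone

# Julian day number of 1970-01-01 (the Unix epoch).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
MICROS_PER_DAY = 86_400 * 1_000_000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_micros(ts):
    """TIMESTAMP_MICROS: a single integer, microseconds since the Unix epoch."""
    delta = ts - UNIX_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def to_timestamp_millis(ts):
    """TIMESTAMP_MILLIS: same idea, but the sub-millisecond part is dropped."""
    return to_timestamp_micros(ts) // 1_000


def to_int96(ts):
    """INT96 (assumed layout): nanoseconds of day (int64) + Julian day (int32)."""
    days, micros_of_day = divmod(to_timestamp_micros(ts), MICROS_PER_DAY)
    return struct.pack("<qi", micros_of_day * 1_000, days + JULIAN_DAY_OF_UNIX_EPOCH)


if __name__ == "__main__":
    ts = datetime(2020, 1, 1, 12, 0, 0, 123456, tzinfo=timezone.utc)
    print(to_timestamp_micros(ts))
    print(to_timestamp_millis(ts))
    print(to_int96(ts).hex())
```

The two standard types are plain 64-bit integers, while INT96 is a 12-byte composite value, which is probably why readers that only understand the standard timestamp types reject it.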
So I can do the following:
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
data_frame.write.parquet("path")
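Because spark.conf.set changes the setting for the whole session, it may be worth scoping the change to a single write. The following is a rough sketch under that assumption: write_parquet_with_timestamp_type is a hypothetical helper (not part of Spark), and it expects an existing SparkSession and DataFrame to be passed in. It only uses spark.conf.get, spark.conf.set and data_frame.write.parquet.

```python
TIMESTAMP_TYPE_KEY = "spark.sql.parquet.outputTimestampType"
# The three values listed in the property description.
ALLOWED_TIMESTAMP_TYPES = {"INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS"}


def write_parquet_with_timestamp_type(spark, data_frame, path,
                                      timestamp_type="TIMESTAMP_MICROS"):
    """Write data_frame to path as Parquet using the given timestamp type.

    Hypothetical helper: the session setting is changed only for the
    duration of the write and then set back to its previous value.
    """
    if timestamp_type not in ALLOWED_TIMESTAMP_TYPES:
        raise ValueError(f"unsupported timestamp type: {timestamp_type!r}")

    # Remember the current value so it can be restored afterwards.
    previous = spark.conf.get(TIMESTAMP_TYPE_KEY)
    spark.conf.set(TIMESTAMP_TYPE_KEY, timestamp_type)
    try:
        data_frame.write.parquet(path)
    finally:
        spark.conf.set(TIMESTAMP_TYPE_KEY, previous)
```

Usage would look like write_parquet_with_timestamp_type(spark, data_frame, "path"). If millisecond precision is enough for the consumer, passing "TIMESTAMP_MILLIS" should also work, at the cost of truncating the microsecond portion as the description notes.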