How can I load a Snowflake table using PySpark so that the date column of my DataFrame is reflected as TIMESTAMP_LTZ?

Question

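I want to write a DataFrame to a Snowflake table. The table does not already exist in Snowflake, and the timestamp columns in my DataFrame should be stored as TIMESTAMP_LTZ in Snowflake.

Note: I don't want to change the timestamp data type to TIMESTAMP_LTZ on the Snowflake side; I want everything to happen in my Spark code itself.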

Edit:

The behaviour I'm seeing is that the Snowflake table has a datatype of TIMESTAMP_NTZ.

Answer 1

The behaviour I'm seeing is that the snowflake table has a datatype of TIMESTAMP_NTZ

This follows the default behaviour described in Snowflake's Spark Connector documentation:

"The default timestamp data type mapping is TIMESTAMP_NTZ (no time zone), so you must explicitly set the TIMESTAMP_TYPE_MAPPING parameter to use TIMESTAMP_LTZ."

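The Spark Connector maps the TIMESTAMP data type to TIMESTAMP_LTZ instead of TIMESTAMP_NTZ as the underlying type if TIMESTAMP_TYPE_MAPPING has been explicitly set as a session-level parameter before the CREATE/INSERT operations run.

Session-level parameters can be expressed in Spark code and do not require a permanent change to any setting on the account. Simply add the parameter as an option to the options map you pass when interacting with Snowflake from your Spark code. Here is a simple example: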

sfOptions += ("TIMESTAMP_TYPE_MAPPING" -> "TIMESTAMP_LTZ")
// Pass this adjusted sfOptions to the .options(…) when writing the DataFrame
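
Since the question asks about PySpark, the same idea might look roughly like the sketch below. It is not taken from the connector documentation: the helper names `with_ltz_timestamp_mapping` and `write_dataframe`, the placeholder connection values, and the choice of `append` mode are all assumptions made for illustration. Only the `TIMESTAMP_TYPE_MAPPING` option comes from the answer above.

```python
def with_ltz_timestamp_mapping(sf_options):
    """Return a copy of the connector options with TIMESTAMP_TYPE_MAPPING added.

    The input dict is left untouched so it can be reused for other writes.
    """
    options = dict(sf_options)
    options["TIMESTAMP_TYPE_MAPPING"] = "TIMESTAMP_LTZ"
    return options


def write_dataframe(df, sf_options, table_name, mode="append"):
    """Write a Spark DataFrame to Snowflake with the adjusted options.

    `df` is assumed to be a pyspark.sql.DataFrame and the Snowflake Spark
    Connector is assumed to be on the classpath. The save mode is an
    assumption; pick whichever mode suits your table.
    """
    (
        df.write.format("net.snowflake.spark.snowflake")
        .options(**with_ltz_timestamp_mapping(sf_options))
        .option("dbtable", table_name)
        .mode(mode)
        .save()
    )


# Placeholder connection settings (hypothetical values, replace with your own).
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Example call, assuming `df` already exists and MY_TABLE is a hypothetical name:
# write_dataframe(df, sf_options, "MY_TABLE")
```

Because the parameter travels with the options of each write, nothing has to be altered on the Snowflake account itself.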

Answer 2

For me, the issue was resolved by adding the following before the Snowflake read operation:

java.util.TimeZone.setDefault(java.util.TimeZone.getTimeZone("UTC"))

The reason has already been explained by @Harish J, and the same is also mentioned in the Snowflake documentation: https://docs.snowflake.com/en/user-guide/spark-connector-use.html
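
From PySpark, the `java.util.TimeZone` call above is not directly available, so a rough equivalent could go through the py4j gateway. The sketch below is an assumption rather than something stated in the answer: the helper names are hypothetical, `_jvm` is an internal PySpark attribute, and the call only affects the driver JVM, which is why the executor and session settings are listed separately.

```python
def utc_timezone_conf():
    """Spark settings intended to keep the session and executor JVMs on UTC.

    These would need to be supplied when the SparkSession is created,
    since executor JVM options cannot be changed afterwards.
    """
    return {
        "spark.sql.session.timeZone": "UTC",
        "spark.executor.extraJavaOptions": "-Duser.timezone=UTC",
    }


def set_driver_jvm_timezone_utc(spark):
    """Call java.util.TimeZone.setDefault on the driver JVM through py4j.

    `spark` is assumed to be an active SparkSession. `_jvm` is internal to
    PySpark, so treat this as a workaround rather than a stable API.
    """
    jvm = spark.sparkContext._jvm
    jvm.java.util.TimeZone.setDefault(jvm.java.util.TimeZone.getTimeZone("UTC"))


# Example usage, assuming pyspark is installed and a session can be built:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for key, value in utc_timezone_conf().items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
# set_driver_jvm_timezone_utc(spark)
```

Whether this is sufficient depends on the cluster setup, so it is worth checking the resulting column values after a test read.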


apache-spark

apache-spark-sql

pyspark

snowflake-cloud-data-platform

azure-databricks