How to convert int64 datatype columns of parquet file to timestamp in SparkSQL data frame?

My DataFrame looks like this:

+----------------+-------------+
|   Business_Date|         Code|
+----------------+-------------+
|1539129600000000|          BSD|
|1539129600000000|          BTN|
|1539129600000000|          BVI|
|1539129600000000|          BWP|
|1539129600000000|          BYB|
+----------------+-------------+

I want to convert the Business_Date column from bigint to a timestamp value when loading the data into a Hive table.

How can I do this?

You can use pyspark.sql.functions.from_unixtime(), which:

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

It looks like your Business_Date is in microseconds, so it needs to be divided by 1M (1,000,000) to convert it to seconds.

For example:

from pyspark.sql.functions import from_unixtime, col

# Divide by 1,000,000 to convert microseconds to seconds, then
# cast the string returned by from_unixtime to a timestamp.
df = df.withColumn(
    "Business_Date",
    from_unixtime(col("Business_Date")/1000000).cast("timestamp")
)
df.show()
#+---------------------+----+
#|Business_Date        |Code|
#+---------------------+----+
#|2018-10-09 20:00:00.0|BSD |
#|2018-10-09 20:00:00.0|BTN |
#|2018-10-09 20:00:00.0|BVI |
#|2018-10-09 20:00:00.0|BWP |
#|2018-10-09 20:00:00.0|BYB |
#+---------------------+----+

from_unixtime returns a string, which is why the result is cast to timestamp.
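
If you prefer to avoid the intermediate string entirely, an alternative sketch is to cast the numeric seconds value straight to timestamp, since Spark interprets a number cast to timestamp as seconds since the epoch; this also keeps any fractional seconds that from_unixtime's string format would drop:

from pyspark.sql.functions import col

# Convert microseconds to seconds (a double, keeping the fraction)
# and cast the number directly to a timestamp.
df = df.withColumn(
    "Business_Date",
    (col("Business_Date") / 1000000).cast("timestamp")
)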

And the new schema:

df.printSchema()
#root
# |-- Business_Date: timestamp (nullable = true)
# |-- Code: string (nullable = true)
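
Writing the converted DataFrame into Hive is then a separate step. A minimal sketch, assuming a Hive-enabled SparkSession and a hypothetical table name my_db.business_codes:

# Hypothetical target table; replace with your actual Hive table.
df.write.mode("overwrite").saveAsTable("my_db.business_codes")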