How to convert int64 datatype columns of parquet file to timestamp in SparkSQL data frame?
My DataFrame looks like this:
+----------------+-------------+
| Business_Date| Code|
+----------------+-------------+
|1539129600000000| BSD|
|1539129600000000| BTN|
|1539129600000000| BVI|
|1539129600000000| BWP|
|1539129600000000| BYB|
+----------------+-------------+
I want to convert the Business_Date column from bigint to timestamp while loading the data into a Hive table.
How can I do this?
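You can use pyspark.sql.functions.from_unixtime(), which, according to its documentation:
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
Your Business_Date values appear to be in microseconds, so they need to be divided by 1,000,000 (1M) to convert them to seconds.
For example: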
from pyspark.sql.functions import from_unixtime, col
df = df.withColumn(
    "Business_Date",
    from_unixtime(col("Business_Date")/1000000).cast("timestamp")
)
df.show()
#+---------------------+----+
#|Business_Date |Code|
#+---------------------+----+
#|2018-10-09 20:00:00.0|BSD |
#|2018-10-09 20:00:00.0|BTN |
#|2018-10-09 20:00:00.0|BVI |
#|2018-10-09 20:00:00.0|BWP |
#|2018-10-09 20:00:00.0|BYB |
#+---------------------+----+
from_unixtime returns a string, which is why the result is cast to timestamp.
The new schema is now:
df.printSchema()
#root
# |-- Business_Date: timestamp (nullable = true)
# |-- Code: string (nullable = true)
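As a quick sanity check of the "divide by 1M" step, the sketch below converts one of the sample values with the Python standard library only, with no Spark involved. The helper name micros_to_utc is hypothetical and not part of PySpark; it assumes the column holds microseconds since the Unix epoch, which is what the magnitude of the values suggests.

```python
from datetime import datetime, timezone


def micros_to_utc(micros: int) -> datetime:
    """Convert microseconds since the Unix epoch to a timezone-aware UTC datetime.

    Hypothetical helper for illustration only; it is not a PySpark function.
    """
    # Split into whole seconds and the leftover microseconds so that no
    # precision is lost to floating-point division.
    seconds, remainder = divmod(micros, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=remainder)


sample = 1539129600000000  # value taken from the Business_Date column above
print(micros_to_utc(sample).isoformat())  # 2018-10-10T00:00:00+00:00
```

The value is midnight UTC on 2018-10-10. The 2018-10-09 20:00:00 shown by df.show() above is the same instant rendered in a time zone four hours behind UTC, since from_unixtime formats its result in the current time zone.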