PySpark won't convert timestamp
I have a very simple CSV; call it test.csv:
name,timestamp,action
A,2012-10-12 00:30:00.0000000,1
B,2012-10-12 01:00:00.0000000,2
C,2012-10-12 01:30:00.0000000,2
D,2012-10-12 02:00:00.0000000,3
E,2012-10-12 02:30:00.0000000,1
I'm trying to read it with PySpark and add a new column indicating the month.
First I read in the data, and everything works fine.
df = spark.read.csv('test.csv', inferSchema=True, header=True)
df.printSchema()
df.show()
Output:
root
|-- name: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- action: double (nullable = true)
+----+-------------------+------+
|name| timestamp|action|
+----+-------------------+------+
| A|2012-10-12 00:30:00| 1.0|
| B|2012-10-12 01:00:00| 2.0|
| C|2012-10-12 01:30:00| 2.0|
| D|2012-10-12 02:00:00| 3.0|
| E|2012-10-12 02:30:00| 1.0|
+----+-------------------+------+
But when I try to add my column, the format option doesn't seem to do anything.
from pyspark.sql.functions import col, to_date
df.withColumn('month', to_date(col('timestamp'), format='MMM')).show()
Output:
+----+-------------------+------+----------+
|name| timestamp|action| month|
+----+-------------------+------+----------+
| A|2012-10-12 00:30:00| 1.0|2012-10-12|
| B|2012-10-12 01:00:00| 2.0|2012-10-12|
| C|2012-10-12 01:30:00| 2.0|2012-10-12|
| D|2012-10-12 02:00:00| 3.0|2012-10-12|
| E|2012-10-12 02:30:00| 1.0|2012-10-12|
+----+-------------------+------+----------+
What's going on?
to_date and its format argument are for parsing string-typed columns into dates. What you need here is date_format:
from pyspark.sql.functions import col, date_format
df.withColumn('month', date_format(col('timestamp'), format='MMM')).show()
# +----+-------------------+------+-----+
# |name| timestamp|action|month|
# +----+-------------------+------+-----+
# | A|2012-10-12 00:30:00| 1.0| Oct|
# | B|2012-10-12 01:00:00| 2.0| Oct|
# | C|2012-10-12 01:30:00| 2.0| Oct|
# | D|2012-10-12 02:00:00| 3.0| Oct|
# | E|2012-10-12 02:30:00| 1.0| Oct|
# +----+-------------------+------+-----+
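To make the contrast concrete, here is a minimal sketch (the one-row DataFrame is a made-up example, not from the question): to_date uses the format to parse a string into a date, whereas date_format renders an existing date/timestamp as a string. And if a numeric month is enough, the built-in month() function works directly on timestamp columns.
from pyspark.sql.functions import col, month, to_date
# to_date parses *string* columns; the format describes the input pattern
raw = spark.createDataFrame([('12/Oct/2012',)], ['raw'])
raw.select(to_date(col('raw'), 'dd/MMM/yyyy').alias('parsed')).show()
# +----------+
# |    parsed|
# +----------+
# |2012-10-12|
# +----------+
# month() extracts the month number straight from a timestamp column
df.withColumn('month', month(col('timestamp'))).show()
# +----+-------------------+------+-----+
# |name|          timestamp|action|month|
# +----+-------------------+------+-----+
# |   A|2012-10-12 00:30:00|   1.0|   10|
# |   B|2012-10-12 01:00:00|   2.0|   10|
# |   C|2012-10-12 01:30:00|   2.0|   10|
# |   D|2012-10-12 02:00:00|   3.0|   10|
# |   E|2012-10-12 02:30:00|   1.0|   10|
# +----+-------------------+------+-----+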