How do I truncate a PySpark dataframe of timestamp type to the day?
I have a PySpark dataframe with a timestamp column (called 'dt') that looks like this:
2018-04-07 16:46:00
2018-03-06 22:18:00
When I execute:
SELECT trunc(dt, 'day') as day
...I expect:
2018-04-07 00:00:00
2018-03-06 00:00:00
But I get:
null
null
How do I truncate to the day instead of the hour?
A simple string-manipulation approach:
from pyspark.sql.functions import lit, concat
# Keep the first 10 characters ("YYYY-MM-DD") and append midnight.
# Note: Column.substr is 1-based, and the result is a string, not a timestamp.
df = df.withColumn('day', concat(df.dt.substr(1, 10), lit(' 00:00:00')))
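To see what that slice-and-append does, here is the same operation in plain Python (stdlib only, outside of Spark; the variable names are illustrative):

```python
# Keep the first 10 characters ("YYYY-MM-DD") of the timestamp string
# and append a midnight time component.
ts = "2018-04-07 16:46:00"
day = ts[:10] + " 00:00:00"
print(day)  # 2018-04-07 00:00:00
```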
You are using the wrong function. trunc supports only a few formats:
Returns date truncated to the unit specified by the format.
:param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'
Use date_trunc instead, which supports day-level (and finer) truncation:
Returns timestamp truncated to the unit specified by the format.
:param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm',
'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'
Example:
from pyspark.sql.functions import col, date_trunc
df = spark.createDataFrame(["2018-04-07 23:33:21"], "string").toDF("dt").select(col("dt").cast("timestamp"))
df.select(date_trunc("day", "dt")).show()
# +-------------------+
# |date_trunc(day, dt)|
# +-------------------+
# |2018-04-07 00:00:00|
# +-------------------+
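If you want to sanity-check that result outside Spark, a stdlib sketch of what date_trunc("day", ...) does (the function name here is my own, not part of any library):

```python
from datetime import datetime

def truncate_to_day(ts: str) -> str:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string and zero out the time part."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    truncated = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return truncated.strftime("%Y-%m-%d %H:%M:%S")

print(truncate_to_day("2018-04-07 23:33:21"))  # 2018-04-07 00:00:00
```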
For Spark <= 2.2.0, use this:
from pyspark.sql.functions import col, to_date
from pyspark.sql.session import SparkSession
from pyspark.sql.types import TimestampType
spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([['2020-10-03 05:00:00']], schema=['timestamp']) \
.withColumn('timestamp', col('timestamp').astype(TimestampType())) \
.withColumn('date', to_date('timestamp').astype(TimestampType())) \
.show(truncate=False)
+-------------------+-------------------+
|timestamp |date |
+-------------------+-------------------+
|2020-10-03 05:00:00|2020-10-03 00:00:00|
+-------------------+-------------------+
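The to_date/cast round trip works because casting a date back to a timestamp pins the time to midnight. The same idea in plain Python (stdlib only, for illustration):

```python
from datetime import datetime, time

ts = datetime.strptime("2020-10-03 05:00:00", "%Y-%m-%d %H:%M:%S")
# Drop to a date (like to_date), then promote back to a datetime
# (like the cast to TimestampType), which lands on midnight.
day = datetime.combine(ts.date(), time.min)
print(day)  # 2020-10-03 00:00:00
```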
For Spark > 2.2.0, use date_trunc (see the datetime patterns documented for Spark 3.0.0 for the supported format strings):
from pyspark.sql.functions import date_trunc, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import TimestampType
spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([['2020-10-03 05:00:00']], schema=['timestamp']) \
.withColumn('timestamp', col('timestamp').astype(TimestampType())) \
.withColumn('date', date_trunc(timestamp='timestamp', format='day')) \
.show(truncate=False)
+-------------------+-------------------+
|timestamp |date |
+-------------------+-------------------+
|2020-10-03 05:00:00|2020-10-03 00:00:00|
+-------------------+-------------------+