Combining date / time string columns in pyspark to get one datetime column
I need to subtract the datetimes to get an elapsed-time column. I was able to combine the separate date and time columns into two combined columns called pickup and dropoff (a sketch of that combining step follows the sample data below), but I haven't been able to get these columns into a datetime type. Below, 'pickup' and 'dropoff' are strings. Is there a way to convert these columns to a datetime type?
I've been struggling because the times don't include am/pm. The PySpark dataframe looks like this. Thanks!
df.show()
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|pickup_date|pickup_time|dropoff_date|dropoff_time|distance| tip| fare|       pickup|      dropoff|
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|   1/1/2017|       0:00|    1/1/2017|        0:00|    0.02|   0| 52.8|1/1/2017 0:00|1/1/2017 0:00|
|   1/1/2017|       0:00|    1/1/2017|        0:03|     0.5|   0|  5.3|1/1/2017 0:00|1/1/2017 0:03|
|   1/1/2017|       0:00|    1/1/2017|        0:39|    7.75|4.66|27.96|1/1/2017 0:00|1/1/2017 0:39|
|   1/1/2017|       0:00|    1/1/2017|        0:06|     0.8|1.45| 8.75|1/1/2017 0:00|1/1/2017 0:06|
|   1/1/2017|       0:00|    1/1/2017|        0:08|     0.9|   0|  8.3|1/1/2017 0:00|1/1/2017 0:08|
|   1/1/2017|       0:00|    1/1/2017|        0:05|    1.76|   0|  8.3|1/1/2017 0:00|1/1/2017 0:05|
|   1/1/2017|       0:00|    1/1/2017|        0:15|    8.47|7.71|38.55|1/1/2017 0:00|1/1/2017 0:15|
|   1/1/2017|       0:00|    1/1/2017|        0:11|     2.4|   0| 11.8|1/1/2017 0:00|1/1/2017 0:11|
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
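For reference, a minimal sketch of that combining step, assuming the column names shown above: concat_ws joins the date and time strings, and to_timestamp with the H (24-hour hour-of-day) pattern letter parses times like 0:00 without an am/pm marker. The d/M/yyyy pattern matches the answer below; for US-style dates M/d/yyyy may be intended, but the two are indistinguishable on 1/1/2017.

from pyspark.sql.functions import col, concat_ws, to_timestamp

# join the date and time strings with a space, e.g. "1/1/2017" + "0:00" -> "1/1/2017 0:00"
df = df.withColumn("pickup", concat_ws(" ", col("pickup_date"), col("pickup_time")))
df = df.withColumn("dropoff", concat_ws(" ", col("dropoff_date"), col("dropoff_time")))

# "H" is the 24-hour hour pattern, so no am/pm marker is required
df = df.withColumn("pickup", to_timestamp(col("pickup"), "d/M/yyyy H:mm"))
df = df.withColumn("dropoff", to_timestamp(col("dropoff"), "d/M/yyyy H:mm"))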
Convert the string timestamps to the timestamp data type and subtract them.
Scala:
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // needed for toDF outside the spark-shell

val data = Seq(("1/1/2017 0:00", "1/1/2017 0:35"))
val df = data.toDF("pickup_dt", "drop_dt")

df
  .withColumn("pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm"))
  .withColumn("drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm"))
  // a timestamp cast to long is seconds since the epoch, so the difference / 60 is minutes
  .withColumn("diff", (col("drop_dt").cast(LongType) - col("pickup_dt").cast(LongType)) / 60)
  .show(false)
Output:
+-------------------+-------------------+----+
|pickup_dt          |drop_dt            |diff|
+-------------------+-------------------+----+
|2017-01-01 00:00:00|2017-01-01 00:35:00|35.0|
+-------------------+-------------------+----+
PySpark:
from pyspark.sql.functions import col, to_timestamp
df.withColumn(
    "pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm")
).withColumn(
    "drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm")
).withColumn(
    "diff",
    (col("drop_dt").cast("long") - col("pickup_dt").cast("long")) / 60.0  # seconds -> minutes
).show(truncate=False)
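Run on the same sample data, the PySpark version produces the same diff column as the Scala output above. The cast works because a timestamp cast to an integral type yields seconds since the Unix epoch; divide the difference by 3600 instead of 60 to get hours.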