Combining date / time string columns in pyspark to get one datetime column

I need to subtract datetimes to get an elapsed-time column. I was able to combine the separate date and time columns into two combined columns called pickup and dropoff. However, I have not been able to successfully cast these columns to a datetime type. Below, 'pickup' and 'dropoff' are strings. Is there a way to convert these columns to a datetime type?

I have been struggling with this because the times do not include am/pm. The pyspark dataframe is shown below. Thanks!

df.show()

+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|pickup_date|pickup_time|dropoff_date|dropoff_time|distance| tip| fare|       pickup|      dropoff|
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
|   1/1/2017|       0:00|    1/1/2017|        0:00|    0.02|   0| 52.8|1/1/2017 0:00|1/1/2017 0:00|
|   1/1/2017|       0:00|    1/1/2017|        0:03|     0.5|   0|  5.3|1/1/2017 0:00|1/1/2017 0:03|
|   1/1/2017|       0:00|    1/1/2017|        0:39|    7.75|4.66|27.96|1/1/2017 0:00|1/1/2017 0:39|
|   1/1/2017|       0:00|    1/1/2017|        0:06|     0.8|1.45| 8.75|1/1/2017 0:00|1/1/2017 0:06|
|   1/1/2017|       0:00|    1/1/2017|        0:08|     0.9|   0|  8.3|1/1/2017 0:00|1/1/2017 0:08|
|   1/1/2017|       0:00|    1/1/2017|        0:05|    1.76|   0|  8.3|1/1/2017 0:00|1/1/2017 0:05|
|   1/1/2017|       0:00|    1/1/2017|        0:15|    8.47|7.71|38.55|1/1/2017 0:00|1/1/2017 0:15|
|   1/1/2017|       0:00|    1/1/2017|        0:11|     2.4|   0| 11.8|1/1/2017 0:00|1/1/2017 0:11|
+-----------+-----------+------------+------------+--------+----+-----+-------------+-------------+
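
For reference, the combined pickup and dropoff strings described above can be built from the separate date and time columns with concat_ws. A minimal sketch, assuming the column names shown in the frame:

from pyspark.sql.functions import concat_ws

# Join date and time with a space, e.g. "1/1/2017" + "0:00" -> "1/1/2017 0:00"
df = df.withColumn("pickup", concat_ws(" ", "pickup_date", "pickup_time")) \
       .withColumn("dropoff", concat_ws(" ", "dropoff_date", "dropoff_time"))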

Convert the string timestamps to the timestamp data type and subtract them. Note that the pattern letter H denotes the 24-hour clock (0-23), so no am/pm marker is needed.

Code:

import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.LongType

import spark.implicits._ // required for toDF on a local Seq

val data = Seq(("1/1/2017 0:00", "1/1/2017 0:35"))
val df = data.toDF("pickup_dt", "drop_dt")

df
  .withColumn("pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm"))
  .withColumn("drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm"))
  // casting a timestamp to LongType yields seconds since the epoch; /60 gives minutes
  .withColumn("diff", (col("drop_dt").cast(LongType) - col("pickup_dt").cast(LongType)) / 60)
  .show(false)

Output:

+-------------------+-------------------+----+
|pickup_dt          |drop_dt            |diff|
+-------------------+-------------------+----+
|2017-01-01 00:00:00|2017-01-01 00:35:00|35.0|
+-------------------+-------------------+----+

PySpark:

from pyspark.sql.functions import col, to_timestamp

df.withColumn("pickup_dt", to_timestamp(col("pickup_dt"), "d/M/yyyy H:mm")) \
  .withColumn("drop_dt", to_timestamp(col("drop_dt"), "d/M/yyyy H:mm")) \
  .withColumn(
      "diff",
      (col("drop_dt").cast("long") - col("pickup_dt").cast("long")) / 60.0
  ).show(truncate=False)
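
Applied to the question's frame, the same steps work directly on the combined pickup and dropoff string columns. A minimal sketch (the trip_minutes column name is just illustrative):

from pyspark.sql.functions import col, to_timestamp

result = (
    df.withColumn("pickup", to_timestamp(col("pickup"), "d/M/yyyy H:mm"))
      .withColumn("dropoff", to_timestamp(col("dropoff"), "d/M/yyyy H:mm"))
      # timestamp cast to long gives seconds since the epoch; /60 gives minutes
      .withColumn("trip_minutes",
                  (col("dropoff").cast("long") - col("pickup").cast("long")) / 60.0)
)
result.show(truncate=False)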