Update DataFrame value by timestamp in Scala

Question

I have this DataFrame:

+----------------+-----------------------------+--------------------+--------------+----------------+
|   customerid   |  event                      | A                  | B            |    C           |
+----------------+-----------------------------+--------------------+--------------+----------------+
|     1222222    | 2019-02-07 06:50:40.0       |aaaaaa              | 25           | 5025           |
|     1222222    | 2019-02-07 06:50:42.0       |aaaaaa              | 35           | 5000           |
|     1222222    | 2019-02-07 06:51:56.0       |aaaaaa              | 100          | 4965           |
+----------------+-----------------------------+--------------------+--------------+----------------+

I want to update the value of column C by event (timestamp) and keep only the row with the latest update, like this:

+----------------+-----------------------------+--------------------+--------------+----------------+
|   customerid   |  event                      | A                  | B            |    C           |
+----------------+-----------------------------+--------------------+--------------+----------------+
|     1222222    | 2019-02-07 06:51:56.0       |aaaaaa              | 100          | 4965           |
+----------------+-----------------------------+--------------------+--------------+----------------+

The data arrives in streaming mode through Spark Streaming.

Answer 1

You can create a row number partitioned by customerid and ordered by event descending, then keep only the rows where rownum is 1. Hope this helps.

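import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
// Number each customer's rows newest-first by event, then keep only row 1.
df.withColumn("rownum", row_number().over(Window.partitionBy("customerid").orderBy(col("event").desc)))
    .filter(col("rownum") === 1)
    .drop("rownum")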
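
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same row_number approach, run against the three sample rows from the question. The object name LatestPerCustomer, the helpers sampleData and latestPerCustomer, and the local SparkSession settings are my own assumptions for illustration, not part of the original post. It also assumes a static (batch) DataFrame.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical wrapper object; the names here are illustrative only.
object LatestPerCustomer {

  // Builds the three sample rows shown in the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1222222L, Timestamp.valueOf("2019-02-07 06:50:40"), "aaaaaa", 25, 5025),
      (1222222L, Timestamp.valueOf("2019-02-07 06:50:42"), "aaaaaa", 35, 5000),
      (1222222L, Timestamp.valueOf("2019-02-07 06:51:56"), "aaaaaa", 100, 4965)
    ).toDF("customerid", "event", "A", "B", "C")
  }

  // Keeps, for each customerid, only the row with the most recent event.
  def latestPerCustomer(df: DataFrame): DataFrame = {
    // Rows of one customer, newest event first.
    val newestFirst = Window.partitionBy("customerid").orderBy(col("event").desc)
    df.withColumn("rownum", row_number().over(newestFirst))
      .filter(col("rownum") === 1)
      .drop("rownum")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .appName("latest-per-customer")
      .master("local[*]")
      .getOrCreate()

    latestPerCustomer(sampleData(spark)).show(false)

    spark.stop()
  }
}
```

With the sample data this should leave only the 06:51:56 row for customer 1222222, which matches the desired output in the question. With a streaming source, the function would probably have to be applied to each micro-batch (for example inside foreachBatch in Structured Streaming), since as far as I know non-time-based windows are not supported directly on streaming DataFrames. In that case it returns the latest row within each batch only, not across batches.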

scala

bigdata

dataframe

apache-spark

spark-streaming