Drop consecutive duplicates in a pyspark dataframe
Given a dataframe like this:
## +---+---+
## | id|num|
## +---+---+
## | 2|3.0|
## | 3|6.0|
## | 3|2.0|
## | 3|1.0|
## | 2|9.0|
## | 4|7.0|
## +---+---+
I want to remove the consecutive duplicates and end up with:
## +---+---+
## | id|num|
## +---+---+
## | 2|3.0|
## | 3|6.0|
## | 2|9.0|
## | 4|7.0|
## +---+---+
I have found ways of doing this in Pandas, but nothing for PySpark.
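(For reference, the usual Pandas idiom compares a column against its shift(); a minimal sketch with the same example data, where the name df is just illustrative:)

import pandas as pd

df = pd.DataFrame({"id": [2, 3, 3, 3, 2, 4],
                   "num": [3.0, 6.0, 2.0, 1.0, 9.0, 7.0]})
# Keep a row only when its id differs from the previous row's id.
print(df[df["id"] != df["id"].shift()])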
This should do what you want, though there is probably still room for optimization:
from pyspark.sql.functions import col, lag, monotonically_increasing_id, when
from pyspark.sql.window import Window as W

test_df = spark.createDataFrame([
    (2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)
], ("id", "num"))

# Create a temporary ID because the window needs an ordered structure.
test_df = test_df.withColumn("idx", monotonically_increasing_id())
w = W.orderBy("idx")
# A row is kept when the previous row does not contain the same id.
get_last = when(lag("id", 1).over(w) == col("id"), False).otherwise(True)
# Only select the rows where the id changed.
test_df.withColumn("changed", get_last).filter(col("changed")).select("id", "num").show()
Output:
+---+---+
| id|num|
+---+---+
| 2|3.0|
| 3|6.0|
| 2|9.0|
| 4|7.0|
+---+---+
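Two caveats: a window with orderBy but no partitionBy moves all rows to a single partition, so this will not scale well to very large data, and monotonically_increasing_id only reflects the DataFrame's current order. If you instead want a row to count as a consecutive duplicate only when every column repeats, the same trick extends to multiple columns. A minimal sketch reusing test_df and the idx column from above (the coalesce keeps the first row, whose lag() is null):

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

w = W.orderBy("idx")
# Keep a row when either column differs from the previous row.
changed = (F.lag("id", 1).over(w) != F.col("id")) | (F.lag("num", 1).over(w) != F.col("num"))
(test_df
 .withColumn("changed", F.coalesce(changed, F.lit(True)))  # first row has no predecessor, keep it
 .filter("changed")
 .drop("idx", "changed")
 .show())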