How to consolidate intervals in PySpark

I need help collapsing overlapping time intervals within each group.

Specifically, here's what I have:

id  time_start  time_end
1   8:00        9:00
1   8:30        9:30
1   9:45        10:00
2   8:00        9:00
2   8:30        8:40

And here's what I want:

id  time_start  time_end
1   8:00        9:30
1   9:45        10:00
2   8:00        9:00

The data is large and needs to be processed as a Spark DataFrame.

Any help is appreciated!
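
For anyone who wants to reproduce this, here is a minimal sketch of the sample data as a Spark DataFrame (assuming the times are plain strings, which the question doesn't state):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; the times are kept as plain strings.
df = spark.createDataFrame(
    [
        (1, '8:00', '9:00'),
        (1, '8:30', '9:30'),
        (1, '9:45', '10:00'),
        (2, '8:00', '9:00'),
        (2, '8:30', '8:40'),
    ],
    ['id', 'time_start', 'time_end'],
)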

You can add grouping columns as below. The times are first zero-padded (e.g. 8:00 becomes 08:00) so that comparing them as strings matches chronological order; a flag then marks each row that starts a new non-overlapping group, and its cumulative sum becomes the group id:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # Zero-pad to 'HH:mm' (e.g. '8:00' -> '08:00') so that lexicographic
    # string comparison matches chronological order.
    'time_start', F.lpad('time_start', 5, '0')
).withColumn(
    'time_end', F.lpad('time_end', 5, '0')
).withColumn(
    # 1 when a new group starts, 0 when this interval overlaps the running
    # maximum end time of all earlier intervals for the same id. For the
    # first row per id the window is empty, the max is null, the comparison
    # is null, and otherwise(1) applies.
    'overlap',
    F.when(
        F.max('time_end').over(
            Window.partitionBy('id')
                  .orderBy('time_start')
                  .rowsBetween(Window.unboundedPreceding, -1)
        ) >= F.col('time_start'),
        0
    ).otherwise(1)
).withColumn(
    # Cumulative sum of the "new group" flag gives each row its group id.
    'group',
    F.sum('overlap').over(Window.partitionBy('id').orderBy('time_start'))
).groupBy('id', 'group').agg(
    # Each merged interval spans the earliest start to the latest end.
    F.min('time_start').alias('time_start'),
    F.max('time_end').alias('time_end')
).drop('group')

df2.show()
+---+----------+--------+
| id|time_start|time_end|
+---+----------+--------+
|  1|     08:00|   09:30|
|  1|     09:45|   10:00|
|  2|     08:00|   09:00|
+---+----------+--------+

For reference, here is the intermediate result before the final groupBy:

+---+----------+--------+-------+-----+
| id|time_start|time_end|overlap|group|
+---+----------+--------+-------+-----+
|  1|     08:00|   09:00|      1|    1|
|  1|     08:30|   09:30|      0|    1|
|  1|     09:45|   10:00|      1|    2|
|  2|     08:00|   09:00|      1|    1|
|  2|     08:30|   08:40|      0|    1|
+---+----------+--------+-------+-----+
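
If downstream logic needs real time values rather than strings, the padded 'HH:mm' strings can be parsed afterwards; this is a hypothetical follow-up, not part of the original answer:

# Hypothetical follow-up: parse the padded 'HH:mm' strings into timestamps
# for time arithmetic. The date part defaults to 1970-01-01, which is fine
# for pure time-of-day math.
df3 = df2.withColumn(
    'start_ts', F.to_timestamp('time_start', 'HH:mm')
).withColumn(
    'end_ts', F.to_timestamp('time_end', 'HH:mm')
)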