How to consolidate intervals in PySpark

I need help collapsing overlapping time intervals within each group.

Specifically, here's what I have:

id  time_start  time_end
1   8:00        9:00
1   8:30        9:30
1   9:45        10:00
2   8:00        9:00
2   8:30        8:40

And here's what I want:

id  time_start  time_end
1   8:00        9:30
1   9:45        10:00
2   8:00        9:00

The data is large and needs to be processed as a Spark DataFrame.

Any help is appreciated!
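
For anyone who wants to reproduce this, here is a minimal sketch of the sample data as a Spark DataFrame (assuming the times are plain strings, which the question doesn't state):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; the times are kept as plain strings.
df = spark.createDataFrame(
    [
        (1, '8:00', '9:00'),
        (1, '8:30', '9:30'),
        (1, '9:45', '10:00'),
        (2, '8:00', '9:00'),
        (2, '8:30', '8:40'),
    ],
    ['id', 'time_start', 'time_end'],
)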

You can add grouping columns as below. The times are first zero-padded (e.g. 8:00 becomes 08:00) so that comparing them as strings matches chronological order; a flag then marks each row that starts a new non-overlapping group, and its cumulative sum becomes the group id:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # Zero-pad to 'HH:mm' (e.g. '8:00' -> '08:00') so that lexicographic
    # string comparison matches chronological order.
    'time_start', F.lpad('time_start', 5, '0')
).withColumn(
    'time_end', F.lpad('time_end', 5, '0')
).withColumn(
    # 1 when a new group starts, 0 when this interval overlaps the running
    # maximum end time of all earlier intervals for the same id. For the
    # first row per id the window is empty, the max is null, the comparison
    # is null, and otherwise(1) applies.
    'overlap',
    F.when(
        F.max('time_end').over(
            Window.partitionBy('id')
                  .orderBy('time_start')
                  .rowsBetween(Window.unboundedPreceding, -1)
        ) >= F.col('time_start'),
        0
    ).otherwise(1)
).withColumn(
    # Cumulative sum of the "new group" flag gives each row its group id.
    'group',
    F.sum('overlap').over(Window.partitionBy('id').orderBy('time_start'))
).groupBy('id', 'group').agg(
    # Each merged interval spans the earliest start to the latest end.
    F.min('time_start').alias('time_start'),
    F.max('time_end').alias('time_end')
).drop('group')

df2.show()
+---+----------+--------+
| id|time_start|time_end|
+---+----------+--------+
|  1|     08:00|   09:30|
|  1|     09:45|   10:00|
|  2|     08:00|   09:00|
+---+----------+--------+

For reference, here is the intermediate result before the final groupBy:

+---+----------+--------+-------+-----+
| id|time_start|time_end|overlap|group|
+---+----------+--------+-------+-----+
|  1|     08:00|   09:00|      1|    1|
|  1|     08:30|   09:30|      0|    1|
|  1|     09:45|   10:00|      1|    2|
|  2|     08:00|   09:00|      1|    1|
|  2|     08:30|   08:40|      0|    1|
+---+----------+--------+-------+-----+
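
If downstream logic needs real time values rather than strings, the padded 'HH:mm' strings can be parsed afterwards; this is a hypothetical follow-up, not part of the original answer:

# Hypothetical follow-up: parse the padded 'HH:mm' strings into timestamps
# for time arithmetic. The date part defaults to 1970-01-01, which is fine
# for pure time-of-day math.
df3 = df2.withColumn(
    'start_ts', F.to_timestamp('time_start', 'HH:mm')
).withColumn(
    'end_ts', F.to_timestamp('time_end', 'HH:mm')
)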