How to consolidate intervals in PySpark
I need help collapsing time intervals that overlap with each other within each group.
Specifically, here is what I have:
id | time_start | time_end |
---|---|---|
1 | 8:00 | 9:00 |
1 | 8:30 | 9:30 |
1 | 9:45 | 10:00 |
2 | 8:00 | 9:00 |
2 | 8:30 | 8:40 |
And this is what I want:
id | time_start | time_end |
---|---|---|
1 | 8:00 | 9:30 |
1 | 9:45 | 10:00 |
2 | 8:00 | 9:00 |
The data is large, so it needs to be processed as a Spark dataframe.
Any help is appreciated!
You can add a grouping column as follows:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # Zero-pad to 'HH:MM' so string comparison matches chronological order
    'time_start', F.lpad('time_start', 5, '0')
).withColumn(
    'time_end', F.lpad('time_end', 5, '0')
).withColumn(
    # overlap = 0 when this interval overlaps an earlier one in the same id,
    # i.e. the running max of all previous time_end reaches this time_start
    'overlap',
    F.when(
        F.max('time_end').over(
            Window.partitionBy('id')
                  .orderBy('time_start')
                  .rowsBetween(Window.unboundedPreceding, -1)
        ) >= F.col('time_start'),
        0
    ).otherwise(1)
).withColumn(
    # Running sum of the flags numbers each island of overlapping rows
    'group',
    F.sum('overlap').over(Window.partitionBy('id').orderBy('time_start'))
).groupBy('id', 'group').agg(
    # Collapse each island to its earliest start and latest end
    F.min('time_start').alias('time_start'),
    F.max('time_end').alias('time_end')
).drop('group')

df2.show()
+---+----------+--------+
| id|time_start|time_end|
+---+----------+--------+
| 1| 08:00| 09:30|
| 1| 09:45| 10:00|
| 2| 08:00| 09:00|
+---+----------+--------+
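A note on the `lpad` step: the times are strings, so without zero-padding, lexicographic ordering would put `10:00` before `8:00` and break the window ordering. A quick plain-Python illustration (`str.rjust` plays the role of Spark's `lpad` here):

```python
# Times compared as raw strings sort lexicographically, not chronologically.
times = ['8:00', '10:00', '9:45']
print(sorted(times))                            # '10:00' sorts first ('1' < '8')
print(sorted(t.rjust(5, '0') for t in times))   # zero-padded: chronological order
```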
Behind the scenes, before the grouping:
+---+----------+--------+-------+-----+
| id|time_start|time_end|overlap|group|
+---+----------+--------+-------+-----+
| 1| 08:00| 09:00| 1| 1|
| 1| 08:30| 09:30| 0| 1|
| 1| 09:45| 10:00| 1| 2|
| 2| 08:00| 09:00| 1| 1|
| 2| 08:30| 08:40| 0| 1|
+---+----------+--------+-------+-----+
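For sanity-checking the Spark logic on small samples, the same gaps-and-islands merge can be sketched in plain Python (a hypothetical helper, not part of the Spark answer; it assumes zero-padded `'HH:MM'` strings):

```python
# Plain-Python sketch of the interval merge used in the Spark answer above.
from itertools import groupby
from operator import itemgetter

def merge_intervals(rows):
    """rows: iterable of (id, start, end) with zero-padded 'HH:MM' strings."""
    out = []
    for key, grp in groupby(sorted(rows), key=itemgetter(0)):
        cur_start = cur_end = None
        for _, start, end in grp:
            if cur_end is not None and start <= cur_end:
                # Overlaps the current island: extend its end if needed
                cur_end = max(cur_end, end)
            else:
                # New island: flush the previous one
                if cur_start is not None:
                    out.append((key, cur_start, cur_end))
                cur_start, cur_end = start, end
        out.append((key, cur_start, cur_end))
    return out

data = [(1, '08:00', '09:00'), (1, '08:30', '09:30'), (1, '09:45', '10:00'),
        (2, '08:00', '09:00'), (2, '08:30', '08:40')]
print(merge_intervals(data))
# [(1, '08:00', '09:30'), (1, '09:45', '10:00'), (2, '08:00', '09:00')]
```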