Data grouping question, but based on a "window"
All,
I have a dataset defined as below:
eno|date|attendance
1|01-Jan-2010|P
1|02-Jan-2010|P
1|03-Jan-2010|A
1|04-Jan-2010|P
1|05-Jan-2010|P
2|01-Jan-2010|P
2|02-Jan-2010|P
2|03-Jan-2010|P
2|04-Jan-2010|A
2|05-Jan-2010|P
For each employee, the requirement is to create an "interval group", which essentially groups the attendance values in chronological order: consecutive identical attendance values share a group, and a new group starts whenever the value changes. So the expected output is:
eno|date|attendance|attendanceGroup
1|01-Jan-2010|P|1
1|02-Jan-2010|P|1
1|03-Jan-2010|A|2
1|04-Jan-2010|P|3
1|05-Jan-2010|P|3
2|01-Jan-2010|P|1
2|02-Jan-2010|P|1
2|03-Jan-2010|P|1
2|04-Jan-2010|A|2
2|05-Jan-2010|P|3
What I have managed so far is to fetch the previous row's attendance value, but I have no idea how to proceed from there... Many thanks in advance.
from datetime import datetime, timedelta
from pyspark.sql import Row, Window
from pyspark.sql.functions import lag

EmployeeAttendance = Row("eno", "date", "attendance")
EmpAttRowList = [EmployeeAttendance("1", datetime.now().date() - timedelta(days=100), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=99), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=98), "N"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=97), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=96), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=95), "N"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=94), "Y"),
EmployeeAttendance("1", datetime.now().date() - timedelta(days=93), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=100), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=99), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=98), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=97), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=96), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=95), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=94), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=93), "N"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=92), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=91), "Y"),
EmployeeAttendance("2", datetime.now().date() - timedelta(days=90), "N"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=97), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=96), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=95), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=94), "N"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=93), "N"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=92), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=91), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=90), "Y"),
EmployeeAttendance("3", datetime.now().date() - timedelta(days=89), "Y")
]
df = spark.createDataFrame(EmpAttRowList, EmployeeAttendance)
window = Window.partitionBy(df['eno']).orderBy("date")
# Previous row's attendance per employee, in date order
previousrowattendance = lag(df["attendance"]).over(window)
Assuming you have created the DataFrame with the code above, you can use the code below to get the attendanceGroup. Let me know if it works.
import pyspark.sql.functions as F
from pyspark.sql import Window

winSpec = Window.partitionBy('eno').orderBy('date')

# Keep only the rows where a new run of attendance values starts:
# the previous value differs, or there is no previous row.
df_unique = df.withColumn('prevAttendance', F.lag('attendance').over(winSpec))
df_unique = df_unique.filter((df_unique.attendance != df_unique.prevAttendance) | F.col('prevAttendance').isNull())

# Number those run-start rows 1, 2, 3, ... per employee.
df_unique = df_unique.withColumn('attendanceGroup', F.row_number().over(winSpec))
df_unique = df_unique.withColumnRenamed('eno', 'eno_t').withColumnRenamed('date', 'date_t').drop('attendance').drop('prevAttendance')

# Join the group numbers back onto the full data and forward-fill them
# across the rows that are not run starts.
df = df.join(df_unique, (df.eno == df_unique.eno_t) & (df.date == df_unique.date_t), 'left').drop('eno_t').drop('date_t')
df = df.withColumn('attendanceGroup', F.last('attendanceGroup', ignorenulls=True).over(winSpec))
df.orderBy('eno', 'date').show(10, False)
+---+----------+----------+---------------+
|eno|date |attendance|attendanceGroup|
+---+----------+----------+---------------+
|1 |2019-08-16|Y |1 |
|1 |2019-08-17|Y |1 |
|1 |2019-08-18|N |2 |
|1 |2019-08-19|Y |3 |
|1 |2019-08-20|Y |3 |
|1 |2019-08-21|N |4 |
|1 |2019-08-22|Y |5 |
|1 |2019-08-23|Y |5 |
|2 |2019-08-16|Y |1 |
|2 |2019-08-17|Y |1 |
+---+----------+----------+---------------+
only showing top 10 rows
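A note on the forward-fill step above: F.last('attendanceGroup', ignorenulls=True).over(winSpec) relies on the default frame of an ordered window, which runs from the start of the partition to the current row. Here is a minimal equivalent sketch with that frame spelled out explicitly, in case you prefer it stated (fillSpec is just an illustrative name):

# Same forward-fill with the window frame made explicit; the default frame
# of an ordered window is already unboundedPreceding..currentRow.
fillSpec = (Window.partitionBy('eno')
                  .orderBy('date')
                  .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn('attendanceGroup',
                   F.last('attendanceGroup', ignorenulls=True).over(fillSpec))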
You can try this: create a grp flag with the condition attendance != lag(attendance), which marks the rows where a new run starts and makes the flags easy to sum. Then, over a window partitioned by the original id eno and ordered by date, take a running sum of the flag and add 1 so the count starts at 1 (see the standalone trace after the output below).
window = Window.partitionBy("eno").orderBy("date")
# Flag rows where the attendance value changes from the previous row.
df = df.withColumn('grp', F.when(F.col("attendance") != F.lag(F.col("attendance")).over(window), 1).otherwise(0))
# Running sum of the flags over the same ordered window; +1 starts the count at 1.
df = df.withColumn("group", 1 + F.sum(F.col("grp")).over(window)).drop("grp").orderBy("eno", "date")
Output:
+---+----------+----------+-----+
|eno|      date|attendance|group|
+---+----------+----------+-----+
|  1|2019-08-17|         Y|    1|
|  1|2019-08-18|         Y|    1|
|  1|2019-08-19|         N|    2|
|  1|2019-08-20|         Y|    3|
|  1|2019-08-21|         Y|    3|
|  1|2019-08-22|         N|    4|
|  1|2019-08-23|         Y|    5|
|  1|2019-08-24|         Y|    5|
|  2|2019-08-17|         Y|    1|
|  2|2019-08-18|         Y|    1|
|  2|2019-08-19|         N|    2|
|  2|2019-08-20|         Y|    3|
|  2|2019-08-21|         Y|    3|
|  2|2019-08-22|         N|    4|
|  2|2019-08-23|         N|    4|
|  2|2019-08-24|         N|    4|
|  2|2019-08-25|         Y|    5|
|  2|2019-08-26|         Y|    5|
|  2|2019-08-27|         N|    6|
|  3|2019-08-20|         Y|    1|
+---+----------+----------+-----+
only showing top 20 rows
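To see why the change-flag plus running-sum trick yields the interval groups, here is a tiny standalone trace of the same counting logic on employee 1's sequence (plain Python, no Spark, purely illustrative):

attendance = ["Y", "Y", "N", "Y", "Y", "N", "Y", "Y"]  # eno=1, in date order
group, prev = 0, None
for a in attendance:
    if a != prev:    # the grp flag: the value changed (first row counts too)
        group += 1   # running sum of flags, so the count starts at 1
    prev = a
    print(a, group)  # prints Y 1, Y 1, N 2, Y 3, Y 3, N 4, Y 5, Y 5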