Create boolean if all the months in a year are included in a column - PySpark
I want to create a boolean column that returns yes if a given subset of a date column contains every month of a year.
Example:
id date
a 2021-01-01
a 2021-02-01
...
a 2021-12-01
b 2021-02-01
b 2021-04-01
The result should look like:
id date full_year
a 2021-01-01 yes
a 2021-02-01 yes
... ...
a 2021-12-01 yes
b 2021-02-01 no
b 2021-04-01 no
Imports:
from pyspark.sql import functions as F, Window as W
Code:
w = W.partitionBy("id", F.year("date"))
out = (sdf.withColumn("date", F.to_date("date"))
          .withColumn("CountYearMonth",
                      F.size(F.collect_set(F.date_format("date", "yyyyMM")).over(w)))
          .withColumn("full_year", F.when(F.col("CountYearMonth") == 12, "yes").otherwise("No"))
          .drop("CountYearMonth")
      )
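Logic:
- Partition by id and by the year of date to create a window (w)
- Cast the date column to an actual date type (skip this step if the column is already a date)
- Over the window (w), collect the set of dates formatted as yyyyMM and take its size, then apply the condition below
- If size == 12, assign yes; otherwise assign No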
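To make the steps above concrete without a Spark session, here is a minimal plain-Python sketch of the same idea: group by (id, year), collect the set of months seen, and flag the group when the set has 12 entries. The function name `flag_full_years` and the sample rows are made up for illustration; this is not the PySpark code itself, only the logic behind it.

```python
from collections import defaultdict
from datetime import date


def flag_full_years(rows):
    """rows: iterable of (id, date) pairs. Returns a list of (id, date, full_year)."""
    rows = list(rows)
    # Equivalent of the window partitioned by id and year: one set of months per group.
    months_seen = defaultdict(set)
    for key, d in rows:
        months_seen[(key, d.year)].add(d.month)
    # A group is a full year only when all 12 distinct months are present.
    return [
        (key, d, "yes" if len(months_seen[(key, d.year)]) == 12 else "No")
        for key, d in rows
    ]


# Hypothetical sample: id "a" covers all of 2021 plus one month of 2022, id "b" has two months.
rows = [("a", date(2021, m, 1)) for m in range(1, 13)]
rows += [("a", date(2022, 1, 1)), ("b", date(2021, 2, 1)), ("b", date(2021, 4, 1))]
result = flag_full_years(rows)
```

Because the grouping key includes the year, the 2022 row for id "a" is judged separately from the complete 2021 rows, which is the same effect the year in `partitionBy` has in the Spark version.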
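Alternatively, you can replace the size of the collected set with an approximate distinct count (approx_count_distinct returns an estimate, so prefer the collect_set version when the result must be exact):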
w = W.partitionBy("id", F.year("date"))
out = (sdf.withColumn("date", F.to_date("date"))
          .withColumn("CountYearMonth",
                      F.approx_count_distinct(F.date_format("date", "yyyyMM")).over(w))
          .withColumn("full_year", F.when(F.col("CountYearMonth") == 12, "yes").otherwise("No"))
          .drop("CountYearMonth")
      )
Sample output:
+---+----------+---------+
|id |date |full_year|
+---+----------+---------+
|a |2021-01-31|yes |
|a |2021-02-28|yes |
|a |2021-03-31|yes |
|a |2021-04-30|yes |
|a |2021-05-31|yes |
|a |2021-06-30|yes |
|a |2021-07-31|yes |
|a |2021-08-31|yes |
|a |2021-09-30|yes |
|a |2021-10-31|yes |
|a |2021-11-30|yes |
|a |2021-12-31|yes |
|a |2022-01-31|No |
|a |2022-02-28|No |
|a |2022-03-31|No |
|a |2022-04-30|No |
|b |2021-01-31|No |
|b |2021-02-28|No |
|b |2021-03-31|No |
|b |2021-04-30|No |
|b |2021-05-31|No |
|b |2021-06-30|No |
+---+----------+---------+