PySpark: get year, month, quarter and quarter month number from a dataframe column

Question

My input has two columns, partner_id and month_id (a STRING in YYMM format):

partner_id|month_id|
1001      |  2001  |
1002      |  2002  |
1003      |  2003  |
1001      |  2004  |
1002      |  2005  |
1003      |  2006  |
1001      |  2007  |
1002      |  2008  |
1003      |  2009  |
1003      |  2010  |
1003      |  2011  |
1003      |  2012  |

Required output:

partner_id|month_id|month_num|year|qtr_num|qtr_month_num|
1001      |  2001  |01       |2020|1      |1            |
1002      |  2002  |02       |2020|1      |2            |
1003      |  2003  |03       |2020|1      |3            |
1001      |  2004  |04       |2020|2      |1            |
1002      |  2005  |05       |2020|2      |2            |
1003      |  2006  |06       |2020|2      |3            |
1001      |  2007  |07       |2020|3      |1            |
1002      |  2008  |08       |2020|3      |2            |
1003      |  2009  |09       |2020|3      |3            |
1003      |  2010  |10       |2020|4      |1            |
1003      |  2011  |11       |2020|4      |2            |
1003      |  2012  |12       |2020|4      |3            |

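I want to create these new columns from the month_id column. I used the date_format() function but did not get the correct result, because month_id is a string column in YYMM format. How can I derive the four new columns shown in the required output from month_id?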

Answer 1

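You can use the date_format function to create most of your columns. However, this function uses the Java SimpleDateFormat patterns, and the quarter is not available there, so you have to write your own code based on the month number.

Here is how you can do it: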

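from pyspark.sql import functions as F

# Parse the YYMM string into a timestamp (first day of that month)
df.withColumn("date_col", F.to_timestamp("month_id", "yyMM")).select(
    "partner_id",
    "month_id",
    F.date_format("date_col", "MM").alias("month_num"),
    # "yyyy" is the calendar year; "YYYY" would be the week-based year
    F.date_format("date_col", "yyyy").alias("year"),
    # Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
    ((F.date_format("date_col", "MM") + 2) / 3).cast("int").alias("qtr_num"),
    # Position of the month inside its quarter (1, 2 or 3)
    (((F.date_format("date_col", "MM") - 1) % 3) + 1)
    .cast("int")
    .alias("qtr_month_num"),
).show()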


+----------+--------+---------+----+-------+-------------+
|partner_id|month_id|month_num|year|qtr_num|qtr_month_num|
+----------+--------+---------+----+-------+-------------+
|      1001|    2001|       01|2020|      1|            1|
|      1002|    2002|       02|2020|      1|            2|
|      1003|    2003|       03|2020|      1|            3|
|      1001|    2004|       04|2020|      2|            1|
|      1002|    2005|       05|2020|      2|            2|
|      1003|    2006|       06|2020|      2|            3|
|      1001|    2007|       07|2020|      3|            1|
|      1002|    2008|       08|2020|      3|            2|
|      1003|    2009|       09|2020|      3|            3|
|      1003|    2010|       10|2020|      4|            1|
|      1003|    2011|       11|2020|      4|            2|
|      1003|    2012|       12|2020|      4|            3|
+----------+--------+---------+----+-------+-------------+
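
The quarter arithmetic used above can be checked outside Spark. The following is only a minimal sketch in plain Python, not part of the original answer: `split_month_id` is a hypothetical helper name, and it assumes every month_id is a valid four-character YYMM string whose year falls in the 2000s.

```python
from datetime import datetime


def split_month_id(month_id: str) -> dict:
    """Derive the four requested values from a YYMM string (hypothetical helper)."""
    # "%y%m" reads a two-digit year followed by a two-digit month;
    # Python maps two-digit years 00-68 to 2000-2068.
    parsed = datetime.strptime(month_id, "%y%m")
    month = parsed.month
    return {
        "month_num": f"{month:02d}",           # zero-padded month, e.g. "01"
        "year": parsed.year,                   # four-digit calendar year
        "qtr_num": (month + 2) // 3,           # same formula as qtr_num above
        "qtr_month_num": (month - 1) % 3 + 1,  # same formula as qtr_month_num above
    }


for month_id in ("2001", "2006", "2012"):
    print(month_id, split_month_id(month_id))
```

If the values agree with the table above for all twelve months, the two formulas are doing what the required output asks for. In PySpark itself, the built-in functions F.year, F.month and F.quarter applied to date_col should give the same year, month and quarter numbers and may be worth trying as an alternative to the hand-written arithmetic.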
