PySpark - Using lists inside LIKE operator
I would like to use a list inside the LIKE operator in pyspark in order to create a column.
I have the following input df:
input_df:
+------+--------------------+-------+
| ID| customers|country|
+------+--------------------+-------+
|161 |xyz Limited |U.K. |
|262 |ABC Limited |U.K. |
|165 |Sons & Sons |U.K. |
|361 |TÜV GmbH |Germany|
|462 |Mueller GmbH |Germany|
|369 |Schneider AG |Germany|
|467 |Sahm UG |Austria|
+------+--------------------+-------+
I would like to add a column CAT_ID. CAT_ID takes the value 1 if "ID" contains "16" or "26". CAT_ID takes the value 2 if "ID" contains "36" or "46".
So I would like my output df to look like this:
The desired output_df:
+------+--------------------+-------+-------+
| ID| customers|country|Cat_ID |
+------+--------------------+-------+-------+
|161 |xyz Limited |U.K. |1 |
|262 |ABC Limited |U.K. |1 |
|165 |Sons & Sons |U.K. |1 |
|361 |TÜV GmbH |Germany|2 |
|462 |Mueller GmbH |Germany|2 |
|369 |Schneider AG |Germany|2 |
|467 |Sahm UG |Austria|2 |
+------+--------------------+-------+-------+
I am interested in learning how this can be done with the LIKE statement and lists.
I know how to implement it without a list, and it works fine:
from pyspark.sql import functions as F

def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID',
        F.when((F.col('ID').like('16%')) | (F.col('ID').like('26%')), "1")
         .when((F.col('ID').like('36%')) | (F.col('ID').like('46%')), "2")
         .otherwise('999')
    )

output_df = add_CAT_ID(input_df)
However, I would love to use a list instead and have something like:
list1 = ['16', '26']
list2 = ['36', '46']

def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID',
        F.when(F.col('ID').like(list1 %), "1")
         .when(F.col('ID').like('list2 %'), "2")
         .otherwise('999')
    )

output_df = add_CAT_ID(input_df)
Thanks a lot,
SQL wildcards do not support "or" clauses. There are several ways you can handle it, though.
1. Using a regular expression
You can use rlike with a regular expression:
import pyspark.sql.functions as psf

list1 = ['16', '26']
list2 = ['36', '46']

df.withColumn(
    'CAT_ID',
    psf.when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list1))), '1')
       .when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list2))), '2')
       .otherwise('999')) \
    .show()
+---+------------+-------+------+
| ID| customers|country|CAT_ID|
+---+------------+-------+------+
|161| xyz Limited| U.K.| 1|
|262| ABC Limited|   U.K.|     1|
|165| Sons & Sons| U.K.| 1|
|361| TÜV GmbH|Germany| 2|
|462|Mueller GmbH|Germany| 2|
|369|Schneider AG|Germany| 2|
|467| Sahm UG|Austria| 2|
+---+------------+-------+------+
Here, for list1 we get the regular expression (16|26)\d, which matches 16 or 26 followed by a digit (\d is equivalent to [0-9]).
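As a side note: if the list values could ever contain regex metacharacters, escape them before joining, for example with re.escape:
import re

# Build the same pattern safely when prefixes may contain regex metacharacters
pattern = r'({})\d'.format('|'.join(re.escape(x) for x in list1))  # -> r'(16|26)\d'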
2. Dynamically building a SQL clause
If you want to keep it SQL-like, you can use selectExpr and chain the values with ' OR ':
df.selectExpr(
    '*',
    "CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
    .format(*[' OR '.join(["ID LIKE '{}%'".format(x) for x in l]) for l in [list1, list2]]))
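With list1 and list2 as defined above, the format call expands to the following SQL expression:
CASE WHEN (ID LIKE '16%' OR ID LIKE '26%') THEN '1' WHEN (ID LIKE '36%' OR ID LIKE '46%') THEN '2' ELSE '999' END AS CAT_ID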
3. Dynamically building a Python expression
If you don't want to write SQL, you can also use eval:
df.withColumn(
    'CAT_ID',
    psf.when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list1])), '1')
       .when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list2])), '2')
       .otherwise('999'))
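A variation on the same idea that avoids eval: fold the per-prefix conditions together with functools.reduce and the | operator on Columns. A minimal sketch (the helper name like_any is illustrative, not from the answer):
from functools import reduce
import pyspark.sql.functions as psf

def like_any(col, prefixes):
    # OR together one LIKE condition per prefix, without resorting to eval
    return reduce(lambda a, b: a | b, [col.like('{}%'.format(p)) for p in prefixes])

df.withColumn(
    'CAT_ID',
    psf.when(like_any(psf.col('ID'), list1), '1')
       .when(like_any(psf.col('ID'), list2), '2')
       .otherwise('999'))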
Starting with Spark 2.4, you can use higher-order functions in spark-sql.
Try the approach below; the sql solution is the same for scala/python:
val df = Seq(
  ("161","xyz Limited","U.K."),
  ("262","ABC Limited","U.K."),
  ("165","Sons & Sons","U.K."),
  ("361","TÜV GmbH","Germany"),
  ("462","Mueller GmbH","Germany"),
  ("369","Schneider AG","Germany"),
  ("467","Sahm UG","Germany")
).toDF("ID","customers","country")
df.show(false)
df.createOrReplaceTempView("secil")
spark.sql(
""" with t1 ( select id, customers, country, array('16','26') as a1, array('36','46') as a2 from secil),
t2 (select id, customers, country, filter(a1, x -> id like x||'%') a1f, filter(a2, x -> id like x||'%') a2f from t1),
t3 (select id, customers, country, a1f, a2f,
case when size(a1f) > 0 then 1 else 0 end a1r,
case when size(a2f) > 0 then 2 else 0 end a2r
from t2)
select id, customers, country, a1f, a2f, a1r, a2r, a1r+a2r as Cat_ID from t3
""").show(false)
Results:
+---+------------+-------+
|ID |customers |country|
+---+------------+-------+
|161|xyz Limited |U.K. |
|262|ABC Limited |U.K.   |
|165|Sons & Sons |U.K. |
|361|TÜV GmbH |Germany|
|462|Mueller GmbH|Germany|
|369|Schneider AG|Germany|
|467|Sahm UG |Germany|
+---+------------+-------+
+---+------------+-------+----+----+---+---+------+
|id |customers |country|a1f |a2f |a1r|a2r|Cat_ID|
+---+------------+-------+----+----+---+---+------+
|161|xyz Limited |U.K. |[16]|[] |1 |0 |1 |
|262|ABC Limited |U.K.   |[26]|[]  |1  |0  |1     |
|165|Sons & Sons |U.K. |[16]|[] |1 |0 |1 |
|361|TÜV GmbH |Germany|[] |[36]|0 |2 |2 |
|462|Mueller GmbH|Germany|[] |[46]|0 |2 |2 |
|369|Schneider AG|Germany|[] |[36]|0 |2 |2 |
|467|Sahm UG |Germany|[] |[46]|0 |2 |2 |
+---+------------+-------+----+----+---+---+------+
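Since the question is in PySpark, the same higher-order-function approach can be sketched with exists via F.expr (assuming Spark 2.4+; the helper name prefix_hit is illustrative, not from the answer):
from pyspark.sql import functions as F

list1, list2 = ['16', '26'], ['36', '46']

def prefix_hit(col_name, prefixes):
    # exists(array, x -> pred) is true if any array element satisfies the predicate
    arr = "array({})".format(', '.join("'{}'".format(p) for p in prefixes))
    return F.expr("exists({0}, x -> {1} like x || '%')".format(arr, col_name))

output_df = input_df.withColumn(
    'Cat_ID',
    F.when(prefix_hit('ID', list1), 1)
     .when(prefix_hit('ID', list2), 2)
     .otherwise(999))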