How to use multiple regex patterns with rlike in PySpark
I have to filter a large file using multiple patterns. The problem is that I'm not sure of an efficient way to apply multiple patterns with rlike. Here is an example:
df = spark.createDataFrame(
    [
        ('www 17 north gate',),
        ('aaa 45 north gate',),
        ('bbb 56 west gate',),
        ('ccc 56 south gate',),
        ('Michigan gate',),
        ('Statue of Liberty',),
        ('57 adam street',),
        ('19 west main street',),
        ('street burger',)
    ],
    ['poi']
)
df.show()
+-------------------+
| poi|
+-------------------+
| www 17 north gate|
| aaa 45 north gate|
| bbb 56 west gate|
| ccc 56 south gate|
| Michigan gate|
| Statue of Liberty|
| 57 adam street|
|19 west main street|
| street burger|
+-------------------+
If I use the following two patterns based on the data, I can do:
pat1="(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2="[0-9]+ [a-z\s]+ street$"
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).show()
+-----------------+
| poi|
+-----------------+
|www 17 north gate|
| Michigan gate|
|Statue of Liberty|
| street burger|
+-----------------+
What if I have 40 different patterns? I think I could use a loop like this:
for pat in [pat1, pat2, ...., patn]:
    df = df.filter(~df['poi'].rlike(pat))
Is this the right approach? The original data is in Chinese, so please ignore whether the patterns themselves make sense. I just want to see how to handle multiple regex patterns.
Both methods you proposed have the same execution plan:
Using the two patterns in succession:
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$ &&
# NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$)
#+- Scan ExistingRDD[poi#297]
Using a loop:
from functools import reduce

# this is the same as your loop
df_new = reduce(lambda df, pat: df.filter(~df['poi'].rlike(pat)), [pat1, pat2], df)
df_new.explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$ &&
# NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$)
#+- Scan ExistingRDD[poi#297]
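As a side note, the chained filters can also be written as one filter() over a single boolean Column built with functools.reduce and the & operator. This is just a sketch with a placeholder pattern list, and it should produce an equivalent plan to the one above:
from functools import reduce
from operator import and_

# Build one Column that is the AND of every negated rlike condition,
# then apply it in a single filter() call. `patterns` is a placeholder
# for however many patterns you actually have.
patterns = [pat1, pat2]
keep = reduce(and_, [~df['poi'].rlike(p) for p in patterns])
df.filter(keep).explain()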
Another approach is to combine all of the patterns into one by chaining them together with the regex or operator using "|".join(). The main difference is that this results in a single call to rlike (as opposed to one call per pattern in the other approaches):
df.filter(~df['poi'].rlike("|".join([pat1, pat2]))).explain()
#== Physical Plan ==
#*Filter NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$|[0-9]+ [a-z\s]+ street$
#+- Scan ExistingRDD[poi#297]
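The same idea scales to an arbitrary list of patterns. A minimal sketch (the (?:...) wrapping is not part of the original answer; it is just a defensive habit that keeps each pattern self-contained inside the combined alternation):
# Combine any number of patterns into a single regex and call rlike once.
# Wrapping each pattern in a non-capturing group (?:...) keeps its anchors
# and internal alternations from interacting with the other patterns.
patterns = [pat1, pat2]  # could just as well be 40 patterns
combined = "|".join("(?:{})".format(p) for p in patterns)
df.filter(~df['poi'].rlike(combined)).show()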