How to use multiple regex patterns with rlike in PySpark
I have to filter a large file using multiple patterns. The problem is that I'm not sure of an efficient way to apply multiple patterns with rlike. Here is an example:
df = spark.createDataFrame(
    [
        ('www 17 north gate',),
        ('aaa 45 north gate',),
        ('bbb 56 west gate',),
        ('ccc 56 south gate',),
        ('Michigan gate',),
        ('Statue of Liberty',),
        ('57 adam street',),
        ('19 west main street',),
        ('street burger',)
    ],
    ['poi']
)
df.show()
+-------------------+
| poi|
+-------------------+
| www 17 north gate|
| aaa 45 north gate|
| bbb 56 west gate|
| ccc 56 south gate|
| Michigan gate|
| Statue of Liberty|
| 57 adam street|
|19 west main street|
| street burger|
+-------------------+
If I use the following two patterns based on the data, I can do:
pat1="(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2="[0-9]+ [a-z\s]+ street$"
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).show()
+-----------------+
| poi|
+-----------------+
|www 17 north gate|
| Michigan gate|
|Statue of Liberty|
| street burger|
+-----------------+
What if I have 40 different patterns? I think I could use a loop like this:
for pat in [pat1, pat2, ...., patn]:
    df = df.filter(~df['poi'].rlike(pat))
Is this the right approach? The original data is in Chinese, so please ignore whether the patterns themselves make sense. I just want to see how to handle multiple regex patterns.
Both methods you proposed have the same execution plan:
Using the two patterns in succession:
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$ &&
# NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$)
#+- Scan ExistingRDD[poi#297]
Using a loop:
from functools import reduce

# this is the same as your loop
df_new = reduce(lambda df, pat: df.filter(~df['poi'].rlike(pat)), [pat1, pat2], df)
df_new.explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$ &&
# NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$)
#+- Scan ExistingRDD[poi#297]
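As a side note, the chained filters can also be written as one filter() over a single boolean Column built with functools.reduce and the & operator. This is just a sketch with a placeholder pattern list, and it should produce an equivalent plan to the one above:
from functools import reduce
from operator import and_

# Build one Column that is the AND of every negated rlike condition,
# then apply it in a single filter() call. `patterns` is a placeholder
# for however many patterns you actually have.
patterns = [pat1, pat2]
keep = reduce(and_, [~df['poi'].rlike(p) for p in patterns])
df.filter(keep).explain()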
Another approach is to combine all of the patterns into one by chaining them together with the regex or operator using "|".join(). The main difference is that this results in a single call to rlike (as opposed to one call per pattern in the other approaches):
df.filter(~df['poi'].rlike("|".join([pat1, pat2]))).explain()
#== Physical Plan ==
#*Filter NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$|[0-9]+ [a-z\s]+ street$
#+- Scan ExistingRDD[poi#297]
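The same idea scales to an arbitrary list of patterns. A minimal sketch (the (?:...) wrapping is not part of the original answer; it is just a defensive habit that keeps each pattern self-contained inside the combined alternation):
# Combine any number of patterns into a single regex and call rlike once.
# Wrapping each pattern in a non-capturing group (?:...) keeps its anchors
# and internal alternations from interacting with the other patterns.
patterns = [pat1, pat2]  # could just as well be 40 patterns
combined = "|".join("(?:{})".format(p) for p in patterns)
df.filter(~df['poi'].rlike(combined)).show()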