Finding a regex expression in pyspark?

Question

I have a column in a pyspark dataframe which contains key=value pairs separated by semicolons:

+----------------------------------------------------------------------------------+
|name                                                                              |
+----------------------------------------------------------------------------------+
|tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6;xmaslist=no|
+----------------------------------------------------------------------------------+

There can be any number of key-value pairs in this column. So if I use this:

from pyspark.sql.functions import col, regexp_extract

df.withColumn('test', regexp_extract(col('name'), '(?<=tppid=)(.*?);', 1)).show(1, False)

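I am able to extract tppid, but when tppid is the last key-value pair in a row, nothing is extracted, because the pattern requires a trailing ; after the value. I want a regex that extracts the value of a key regardless of its position in the row.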

Answer 1

You can use a negated character class, [^;], to match any character except ;:

tppid=([^;]+)


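Since the third argument to regexp_extract is 1 (it accesses the contents of Group 1), you can drop the lookbehind construct and use tppid= as part of the consuming pattern.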
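To check how this pattern behaves when the key is first, last, or missing, here is a minimal sketch using Python's re module rather than Spark itself. Spark evaluates the pattern with Java's regex engine, and I am assuming the negated character class and capture group behave the same way there. The helper name extract_tppid and the shortened sample values are made up for illustration.

```python
import re

# Pattern from the answer: capture everything after "tppid=" up to the
# next ";" or, if there is none, up to the end of the string.
PATTERN = re.compile(r"tppid=([^;]+)")


def extract_tppid(s):
    """Return the captured value, or "" when the key is absent
    (regexp_extract also returns an empty string on no match)."""
    m = PATTERN.search(s)
    return m.group(1) if m else ""


rows = [
    "tppid=abc123;xmaslist=no",  # key is the first pair
    "xmaslist=no;tppid=abc123",  # key is the last pair, no trailing ";"
    "xmaslist=no",               # key is absent
]
for row in rows:
    print(repr(extract_tppid(row)))
```

In pyspark the same pattern would be passed as regexp_extract(col('name'), 'tppid=([^;]+)', 1), keeping the group index at 1.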

Answer 2

In addition to Wiktor Stribiżew's answer, you can also use anchors. $ matches the end of the string.

tppid=\w+(?=;|\s|$)

This regex also extracts only the value, without the tppid= part:

(?<=tppid=)\w+(?=;|\s|$)
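Neither of these two patterns has a capture group, so the result is the whole match. With regexp_extract that probably means passing 0 instead of 1 as the group index. The sketch below checks both patterns with Python's re module, on the assumption that Java's regex engine (which Spark uses) treats these lookarounds the same way. The helper name whole_match and the sample strings are hypothetical.

```python
import re

# Both patterns rely on lookarounds instead of a capture group,
# so the value of interest is the whole match (group 0).
WITH_KEY = re.compile(r"tppid=\w+(?=;|\s|$)")
VALUE_ONLY = re.compile(r"(?<=tppid=)\w+(?=;|\s|$)")


def whole_match(pattern, s):
    """Return the entire match, or "" when the pattern does not match."""
    m = pattern.search(s)
    return m.group(0) if m else ""


first = "tppid=abc123;xmaslist=no"  # value followed by ";"
last = "xmaslist=no;tppid=abc123"   # value at the end of the string

print(whole_match(WITH_KEY, first))    # tppid=abc123
print(whole_match(WITH_KEY, last))     # tppid=abc123
print(whole_match(VALUE_ONLY, first))  # abc123
print(whole_match(VALUE_ONLY, last))   # abc123
```

Note that \w only covers letters, digits and underscore, so a value containing other characters (a hyphen, for example) would not be matched by these patterns; the [^;]+ form from Answer 1 is more permissive.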


python

regex

pyspark

pyspark-sql