Free text parsing using long regex formula leading to error: multiple repeat in python? Screenshot included

Question

I need to parse specific strings out of a free-text field in an .xlsx file. I am using Python 2.7 in Spyder.

I escaped the '.' in the regex patterns, but I still get the same error.

To do this, I use pandas to load the .xlsx file into a pandas DataFrame:

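import pandas as pd

data = "complaints_data.xlsx"
read_data = pd.read_excel(data)
read_data.dropna(inplace = False)  # no effect as written: the returned frame is not assigned
df = pd.DataFrame(read_data)
# strip commas from the free-text column before extracting
df['FMEA Assessment'] = df['FMEA Assessment'].replace({',':''}, regex=True)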

Then I use pandas' extract function with regex patterns to pull out my string fields FMEA, Rev, and Line.

fmea_pattern = r'(FMEA\s*\d*\d*\d*\d*\d*|fmea\s*\d*\d*\d*\d*\d*|DOC\s*\-*[0]\d*\d*\d*\d*\d*|doc\s*\-*[0]\d*\d*\d*\d*\d*)'
df[['FMEA']] = df['FMEA Assessment'].str.extract(fmea_pattern, expand=True)
    
rev_pattern = r'(Rev\.*\s+\D{1,2}+|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
df[['REV']] = df['FMEA Assessment'].str.extract(rev_pattern, expand=True)
    


line_pattern = r'(line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINES\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINE\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.)'
df[['LINE']] = df['FMEA Assessment'].str.extract(line_pattern, expand=True)

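The string fields I need to parse can be entered in many ways, and I accounted for each of them, and for every variation of each word, in the regex patterns; for example, I accounted for line, Line, LINE, lines, Lines, etc. I tested each regex pattern on its own and they work fine. However, when I combine them all in the code above, I get the following error message: "multiple repeat".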

Also, is there another way to account for variations of the same word at the same time (lower case, upper case, and title case)?

Answer 1

The main error here is that you are using a possessive quantifier instead of a regular, non-possessive one.

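This is a common mistake when people test their patterns in an online PCRE regex tester. Always test your regex in an environment (or with a regex engine option) that is compatible with your target environment.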

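Python's re module does not support possessive quantifiers (support was only added in Python 3.11, and the question uses Python 2.7):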

{5}+
{5,}+
{5,10}+
++
?+
*+
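As a minimal sketch of the failure (the sample text is made up, and the result depends on the interpreter version): before Python 3.11, compiling a pattern that contains a possessive quantifier raises re.error with the "multiple repeat" message, whereas the same pattern without the trailing + compiles. The helper name compiles is only for illustration.

```python
import re

possessive = r'Rev\.*\s+\D{1,2}+'   # trailing + makes {1,2} possessive
regular = r'Rev\.*\s+\D{1,2}'       # same pattern without the trailing +

def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Hypothetical sample text, not taken from the question's spreadsheet.
sample = "FMEA 12345 Rev. AB Line 12"

print(compiles(possessive))  # False before Python 3.11 ("multiple repeat")
print(compiles(regular))
print(re.search(regular, sample).group(0))
```

Running a check like this in the target interpreter is a quick way to confirm that a pattern copied from an online tester is actually accepted by re.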

In this case, you just need to remove the trailing + from \D{1,2}+:

rev_pattern = r'(Rev\.*\s+\D{1,2}|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'

It seems you could simply use either of these:

rev_pattern = r'((?:[Rr]ev|REV)\.*\s+\D{1,2})' # Will only match Rev, REV and rev at the start
rev_pattern = r'(?i)(Rev\.*\s+\D{1,2})' # Will match any case variations of Rev

See the regex demo on Regex101, and note the Python option selected on the left.

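Also, note that you can make the whole pattern case insensitive by adding (?i) at the start of the pattern, or by compiling the regex with the re.I or re.IGNORECASE argument. This will "account for variations of the same word at the same time (lower case, upper case and title case)".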
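A small sketch of the difference between spelling out the case variants and using a case-insensitive flag. The sample strings and the variable names are hypothetical; only the patterns come from the answer.

```python
import re

# Hypothetical sample strings; the real data lives in the question's spreadsheet.
samples = ["Rev. A", "REV B", "rev. C", "rEV D"]

# Explicit alternation: only Rev, rev and REV are accepted.
explicit = re.compile(r'((?:[Rr]ev|REV)\.*\s+\D{1,2})')
# Inline flag: the whole pattern ignores case.
inline_flag = re.compile(r'(?i)(Rev\.*\s+\D{1,2})')
# Same effect, with the flag passed at compile time (re.I is an alias).
compile_flag = re.compile(r'(Rev\.*\s+\D{1,2})', re.IGNORECASE)

for text in samples:
    print(text,
          bool(explicit.search(text)),
          bool(inline_flag.search(text)),
          bool(compile_flag.search(text)))
```

The mixed-case "rEV D" is matched only by the two case-insensitive versions. pandas' Series.str.extract also takes a flags argument, so passing flags=re.IGNORECASE there should give the same effect without editing the pattern.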

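NOTE: If you really want to use possessive quantifiers, you can emulate a possessive quantifier with the help of a positive lookahead and a backreference. However, in Python, you will need re.finditer to access the whole match values.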
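A minimal sketch of that emulation, assuming a made-up input string: the lookahead (?=(\D{1,2})) captures what \D{1,2} would match without consuming it, and \1 then consumes exactly that text. Because a lookahead is atomic, the engine does not backtrack into the captured part. The extra capturing group is also why re.finditer is needed: re.findall would return only the group contents.

```python
import re

# Emulates the possessive \D{1,2}+ with a lookahead plus a backreference.
emulated = re.compile(r'Rev\.*\s+(?=(\D{1,2}))\1')

# Hypothetical input, not from the question's data.
text = "Rev. AB and Rev. C1"

# findall returns only the captured group, not the whole match.
groups_only = emulated.findall(text)

# finditer yields match objects, so group(0) gives the whole match.
whole_matches = [m.group(0) for m in emulated.finditer(text)]

print(groups_only)    # ['AB', 'C']
print(whole_matches)  # ['Rev. AB', 'Rev. C']
```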

python

regex

parsing

pandas

spyder