What is the best regular expression to replace non-numeric characters in a string preceded by a certain phrase in Python?

Question

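I have to parse lists of names, addresses, etc. that were run through OCR and contain invalid/incorrect characters. On the state/ZIP line I need to identify a two-character state followed by a five-digit ZIP code, and remove any non-numeric characters from the ZIP code. I might have OK 7-41.03 at the end of the string, and I need to remove the hyphen and the period. I know that re.sub('[^0-9]+', '', '7-41.03') will remove the desired characters, but I only need it to replace characters in the number found at the end of the string, and only when that number is preceded by a two-character state wrapped in spaces, such as OK. It seems that as soon as I add anything to the regex as a lookbehind expression, I can no longer replace the characters. I came up with the following, but I think there must be a simpler expression to do this. Example: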

>>> import re
>>> s = 'AT&T RESOURCES, LEC\n15 EAST STH STREET, SUITE 2200\nTULSA, OK 7-41.03'
>>> s[:re.search(r'(?<= [A-Z]{2} )[0-9\.-]+$', s).start()] + \
...     re.sub(r'[^0-9]+', '', s[re.search(r'(?<= [A-Z]{2} )[0-9\.-]+$', s).start():])
'AT&T RESOURCES, LEC\n15 EAST STH STREET, SUITE 2200\nTULSA, OK 74103'

This works, but I'm looking for a simpler way. Thanks.

Answer 1

You need to use a re.sub callback:

import re

# Our text
s = 'AT&T RESOURCES, LEC\n15 EAST STH STREET, SUITE 2200\nTULSA, OK 7-41.03'

# Callback invoked by re.sub for each match
def repl(m):
    # Remove any non-digit chars from the matched text
    return re.sub(r'\D+', '', m.group(0))

# Find 2 capital letters wrapped in spaces and match the assumed zip code after them
# Pass the matches to repl
print(re.sub(r'(?<= [A-Z][A-Z] )\S+', repl, s))

I'm not a Python developer, but going by what I found, I hope the code above runs.
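One thing to watch with the callback approach: a pattern ending in \S+ is not anchored, so it will also rewrite any other token that follows two capital letters wrapped in spaces. Here is a minimal sketch of a stricter variant that keeps the `$` anchor from the question. The helper name `clean_trailing_zip` and the constant `ZIP_AT_END` are made up for illustration, and the pattern assumes the ZIP-like token contains only digits, periods and hyphens.

```python
import re

# Assumption: the token to clean is the last thing in the string and consists
# only of digits, periods and hyphens, preceded by " XX " (two capitals in spaces).
ZIP_AT_END = re.compile(r'(?<= [A-Z]{2} )[0-9.-]+$')

def clean_trailing_zip(text):
    """Strip non-digit characters from a ZIP-like token at the end of text."""
    # re.sub accepts a function as the replacement; it receives each match
    # object and returns the string to substitute for that match.
    return ZIP_AT_END.sub(lambda m: re.sub(r'\D+', '', m.group(0)), text)

s = 'AT&T RESOURCES, LEC\n15 EAST STH STREET, SUITE 2200\nTULSA, OK 7-41.03'
print(clean_trailing_zip(s))
```

Strings without a matching state and trailing number come back unchanged, so the helper should be safe to run over every line of a list.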

python

regex

python-2.7