How can I extract an address using a regex in Python, using a lookbehind to get the preceding 3-4 strings that hold the address?

Question

text = ' My uncle is admitted in the hospital. the address of the hospital is \n Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. '

Right now I am using the following regex, but it only returns 'Hills' instead of the desired output.

re.findall(r'(\w\S+\s+)(?=Hyderabad){3}', text)

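The output I want is: 'Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033.'

I want to write a regex that extracts the 3 to 4 strings preceding the city name ('Hyderabad' in this case), including any special characters present in the original string.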

Answer 1

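Why a regex is very likely the wrong approach

As Tim Roberts noted above, this is not a problem that regular expressions handle well. It needs a more powerful tool than a regex.

The methods used to identify addresses and split them into elements such as street address, city, and postal code should shed some light on how complex this problem is.

Your example suggests that what you actually want is extraction of information on entities like hospitals and/or their addresses. This can be handled with a Named Entity Recognition tool that has been trained to detect such entities in text.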
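As a rough illustration of the Named Entity Recognition route, here is a minimal sketch. It assumes spaCy and its small English model `en_core_web_sm` are installed; neither is named in the answer, and any NER tool could take their place. The helper names `entities_by_label` and `load_default_pipeline` and the label set are hypothetical choices made for this example.

```python
from collections import defaultdict

# Assumption: OntoNotes-style labels, as used by spaCy's English models
# (ORG = organisation, FAC = facility, GPE = city/state/country, LOC = location).
PLACE_LABELS = {"ORG", "FAC", "GPE", "LOC"}


def entities_by_label(nlp, text, labels=PLACE_LABELS):
    """Run an NER pipeline over `text` and group the entity texts by label.

    `nlp` is any callable that returns an object with an `.ents` sequence
    whose items expose `.text` and `.label_` (a spaCy pipeline does).
    """
    doc = nlp(text)
    found = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ in labels:
            found[ent.label_].append(ent.text)
    return dict(found)


def load_default_pipeline():
    """Load spaCy's small English model (assumed to be installed)."""
    import spacy  # imported lazily so the helper above works without spaCy
    return spacy.load("en_core_web_sm")
```

With spaCy available, `entities_by_label(load_default_pipeline(), text)` returns whatever organisations, facilities, and places the model detects. Which spans it finds in the hospital sentence depends on the model, so no output is shown here.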

How to build a lookahead regex

If you use the following regex:

r'((\w\S+\s+){1,6})(?=Hyderabad){3}'

it will extract what you want:

Apollo Health City Campus, Jubilee Hills,

See the linked test example. Note that the part of interest is the first capture group, not the whole matched text.
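The lookahead approach can be tried with a short script such as the sketch below. The helper name `words_before` and its `max_words` parameter are made up for this example. The pattern drops the `{3}` after the lookahead, because a lookahead is zero-width and repeating it does not change what matches. The inner group is made non-capturing so that group 1 is the only capture.

```python
import re

text = (' My uncle is admitted in the hospital. the address of the hospital is \n'
        ' Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. ')

# The pattern from the question (without the redundant {3}) allows only ONE
# word-plus-whitespace token before the lookahead, so a single word comes back.
print(re.findall(r'(\w\S+\s+)(?=Hyderabad)', text))  # ['Hills, ']


def words_before(city, text, max_words=6):
    """Return up to `max_words` whitespace-separated tokens preceding `city`.

    Hypothetical helper: the {1,max_words} quantifier lets the token group
    repeat, and the lookahead (?=city) anchors the end of the match.
    """
    pattern = rf'((?:\w\S+\s+){{1,{max_words}}})(?={re.escape(city)})'
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


print(words_before('Hyderabad', text))
# Apollo Health City Campus, Jubilee Hills,
```

The result is clean here only because exactly six tokens separate "is" from "Hyderabad". With a shorter address, the same pattern would also pick up ordinary words from the sentence, which is one reason the regex route is fragile.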

Answer 2

You can use a deque:

from collections import deque

text = ' My uncle is admitted in the hospital. the address of the hospital is Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. '

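def guess_address(needle, string):
    # Collect words in reverse order, starting at the needle (the city name).
    stack, started = [], False
    de = deque(string.split())

    while de:
        word = de.pop()  # take words from the right end of the text
        if word == needle:
            stack.append(word)
            started = True
        elif started and word[0].isupper():
            # capitalized words before the city are assumed to belong to the address
            stack.append(word)
        elif started and word[0].islower():
            # the first lowercase word marks the end of the address
            break

    return stack[::-1]  # restore the original word order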

stack = guess_address('Hyderabad', text)
print(stack)

This yields

['Apollo', 'Health', 'City', 'Campus,', 'Jubilee', 'Hills,', 'Hyderabad']
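If a single string is more convenient than a list, the same backward walk can be written without a deque. This is only a sketch of a variant, not the answer's code. The name `guess_address_str` is hypothetical, and it differs from the function above in two ways: it tolerates punctuation attached to the city name, and it stops at the first word that does not start with an uppercase letter.

```python
def guess_address_str(needle, string):
    """Walk backwards from `needle`, collecting capitalized words.

    Returns the collected words joined into one string, or '' if the
    needle does not occur in the text.
    """
    collected, started = [], False
    for word in reversed(string.split()):
        if not started:
            # strip punctuation so that e.g. 'Hyderabad,' still matches
            if word.strip('.,;:') == needle:
                collected.append(word)
                started = True
        elif word[0].isupper():
            collected.append(word)
        else:
            break  # first non-capitalized word ends the address
    return ' '.join(reversed(collected))


text = (' My uncle is admitted in the hospital. the address of the hospital is '
        'Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. ')
print(guess_address_str('Hyderabad', text))
# Apollo Health City Campus, Jubilee Hills, Hyderabad
```

Like the original, this relies on the capitalization heuristic, so a capitalized word directly before the address (for example at the start of a sentence) would be swept in as well.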


