Comparing strings to a text to set punctuation marks in the right places

Question

We need to match the punctuation of a text to phrases taken from that text:

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]

The output I need is:

output = [['apples, and donuts'], ['a donut, i would']]

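I'm a beginner, so I was thinking about using .replace(), but I don't know how to slice the string and access the exact part of the text that I need. Can you help me? (I'm not allowed to use any libraries.)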

Answer 1

You can try a regular expression:

import re

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
print([re.findall(i[0].replace(" ", r"\W*"), text) for i in phrases])

Output:

[['apples, and donuts'], ['a donut, i would']]

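The code iterates over the phrases list and replaces each space with \W*, which matches zero or more non-word characters. This lets re.findall locate each phrase in the text while tolerating any punctuation between its words.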
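
A slightly stricter variant of this pattern can be sketched as follows. It is an illustration, not part of the original answer: the helper name `find_with_punctuation` is hypothetical, `re.escape` is added so that phrases containing regex metacharacters are treated literally, and `\W+` (one or more) is assumed in place of `\W*` so that the words must remain separated in the text.

```python
import re

def find_with_punctuation(text, phrase):
    """Return every span of text that matches phrase, allowing
    punctuation and whitespace between the words of the phrase."""
    words = phrase.split()
    # Escape each word, then require at least one non-word character between words.
    pattern = r"\W+".join(re.escape(w) for w in words)
    return re.findall(pattern, text)

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]

result = [find_with_punctuation(text, p[0]) for p in phrases]
print(result)
```

With `\W+`, a text such as 'adonut' no longer matches the phrase 'a donut', which `\W*` would accept. Note that `re` is part of the standard library but is still an import, so it may not satisfy the constraint stated in the question.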

Answer 2

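You can strip all punctuation from the text and then use a plain substring search. The only remaining problem is how to map the match back to the original text.

To do that, record the original position in the text of every character that survives the stripping. Here is an example. I removed the nested list around each phrase because it looked unnecessary; you can easily account for it if you need it.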

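text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = ['apples and donuts', 'a donut i would']

def find_phrase(text, phrases):
    # clean_text is text without punctuation; indices[k] is the position in
    # the original text of the k-th character of clean_text.
    clean_text, indices = prepare_text(text)
    res = []
    for phr in phrases:
        i = clean_text.find(phr)
        if i != -1:
            # Slice the original text from the first to the last matched character.
            res.append(text[indices[i] : indices[i + len(phr) - 1] + 1])

    return res

def prepare_text(text, punctuation='.,;!?'):
    # Drop punctuation and remember where each kept character came from.
    s = ''
    ind = []
    for i, ch in enumerate(text):
        if ch not in punctuation:
            s += ch
            ind.append(i)
    return s, ind

if __name__ == "__main__":
    # Plain print is enough here, so no import is needed.
    print(find_phrase(text, phrases))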

['apples, and donuts', 'a donut, i would']
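
If the nested-list format from the question is required, the same position-mapping idea can be wrapped as in the sketch below, which uses no imports. The name `restore_punctuation` is hypothetical, each inner list is assumed to hold phrases as plain strings, and empty or unmatched phrases are assumed to be skipped.

```python
def restore_punctuation(text, phrases, punctuation='.,;!?'):
    # Keep every non-punctuation character together with its original index.
    kept = [(i, ch) for i, ch in enumerate(text) if ch not in punctuation]
    clean = ''.join(ch for _, ch in kept)
    positions = [i for i, _ in kept]

    result = []
    for group in phrases:
        found = []
        for phrase in group:
            # An empty phrase would match everywhere, so treat it as not found.
            start = clean.find(phrase) if phrase else -1
            if start != -1:
                end = start + len(phrase) - 1  # last matched character in clean
                found.append(text[positions[start]:positions[end] + 1])
        result.append(found)
    return result

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]

output = restore_punctuation(text, phrases)
print(output)  # [['apples, and donuts'], ['a donut, i would']]
```

Like the code above, this only reports the first occurrence of each phrase, and because it is a plain substring search it can also match in the middle of a longer word.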

python

string

slice

python-3.x