Comparing strings to a text to set punctuation marks in the right places
So we need to match punctuation marks from a text to phrases taken from that text:
text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
The output I need is:
output = [['apples, and donuts'], ['a donut, i would']]
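I'm a beginner, so I was thinking about using .replace(), but I don't know how to slice the strings and access the exact part I need from the text. Could you help me? (I'm not allowed to use any libraries.)
You can try regex.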
import re
text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
print([re.findall(i[0].replace(" ", r"\W*"), text) for i in phrases])
Output
[['apples, and donuts'], ['a donut, i would']]
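By iterating over the phrases list and replacing each space with \W*, the regex findall method can locate the search words while ignoring any punctuation between them.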
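One caveat with the pattern above: \W* also matches zero characters, so the words could match when run together, and any regex metacharacter inside a phrase would be interpreted rather than matched literally. A slightly stricter sketch follows. The helper name find_with_punctuation is made up for illustration, and it assumes every word in a phrase starts and ends with a letter, digit, or underscore so that \b behaves as expected.

```python
import re

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]

def find_with_punctuation(text, phrase):
    # Hypothetical helper, not part of the original answer.
    # Escape each word so characters such as "." or "?" are matched literally.
    words = [re.escape(word) for word in phrase.split()]
    # \W+ requires at least one non-word character between words;
    # \b keeps a match from starting or ending inside a longer word.
    pattern = r"\b" + r"\W+".join(words) + r"\b"
    return re.findall(pattern, text)

result = [find_with_punctuation(text, group[0]) for group in phrases]
print(result)  # [['apples, and donuts'], ['a donut, i would']]
```

Note that re is part of the standard library, so whether it counts as "a library" depends on how the restriction in the question is meant.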
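You can remove all punctuation from the text and then use a plain substring search. The only remaining problem is how to restore, or map, the found text back to the original text.
You can remember the original position in the text of each character that is kept in the searched text. Here's an example. I removed the nested list around each phrase because it seems useless; you can easily account for it if needed.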
from pprint import pprint

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = ['apples and donuts', 'a donut i would']

def find_phrase(text, phrases):
    # clean_text has no punctuation; indices[k] is the position in text
    # of the k-th character of clean_text.
    clean_text, indices = prepare_text(text)
    res = []
    for phr in phrases:
        i = clean_text.find(phr)
        if i != -1:
            # Slice the original text from the first to the last matched character.
            res.append(text[indices[i] : indices[i+len(phr)-1]+1])
    return res

def prepare_text(text, punctuation='.,;!?'):
    s = ''
    ind = []
    for i in range(len(text)):
        if text[i] not in punctuation:
            s += text[i]
            ind.append(i)
    return s, ind

if __name__ == "__main__":
    pprint(find_phrase(text, phrases))

Output
['apples, and donuts', 'a donut, i would']
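Since the question rules out libraries, here is a sketch of the same index-mapping idea with no imports at all, keeping the nested list shape from the question and collecting every occurrence of a phrase instead of only the first. The names strip_punctuation and restore_phrases are invented for this example, and the punctuation set is the same assumed '.,;!?' as above.

```python
def strip_punctuation(text, punctuation='.,;!?'):
    # Build the cleaned text and, for each kept character,
    # record its index in the original text.
    kept = []
    positions = []
    for index, char in enumerate(text):
        if char not in punctuation:
            kept.append(char)
            positions.append(index)
    return ''.join(kept), positions

def restore_phrases(text, phrases):
    clean_text, positions = strip_punctuation(text)
    results = []
    for group in phrases:
        found = []
        for phrase in group:
            start = clean_text.find(phrase)
            # Skip empty phrases; otherwise walk through every occurrence.
            while phrase and start != -1:
                end = start + len(phrase) - 1
                # Map the first and last matched characters back to the original.
                found.append(text[positions[start]:positions[end] + 1])
                start = clean_text.find(phrase, start + 1)
        results.append(found)
    return results

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
result = restore_phrases(text, phrases)
print(result)  # [['apples, and donuts'], ['a donut, i would']]
```

Because this is a plain substring search, a phrase can also match in the middle of a longer word; checking the characters on either side of the match would be one way to rule that out.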