How to retrieve the whole sentence around a selected word?
I want to find a selected word and extract everything from the first period (.) before it to the first period (.) after it.
Example:
In a file called 'text.php':
'The price of blueberries has gone way up. In the year 2038 blueberries have
almost tripled in price from what they were ten years ago. Economists have
said that berries may going up 300% what they are worth today.'
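Code example (I know that with code like this I can grab up to 5 words before the word ['that'] and up to 5 words after it, but I want everything between the periods before and after the word):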
import re

# Adjacent string literals are joined, so the text can span several lines.
text = ('The price of blueberries has gone way up, that might cause trouble for farmers. '
        'In the year 2038 blueberries have almost tripled in price from what they were ten years '
        'ago. Economists have said that berries may going up 300% what they are worth today.')
# Match up to five words on either side of 'that'.
find = re.search(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}that(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}", text)
done = find.group()
print(done)
Output:
'blueberries has gone way up, that might cause trouble for farmers'
I would like it to return every sentence that contains ['that'].
Example output (what I want to get):
'The price of blueberries has gone way up, that might cause trouble for farmers',
'Economists have said that berries may going up 300% what they are worth today'
This function should do the job:
old_text = 'test 1: test friendly, test 2: not friendly, test 3: test friendly, test 4: not friendly, test 5: not friendly'
replace_dict = {'test 1': 'tested 1', 'not': 'very'}
Function:
def replace_me(text, replace_dict):
    # Apply each replacement in turn to the whole string.
    for key in replace_dict.keys():
        text = text.replace(str(key), str(replace_dict[key]))
    return text
Result:
print(replace_me(old_text, replace_dict))
Out: 'tested 1: test friendly, test 2: very friendly, test 3: test friendly, test 4: very friendly, test 5: very friendly'
I would do it like this:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
for sentence in text.split('.'):
    if 'that' in sentence:
        print(sentence.strip())
The .strip() call is only there to trim the extra whitespace left over from splitting on the period (.).
If you really want to use the re module, I would use something like this:
import re

text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
results = re.findall(r"[^.]+that[^.]+", text)
# In Python 3, map() returns a lazy iterator, so build a list before printing.
results = [x.strip() for x in results]
print(results)
This finds the same sentences, returned as a list.
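A third way, closer to the wording of the question, is to locate each occurrence of the word and expand outwards to the nearest periods. The following is only a sketch: the function name sentence_around is made up for illustration, and it assumes that a plain period is the only sentence delimiter.

```python
import re

def sentence_around(text, word):
    """Return each distinct sentence that contains `word` as a whole word."""
    results = []
    for match in re.finditer(r"\b" + re.escape(word) + r"\b", text):
        # rfind returns -1 when there is no earlier period, so +1 gives 0.
        start = text.rfind(".", 0, match.start()) + 1
        end = text.find(".", match.end())
        if end == -1:  # no period after the word: run to the end of the text
            end = len(text)
        sentence = text[start:end].strip()
        if sentence not in results:  # skip a sentence already collected
            results.append(sentence)
    return results

text = ('The price of blueberries has gone way up, that might cause trouble for farmers. '
        'In the year 2038 blueberries have almost tripled in price from what they were ten '
        'years ago. Economists have said that berries may going up 300% what they are worth today.')
for sentence in sentence_around(text, 'that'):
    print(sentence)
```

Because the search starts from the match position, a sentence that mentions the word twice is still reported only once.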
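Notes:
If a sentence contains a word such as thatcher, that sentence will be printed too. In the first solution you can use if 'that' in sentence.split(): instead, so that the string is split into words (although this misses an occurrence directly followed by punctuation, such as that,). In the second solution you can use re.findall(r"[^.]+\bthat\b[^.]+", text) (note the \b markers; these represent word boundaries).
The script relies on the period (.) to delimit sentences. If a sentence itself contains words that use periods, the results may not be what you expect. For example, for the sentence Dr. Tom is sick yet again today, so I'm substituting for him. the script will treat Dr as one sentence and Tom is sick yet again today, so I'm substituting for him as another.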
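Both caveats can be softened with a little extra code. The sketch below is not a full sentence tokenizer: the ABBREVIATIONS set is a deliberately tiny, hypothetical list, and the helper names split_sentences and sentences_containing are made up for illustration. It splits only on sentence punctuation followed by whitespace, re-joins pieces that were cut right after a listed abbreviation, and matches the word with \b boundaries.

```python
import re

# Hypothetical, deliberately small list; extend it for real-world text.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace, then re-join
    any piece that was cut off right after a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece  # "Dr." + "Tom is sick ..."
        else:
            sentences.append(piece)
    return sentences

def sentences_containing(text, word):
    """Return the sentences that contain `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [s for s in split_sentences(text) if pattern.search(s)]

sample = ("Dr. Tom is sick yet again today, so I'm substituting for him. "
          "I met a thatcher today. He said that roofs leak.")
print(sentences_containing(sample, "that"))
```

Unlike text.split('.'), this keeps the closing punctuation on each sentence. Abbreviations that are not in the list are still split incorrectly.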
Edit: To answer the question in your comment, I would make the following changes:
Solution 1:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
sentences = text.split('.')
for i, sentence in enumerate(sentences):
    if 'almost' in sentence:
        before = '' if i == 0 else sentences[i-1].strip()
        middle = sentence.strip()
        after = '' if i == len(sentences)-1 else sentences[i+1].strip()
        # Drop empty neighbours so the output has no stray leading or trailing '. '.
        print(". ".join(part for part in (before, middle, after) if part))
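Solution 2:
import re

text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
# The optional trailing group is written as "\. [^.]+" so that it captures the
# following sentence; written as "[^.]+\. " it can only re-match the tail of the current one.
results = re.findall(r"(?:[^.]+\. )?[^.]+almost[^.]+(?:\. [^.]+)?", text)
# In Python 3, map() returns a lazy iterator, so build a list before printing.
results = [x.strip() for x in results]
print(results)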
Note that these may produce overlapping results. For example, if the text is a. b. b. c. and you look for sentences containing b, you will get a. b. b and b. b. c.
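If overlapping results are unwanted, one option is to collect the index range of each hit and merge ranges that overlap or touch before joining the sentences. This is a sketch only: the function name sentences_with_context and its context parameter are made up for illustration, and it splits on periods just like Solution 1.

```python
import re

def sentences_with_context(text, word, context=1):
    """Return each sentence containing `word` together with `context`
    neighbouring sentences on either side, merging overlapping groups."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    ranges = []  # list of [start, end) index pairs into `sentences`
    for i, sentence in enumerate(sentences):
        if not pattern.search(sentence):
            continue
        start = max(0, i - context)
        end = min(len(sentences), i + context + 1)
        if ranges and start <= ranges[-1][1]:
            # Overlaps or touches the previous group: extend it instead.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [". ".join(sentences[a:b]) + "." for a, b in ranges]

print(sentences_with_context("a. b. b. c.", "b"))
```

With the a. b. b. c. example, both hits fall into a single merged group, so the text is returned once instead of as two overlapping windows.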