Find a keyword in a text file and capture the n words after it

Question

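I'm building a basic text mining application and need to find a specific word (the keyword) and capture only the n words that follow it. For example, in the text below I want to capture the 3 words after the keyword POPULATION: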

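The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau’s application programming interface at the same geographic summary levels as those in the American Community Survey.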

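The next step will be to split the string and find the number, but I have already solved that part. I have tried different approaches (regex, etc.) without success. How can I do this?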

Answer 1

Split the text into words, find the index of the keyword, and grab the words at the following indices:

# Sample text from the question
text = 'The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau’s application programming interface at the same geographic summary levels as those in the American Community Survey.'
keyword = 'populations'
words = text.split()  # split on any run of whitespace
index = words.index(keyword)  # raises ValueError if the keyword is absent
wanted_words = words[index + 1:index + 4]  # the 3 words after the keyword

If you want to turn the three-word list wanted_words back into a string, use

wanted_text = ' '.join(wanted_words)
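
The split-and-slice steps above stop with a ValueError when the keyword is not in the text. A minimal sketch that wraps the same steps in a function follows; the name words_after and the choice to return an empty list on a miss are assumptions for illustration, not part of the original answer.

```python
def words_after(text, keyword, n):
    """Return up to n whitespace-separated words after the first keyword match."""
    words = text.split()
    try:
        index = words.index(keyword)
    except ValueError:
        # Keyword is not present as an exact token: nothing to capture.
        return []
    # Slicing past the end of a list is safe; it simply returns fewer words.
    return words[index + 1:index + 1 + n]


text = "geographies with populations of 20,000 people or more."
print(words_after(text, "populations", 3))  # ['of', '20,000', 'people']
print(words_after(text, "missing", 3))      # []
```

Because the text is split on whitespace only, punctuation stays attached to words, so a keyword followed directly by a comma or period would not match.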

Answer 2

You can use the nltk library.

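from nltk.tokenize import word_tokenize  # needs NLTK's tokenizer data to be installed

def sample(string, keyword, n):
    # The text is lowercased before matching, so pass the keyword in lowercase.
    output = []
    word_list = word_tokenize(string.lower())
    # Positions of every occurrence of the keyword
    indices = [i for i, x in enumerate(word_list) if x == keyword]
    for index in indices:
        # The n tokens that follow this occurrence
        output.append(word_list[index+1:index+n+1])
    return output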


>>> print(sample(string, 'populations', 3))
[['of', '20,000', 'people']]
>>> print(sample(string, 'tables', 3))
[['consist', 'of', '59'], ['tabulated', 'on', 'the']]
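
Where installing nltk and its tokenizer data is not an option, the same idea can be approximated with the standard re module. This is only a sketch: the token pattern and the name sample_re are assumptions, and the pattern is a rough stand-in, not NLTK's tokenization algorithm.

```python
import re

# Rough tokenizer (an assumption): runs of word characters, optionally joined
# by , . ' ’ or - so that "20,000" and "1-year" stay in one piece.
TOKEN_RE = re.compile(r"\w+(?:[,.'’-]\w+)*")


def sample_re(string, keyword, n):
    """Return the n tokens after every occurrence of keyword, ignoring case."""
    tokens = TOKEN_RE.findall(string.lower())
    keyword = keyword.lower()
    return [tokens[i + 1:i + 1 + n] for i, tok in enumerate(tokens) if tok == keyword]


string = ("The Supplemental Tables consist of 59 detailed tables tabulated on the "
          "2016 1-year microdata for geographies with populations of 20,000 "
          "people or more.")
print(sample_re(string, "populations", 3))  # [['of', '20,000', 'people']]
print(sample_re(string, "Tables", 3))       # [['consist', 'of', '59'], ['tabulated', 'on', 'the']]
```

Unlike whitespace splitting, this drops trailing punctuation, so "more." is matched as "more". It may split contractions and possessives differently from nltk.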

Answer 3

There are two ways to solve this.

1. Use jieba

jieba.cut

It can split your sentence into words.

Then just find 'populations' and take the next three words.

2. Use split

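raw = 'YOUR_TEXT_CONTENT'  # replace with the text to search
raw_list = raw.split(' ')
start = raw_list.index('populations')  # raises ValueError if the keyword is absent
print(raw_list[start+1:start+4])  # the three words after the keyword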
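
Since the question is about a text file, the split approach can be sketched end to end, including reading the file. The function name words_after_in_file, the UTF-8 default, and the demo file are assumptions for illustration. The sketch uses split() with no argument, because split(' ') leaves newlines attached to words and produces empty strings for repeated spaces.

```python
import tempfile
from pathlib import Path


def words_after_in_file(path, keyword, n, encoding="utf-8"):
    """Read a text file and return up to n words after the first keyword match."""
    # split() with no argument treats spaces, tabs and newlines alike
    # and never yields empty strings.
    words = Path(path).read_text(encoding=encoding).split()
    if keyword not in words:
        return []
    start = words.index(keyword)
    return words[start + 1:start + 1 + n]


# Demo with a temporary file containing a line break and a double space.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text("geographies with populations\nof  20,000 people or more.",
                    encoding="utf-8")
    print(words_after_in_file(demo, "populations", 3))  # ['of', '20,000', 'people']
```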

python

text-mining