Python NLTK extract sentence containing a keyword
My objective is to extract sentences from a text file that contain any of the words in my keyword list. My script cleans the text file, tokenizes it into sentences with NLTK, and removes stopwords. That part of the script works correctly and produces output that looks right:
['affirming updated 2020 range guidance long-term earnings dividend growth outlooks provided earlier month', 'finally look forward increasing engagement existing prospective investors months come', 'turn']
The part of the script I wrote to extract sentences containing keywords does not work the way I want. It extracts the keywords themselves rather than the sentences they appear in. The output looks like this:
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'impact', 'zone']
fileinC = nltk.sent_tokenize(fileinB)
fileinD = []
for sent in fileinC:
    fileinD.append(' '.join(w for w in word_tokenize(sent) if w not in allinstops))
fileinE = [sent.replace('\n', " ") for sent in fileinD]

# extract sentences containing keywords
fileinF = []
for sent in fileinE:
    fileinF.append(' '.join(w for w in word_tokenize(sent) if w in keywords))
The conditional join in your last line is most likely the cause of the problem: it keeps only the tokens that match a keyword and joins those, instead of keeping the sentence. Breaking it into smaller steps is more intuitive:
fileinF = []
for sent in fileinE:
    # tokenize and lowercase tokens of the sentence
    tokenized_sent = [word.lower() for word in word_tokenize(sent)]
    # if any item in the tokenized sentence is a keyword, append the original sentence
    if any(keyw in tokenized_sent for keyw in keywords):
        fileinF.append(sent)
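To see the difference between the two approaches in isolation, here is a minimal, self-contained sketch. It uses `str.split()` in place of `word_tokenize` so it runs without NLTK data, and assumes the keywords are lowercase; the sentences and keywords are toy examples, not the asker's data:

```python
# Toy data (hypothetical, for illustration only)
sentences = [
    "The impact zone was cleared quickly",
    "Investors will meet next month",
]
keywords = {"impact", "zone"}

# Original approach: the conditional join keeps only matching tokens,
# so each result is a fragment of keywords, not a sentence
matched_tokens = [
    ' '.join(w for w in sent.split() if w.lower() in keywords)
    for sent in sentences
]
# -> ['impact zone', '']

# Fixed approach: test the sentence for any keyword, then keep it whole
matched_sentences = [
    sent for sent in sentences
    if any(w.lower() in keywords for w in sent.split())
]
# -> ['The impact zone was cleared quickly']
```

The key change is moving the keyword test out of the `join` and into an `if` that guards appending the original, untouched sentence.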