List of lists of words in Python
I have a long list of reviews (say 50 of them), for example:
"this was the biggest disappointment of our trip. the restaurant had
received some very good reviews, so our expectations were high. the
service was slow even though the restaurant was not very full. I had
the house salad which could have come out of any sizzler in the us.
the keshi yena, although tasty reminded me of barbequed pulled
chicken. this restaurant is very overrated".
I want to use Python to create a list of words that preserves the sentence tokenization.
After removing stop words, I want the result for all 50 reviews, with sentence tokens preserved and word tokens kept within each tokenized sentence. In the end I would like the result to look like:
list(c("disappointment", "trip"),
c("restaurant", "received", "good", "reviews", "expectations", "high"),
c("service", "slow", "even", "though", "restaurant", "full"),
c("house", "salad", "come", "us"),
c("although", "tasty", "reminded", "pulled"),
"restaurant")
How could I do this in Python? Is R a better option in this case? Thanks a lot for your help.
Not sure if you need R for this, but based on your requirements I think it can be done in a purely Pythonic way as well.
You basically want a list that contains small lists of the important words (the ones that are not stop words) of each sentence.
So you could do something like this:
input_reviews = """
this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high.
the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us.
the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated.
"""
# load your stop words list here
stop_words_list = ['this', 'was', 'the', 'of', 'our', 'biggest', 'had', 'some', 'very', 'so', 'were', 'not']
def main():
    sentences = input_reviews.split('.')
    sentence_list = []
    for sentence in sentences:
        inner_list = []
        words_in_sentence = sentence.split(' ')
        for word in words_in_sentence:
            stripped_word = str(word).lstrip('\n')
            if stripped_word and stripped_word not in stop_words_list:
                # this is a good word
                inner_list.append(stripped_word)
        if inner_list:
            sentence_list.append(inner_list)
    print(sentence_list)

if __name__ == '__main__':
    main()
On my end, this outputs
[['disappointment', 'trip'], ['restaurant', 'received', 'good', 'reviews,', 'expectations', 'high'], ['service', 'slow', 'even', 'though', 'restaurant', 'full'], ['I', 'house', 'salad', 'which', 'could', 'have', 'come', 'out', 'any', 'sizzler', 'in', 'us'], ['keshi', 'yena,', 'although', 'tasty', 'reminded', 'me', 'barbequed', 'pulled', 'chicken'], ['restaurant', 'is', 'overrated']]
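As you can see, this still keeps trailing punctuation in tokens such as 'reviews,' and 'yena,'. If you want those gone as well, one option (a small sketch built on top of the code above, not part of the original answer; tokenize is just a hypothetical helper name) is to strip punctuation from each word before the stop-word check:

import string

def tokenize(review, stop_words):
    # split into sentences, then words; strip surrounding punctuation,
    # lowercase everything, and drop stop words and empty tokens
    sentence_list = []
    for sentence in review.lower().split('.'):
        words = [w.strip(string.punctuation + string.whitespace) for w in sentence.split()]
        kept = [w for w in words if w and w not in stop_words]
        if kept:
            sentence_list.append(kept)
    return sentence_list

print(tokenize(input_reviews, set(stop_words_list)))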
This is one way to do it. You may need to initialize stop_words to suit your application. I have assumed stop_words is in lowercase: hence, lower() is applied to the original sentences for the comparison. sentences.lower().split('.') gives the sentences, and s.split() gives the list of words in each sentence.
stokens = [list(filter(lambda x: x not in stop_words, s.split())) for s in sentences.lower().split('.')]
You may wonder why we use filter and lambda here. An alternative is the following, but it gives a flat list and is therefore not suitable:
stokens = [word for s in sentences.lower().split('.') for word in s.split() if word not in stop_words]
filter is a functional programming construct. It helps us process an entire list, in this case through an anonymous function written with the lambda syntax.
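For completeness, here is a small self-contained version of that approach (the stop_words set and sample text below are only illustrative placeholders):

stop_words = {'this', 'was', 'the', 'of', 'our', 'had', 'is', 'very'}
sentences = "this was the biggest disappointment of our trip. this restaurant is very overrated"

# one inner list per sentence, with stop words filtered out
stokens = [list(filter(lambda x: x not in stop_words, s.split()))
           for s in sentences.lower().split('.')]
print(stokens)
# [['biggest', 'disappointment', 'trip'], ['restaurant', 'overrated']]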
If you don't want to create the stop-word list by hand, I would recommend using the nltk library in Python. It also handles sentence splitting (rather than splitting on every period). An example that parses your sentences might look like this:
import nltk
stop_words = set(nltk.corpus.stopwords.words('english'))
text = "this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated"
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sentence_detector.tokenize(text.strip())
results = []
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    words = [t.lower() for t in tokens if t.isalnum()]
    not_stop_words = tuple([w for w in words if w not in stop_words])
    results.append(not_stop_words)

print(results)
Note, however, that this does not give exactly the output listed in your question, but rather something like this:
[('biggest', 'disappointment', 'trip'), ('restaurant', 'received', 'good', 'reviews', 'expectations', 'high'), ('service', 'slow', 'even', 'though', 'restaurant', 'full'), ('house', 'salad', 'could', 'come', 'sizzler', 'us'), ('keshi', 'yena', 'although', 'tasty', 'reminded', 'barbequed', 'pulled', 'chicken'), ('restaurant', 'overrated')]
If the output needs to look identical, you may need to add some stop words by hand.
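For instance, diffing the nltk output against your desired output suggests extending the stop-word set with something like the following before filtering (the exact words here are assumptions derived from your example, adjust as needed):

# extra words to exclude on top of nltk's English stop words (assumed from the example)
extra_stop_words = {'biggest', 'could', 'sizzler', 'keshi', 'yena', 'barbequed', 'chicken', 'overrated'}
stop_words = set(nltk.corpus.stopwords.words('english')) | extra_stop_words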