How to return only actual tokens, rather than empty lists, when tokenizing?

Question

I have a function:

from gensim.utils import simple_preprocess

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc), min_len=2) if word not in stop_words] for doc in texts]

My input is a list containing one tokenized sentence:

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

Assuming stop_words contains the words 'this', 'is', 'an', 'of' and 'my', the output I would like to get is:

desired_output = ['example', 'input']

However, the actual output I currently get is:

actual_output = [[], [], [], ['example'], [], [], ['input']]

How can I adjust my code to get the desired output?

Answer 1

If you have no particular reason to stick with your own code, you can use the following to remove stop words.

def remove_stopwords(text):
    # Build the result inside the function so repeated calls start from an empty list.
    words_filtered = []
    for w in text:
        if w not in stop_words:
            words_filtered.append(w)
    return words_filtered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

stop_words = ['This', 'is', 'an', 'of', 'my']

print(remove_stopwords(input))

Output:

['example', 'input']
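
Note that this loop compares tokens exactly, so 'This' is removed only because the stop word list also spells it with a capital T. Here is a minimal sketch of a case-insensitive variant, assuming the stop words are stored in lowercase as in the question. Passing stop_words as a second parameter and the name tokens are my own choices for illustration, not part of the original code.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not a stop word."""
    # A set gives fast membership tests; lowercasing both sides lets
    # 'This' match the stop word 'this'.
    stop_set = {w.lower() for w in stop_words}
    return [tok for tok in tokens if tok.lower() not in stop_set]

tokens = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
stop_words = ['this', 'is', 'an', 'of', 'my']

filtered = remove_stopwords(tokens, stop_words)
print(filtered)
```

The kept tokens retain their original casing; only the comparison is lowercased.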

Answer 2

There are two solutions to your problem:

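Solution 1:

Your remove_stopwords expects a list of documents and returns one list per document. You passed a single document, so each token was treated as a document of its own. You can wrap your input in a list like this: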

input = [['This', 'is', 'an', 'example', 'of', 'my', 'input']]
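
To see why the wrapping matters, here is a rough sketch that runs without gensim. toy_preprocess is a hypothetical stand-in that only imitates the lowercasing and minimum-length behaviour of simple_preprocess; it is not gensim's implementation, and the variable names are mine.

```python
import re

def toy_preprocess(doc, min_len=2):
    # Hypothetical stand-in for gensim's simple_preprocess: lowercase the
    # text and keep alphabetic runs of at least min_len characters.
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if len(t) >= min_len]

stop_words = {'this', 'is', 'an', 'of', 'my'}

def remove_stopwords(texts):
    # Produces one output list per element of texts.
    return [[w for w in toy_preprocess(str(doc)) if w not in stop_words]
            for doc in texts]

tokens = ['This', 'is', 'an', 'example', 'of', 'my', 'input']

per_token = remove_stopwords(tokens)         # every token is its own "document"
per_document = remove_stopwords([tokens])    # the whole list is one document

print(per_token)
print(per_document)
```

With the wrapped input the result is still nested, one inner list per document, so the flat list from the question would be per_document[0].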

Solution 2:

Change your remove_stopwords function so that it handles a single document:

def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2) if word not in stop_words]
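
If you would rather keep the per-document function and already have the nested result, another option is to flatten it afterwards. A small sketch using the actual_output from the question; the name flat is mine:

```python
from itertools import chain

actual_output = [[], [], [], ['example'], [], [], ['input']]

# chain.from_iterable concatenates the inner lists in order;
# empty inner lists contribute nothing.
flat = list(chain.from_iterable(actual_output))
print(flat)
```

This merges the tokens of all documents into one list, so it only fits when you do not need to keep the documents separate.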
