Remove custom stopwords

I'm trying to remove stopwords during an NLP preprocessing step. I use the remove_stopwords() function from gensim, but I'd also like to add my own stopwords.

# under this method, these custom stopwords still show up after processing
custom_stops = ["stopword1", "stopword2"]
data_text['text'].apply(lambda x: [item for item in x if item not in custom_stops])
# remove stopwords with gensim
data_text['filtered_text'] = data_text['text'].apply(lambda x: remove_stopwords(x.lower()))
# split the sentences into a list
data_text['filtered_text'] = data_text['filtered_text'].apply(lambda x: str.split(x))

Once the program has removed all the non-custom stopwords from the string, you can do the following to remove the custom stopwords:

custom_stops = ["stopword1", "stopword2"]

s = 'I am very stopword1 and also very stopword2!'

for c in custom_stops:
    s = s.replace(c,'').replace('  ',' ')

print(s)

Output:

I am very and also very !
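
One caveat with str.replace: it matches substrings, so a custom stopword that occurs inside a longer word would be cut out of that word as well. Below is a minimal sketch of a whole-word alternative, assuming the custom stopwords are plain words. The helper name remove_custom_stops is made up for illustration.

```python
import re

custom_stops = ["stopword1", "stopword2"]

# \b anchors each match at word boundaries, so "stopword1" inside a longer
# word is left alone; re.escape guards against regex metacharacters.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, custom_stops)) + r")\b")

def remove_custom_stops(text):
    # Hypothetical helper: drop whole-word matches, then collapse leftover whitespace.
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()

print(remove_custom_stops("I am very stopword1 and also very stopword2!"))
print(remove_custom_stops("stopword1s are kept"))
```

The first call gives the same result as the loop above, while the second string comes back unchanged because "stopword1" only appears as part of a longer word.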

I was able to get it working with the following:

# remove_stopwords lives in gensim's preprocessing module
from gensim.parsing.preprocessing import remove_stopwords

custom_stops = ["stopword1", "stopword2"]
# remove stopwords with gensim
data_text['filtered_text'] = data_text['text'].apply(lambda x: remove_stopwords(x.lower()))
# split the sentence into a list of tokens
data_text['filtered_text'] = data_text['filtered_text'].apply(lambda x: str.split(x))
# remove the custom stopwords from the token list
data_text['filtered_text'] = data_text['filtered_text'].apply(lambda x: [item for item in x if item.lower() not in custom_stops])
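
The original attempt appears to fail for two reasons: the result of the first apply is never assigned back to a column, and at that point each x is still a string, so the list comprehension iterates over characters rather than words. Below is a minimal sketch of the same steps folded into one function, without pandas. The helper name clean_tokens and the tiny fallback word list are assumptions for illustration; gensim's STOPWORDS set is used when gensim is installed.

```python
try:
    # gensim ships its default stopword list as a frozenset
    from gensim.parsing.preprocessing import STOPWORDS as base_stops
except ImportError:
    # Stand-in list so the sketch runs without gensim (an assumption, not gensim's list)
    base_stops = frozenset({"i", "am", "and", "also", "very", "the"})

custom_stops = {"stopword1", "stopword2"}
all_stops = base_stops | custom_stops

def clean_tokens(text):
    # Hypothetical helper: lowercase, split on whitespace, drop every stopword
    return [tok for tok in text.lower().split() if tok not in all_stops]

print(clean_tokens("The tokenizer is Stopword1 and also stopword2"))
```

With a DataFrame this would be applied as data_text['text'].apply(clean_tokens). Note that a token with punctuation attached, such as "stopword2!", would not match the set and would need to be stripped first.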