Removing stopwords when the sentence contains special characters

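I have an .xlsx file containing some raw text. I read the file into a DataFrame and then try to remove symbols and stopwords from it. I have implemented both steps, but I run into the following problem: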

This is how I remove symbols:

regex = r'[^\w\s]'
self.dataframe = self.dataframe.replace(regex, '', regex=True)

And this is how I remove stopwords:

self.dataframe[col] = column.apply(lambda x: ' '.join(
            [item for item in x.split() if item not in stops]))

Is there an elegant solution? Any suggestions are welcome.
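To illustrate how the two steps in the question can interact, here is a minimal sketch. It assumes a tiny, hypothetical stopword set (`STOPS`) and hypothetical helper names (`strip_symbols`, `drop_stopwords`); only the regex is taken from the question. It suggests that stripping symbols first can turn a contraction such as "They're" into "Theyre", which then no longer matches any stopword.

```python
import re

# Hypothetical, tiny stopword set for illustration; a real list is much longer.
STOPS = {"they're", "are", "a", "the"}

def strip_symbols(text):
    # Same pattern as in the question: drop everything that is
    # neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

def drop_stopwords(text, stops=STOPS):
    # Compare in lowercase so capitalised words are matched too.
    return " ".join(w for w in text.split() if w.lower() not in stops)

raw = "They're good bikes"

# Symbols first: the apostrophe is gone before the stopword check runs.
symbols_first = drop_stopwords(strip_symbols(raw))

# Stopwords first: the contraction is still intact when it is checked.
stopwords_first = strip_symbols(drop_stopwords(raw))

print(symbols_first)    # Theyre good bikes
print(stopwords_first)  # good bikes
```

Under these assumptions, the order of the two steps changes the result, so it may be worth checking which step runs first in the original pipeline.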

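We first have to replace contractions with their full forms to make the text more readable, e.g. replace they're with they are, I'd with I would, I'll with I will, won't with will not, and so on. Once the words are in a more readable form, the stopwords can be removed. The following example expands the contractions and then removes the stopwords.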

import re
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

sent = "I'll have a bike. They're good. I won't do. I'd be happy"

# Expand contractions in a single pass over the whole string (no loop needed).
sent_replace = re.sub(r"won't", "will not", sent)
sent_replace = re.sub(r"'re", " are", sent_replace)
sent_replace = re.sub(r"'d", " would", sent_replace)
sent_replace = re.sub(r"'ll", " will", sent_replace)

print('Before:', sent)
print('\nAfter:', sent_replace)

# Build the stopword set once; the NLTK list is lowercase, so compare in lowercase.
stops = set(stopwords.words('english'))
no_stop_words = ' '.join(item for item in sent_replace.split() if item.lower() not in stops)
print('\nNo stop words:', no_stop_words)

See the output screenshot below.
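The same idea can be wrapped into one reusable function that expands contractions, strips symbols, and then drops stopwords. The sketch below is only an illustration: `CONTRACTIONS`, `STOPS`, and `clean_text` are hypothetical names, the mapping is deliberately incomplete, and the small stopword set stands in for a real list such as NLTK's.

```python
import re

# Hypothetical, incomplete contraction rules; specific forms come first
# so that the generic "n't" rule does not mangle "won't" or "can't".
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'d", " would"),  # assumption: "'d" can also mean "had"
]

# Hypothetical stand-in for a real stopword list.
STOPS = {"i", "a", "will", "would", "not", "are", "they", "be", "have", "do"}

def clean_text(text, stops=STOPS):
    # 1. Expand contractions while the apostrophes are still present.
    for pattern, repl in CONTRACTIONS:
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    # 2. Remove symbols with the regex from the question.
    text = re.sub(r"[^\w\s]", "", text)
    # 3. Drop stopwords, comparing in lowercase.
    return " ".join(w for w in text.split() if w.lower() not in stops)

result = clean_text("I'll have a bike. They're good. I won't do. I'd be happy")
print(result)  # bike good happy
```

In the question's setup this could presumably be applied per column with `column.apply(clean_text)`, passing a set built from the real stopword list instead of the hypothetical `STOPS`.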