Removing stopwords when the sentence contains special characters

Question

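I have an .xlsx file containing some raw text. I read the file into a DataFrame and then try to remove symbols and stopwords from it. I have implemented functions for both steps, but I run into the following problem: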

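If I remove symbols before removing stopwords, things like "isnt" and "theyre" remain in the DataFrame.
If I remove stopwords before removing symbols, things like "(the" are not recognized as stopwords and also remain in the DataFrame.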

Here is how I remove the symbols:

regex = r'[^\w\s]'
self.dataframe = self.dataframe.replace(regex, '', regex=True)

And here is how I remove the stopwords:

self.dataframe[col] = column.apply(lambda x: ' '.join(
            [item for item in x.split() if item not in stops]))
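
Both orderings can be reproduced on a single string. This is only a minimal sketch: the small `stops` set and the helper names `strip_symbols` and `strip_stopwords` are stand-ins made up for illustration, not the actual code from the class above.

```python
import re

# Assumption: a tiny stand-in for the real `stops` collection.
stops = {"the", "isn't", "they're", "is", "a"}

def strip_symbols(text):
    # Same pattern as above: drop anything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]", "", text)

def strip_stopwords(text):
    # Same idea as above: keep only whitespace-separated tokens not in `stops`.
    return " ".join(w for w in text.split() if w not in stops)

text = "(the cat isn't a dog)"

# Symbols first: "isn't" turns into "isnt", which no longer matches a stopword.
symbols_first = strip_stopwords(strip_symbols(text))

# Stopwords first: "(the" does not match "the", so it survives until the symbols go.
stopwords_first = strip_symbols(strip_stopwords(text))

print(symbols_first)    # cat isnt dog
print(stopwords_first)  # the cat dog
```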

Is there an elegant solution to this? Any suggestions are welcome.

Answer 1

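We first need to replace the contractions with their full forms, for example they're with they are, I'd with I would, I'll with I will, and won't with will not. Once the words are written out in full, the stopwords can be removed. The following example expands the contractions and then removes the stopwords.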

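import re
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

sent = "I'll have a bike. They're good. I won't do. I'd be happy"

# Expand contractions while the apostrophes are still present.
# "won't" is irregular, so it is handled as a whole word.
sent_replace = re.sub(r"won't", "will not", sent)
sent_replace = re.sub(r"'re", " are", sent_replace)
sent_replace = re.sub(r"'d", " would", sent_replace)
sent_replace = re.sub(r"'ll", " will", sent_replace)

print('Before:', sent)
print('\nAfter:', sent_replace)

# Build the stopword set once; compare in lowercase because the NLTK list is lowercase.
stops = set(stopwords.words('english'))
no_stop_words = ' '.join(item for item in sent_replace.split() if item.lower() not in stops)
print('\nNo stop words:', no_stop_words)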
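
To cover the "(the" case from the question as well, the three steps can be combined in a fixed order: expand contractions, strip symbols, then drop stopwords. The following is a rough sketch under stated assumptions, not a complete solution: `CONTRACTIONS`, `STOPS` and `clean` are hypothetical names, and both tables are deliberately small subsets standing in for a fuller contraction table and a full stopword list such as NLTK's.

```python
import re

# Assumption: illustrative subsets only. Irregular forms such as "won't"
# need their own entry and must come before the generic "n't" rule.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'d", " would"),
]
STOPS = {"the", "is", "are", "not", "they", "will", "i", "a", "would", "be", "but"}

def clean(text):
    # 1. Expand contractions while the apostrophes are still present.
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # 2. Strip symbols, so "(the" becomes "the".
    text = re.sub(r"[^\w\s]", "", text)
    # 3. Drop stopwords, comparing in lowercase.
    return " ".join(w for w in text.split() if w.lower() not in STOPS)

print(clean("(The bikes) aren't cheap, but they're good. I won't sell."))
# bikes cheap good sell
```

Applied to the question's code, something like `self.dataframe[col] = self.dataframe[col].apply(clean)` should work, assuming every cell in the column is a string.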



nlp

nltk

pandas