How to strip punctuation from a string, except apostrophes, for NLP

Question

I am using the following "fastest" method to remove punctuation from a string:

text = file_open.translate(str.maketrans("", "", string.punctuation))

However, it removes all punctuation, including the apostrophes in tokens such as shouldn't, turning it into shouldnt.

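The problem is that I am using the NLTK library for stop words, and the standard stop word list does not include the apostrophe-free forms. Instead, it contains the tokens that NLTK's tokenizer would produce if I used it to split the text. For example, for shouldnt the stop words included are shouldn, shouldn't, t.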

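I could either add the extra stop words or remove the apostrophes from the NLTK stop words. But neither solution seems "correct", since I think the apostrophes should be kept when cleaning punctuation.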

Is there a way to keep the apostrophes while still cleaning punctuation quickly?

Answer 1

How about using

text = file_open.translate(str.maketrans(",.", "  "))

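and adding any other characters you want to ignore to the first string, with one more space in the second string for each, since str.maketrans requires both strings to have the same length.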
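
As a rough sketch of how this idea might be generalized (the names to_replace, table, and sample are illustrative, not from the answer), the first string can be built from string.punctuation minus the apostrophe, and the second from the same number of spaces. Replacing with spaces rather than deleting keeps words such as "Well,this" from being glued together, at the cost of a whitespace cleanup step afterwards:

```python
import string

# Every ASCII punctuation character except the apostrophe.
to_replace = string.punctuation.replace("'", "")

# Map each of those characters to a space; both strings must be equal length.
table = str.maketrans(to_replace, " " * len(to_replace))

sample = "Well,this shouldn't fail...right?"
spaced = sample.translate(table)

# Collapse the runs of spaces left behind by the replaced punctuation.
cleaned = " ".join(spaced.split())
print(cleaned)  # Well this shouldn't fail right
```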

Answer 2

>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"

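If many strings need cleaning, the translation table from the session above only has to be built once. The following is a minimal sketch under that assumption; make_punctuation_stripper and its keep parameter are hypothetical names introduced here, not part of the answer. Note that string.punctuation covers ASCII only, so typographic apostrophes such as ’ are left in place either way.

```python
import string

def make_punctuation_stripper(keep="'"):
    """Return a function that deletes ASCII punctuation except the characters in `keep`."""
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    table = str.maketrans("", "", to_delete)  # built once, reused on every call
    return lambda text: text.translate(table)

strip_punct = make_punctuation_stripper()
print(strip_punct("It's right, isn't it?"))  # It's right isn't it

# Keeping hyphens as well as apostrophes.
strip_keep_hyphen = make_punctuation_stripper(keep="'-")
print(strip_keep_hyphen("A well-known fact, isn't it?"))  # A well-known fact isn't it
```
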
Answer 3

Adapted from another answer.

import re

s = "This is a test string, with punctuation. This shouldn't fail...!"

text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)

This returns:

This is a test string with punctuation This shouldn't fail

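Regex explanation:

[^...] matches any single character not listed inside the brackets
\w matches any word character (for ASCII text, equal to [a-zA-Z0-9_])
\d matches a digit (equal to [0-9])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\' matches the character ' literally (case sensitive)
+ matches between one and unlimited times, as many times as possible, giving back as needed (greedy)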

You can try it here.
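
For repeated use, the pattern could be compiled once. This is only a sketch; PUNCT_RE and strip_punctuation are illustrative names. Since \w already matches digits, \d is dropped from the character class here without changing what is matched. Two consequences of relying on \w are worth knowing: underscores survive, and for str patterns in Python 3 \w is Unicode-aware, so accented letters are kept too.

```python
import re

# Anything that is not a word character, whitespace, or an apostrophe.
# \d is omitted because \w already covers digits.
PUNCT_RE = re.compile(r"[^\w\s']+")

def strip_punctuation(text):
    """Delete runs of punctuation, keeping apostrophes."""
    return PUNCT_RE.sub("", text)

s = "This is a test string, with punctuation. This shouldn't fail...!"
print(strip_punctuation(s))
# This is a test string with punctuation This shouldn't fail

# Underscores and non-ASCII letters count as word characters and are kept.
print(strip_punctuation("snake_case isn't touched; café neither!"))
# snake_case isn't touched café neither
```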

python

nlp

nltk