NLTK stop words not recognizing 'i' in a sentence
Here is the code. I want to remove all stop words from a sentence, but I still get the word 'i'.
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
en_stops=set(stopwords)
x='I am a good boy. I always pay by debts'
[item.lower().rstrip() for item in x.split() if item not in en_stops]
The output I get:
['i', 'good', 'boy.', 'i', 'always', 'pay', 'debts']
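NLTK's stop words are all lowercase, so each word has to be lowercased before the membership check, not after it. Changing the last line of the snippet makes it work: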
[item.rstrip() for item in x.lower().split() if item not in en_stops]
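To see why the order of lowercasing matters, here is a minimal sketch that compares the two approaches side by side. It uses a tiny hand-written stop word set, `EN_STOPS`, as an assumed stand-in for `set(stopwords.words('english'))` so that it runs without the NLTK corpus; the function names are hypothetical as well.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
EN_STOPS = {"i", "am", "a", "by"}


def filter_before_lowercasing(text):
    # Approach from the question: test the raw token, lowercase it afterwards.
    # 'I' is not in the lowercase set, so it passes the check and comes out as 'i'.
    return [item.lower() for item in text.split() if item not in EN_STOPS]


def filter_after_lowercasing(text):
    # Approach from the answer: lowercase first, then test membership.
    return [item for item in text.lower().split() if item not in EN_STOPS]


x = 'I am a good boy. I always pay by debts'
print(filter_before_lowercasing(x))  # ['i', 'good', 'boy.', 'i', 'always', 'pay', 'debts']
print(filter_after_lowercasing(x))   # ['good', 'boy.', 'always', 'pay', 'debts']
```

In both cases 'boy.' keeps its trailing period, because whitespace splitting leaves punctuation attached to the word and rstrip() with no arguments only strips whitespace.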
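Update:
As suggested in the comments, a more robust approach is to use NLTK's built-in tokenizers instead of str.split(), so that punctuation is handled as well. In that case the snippet looks like this: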
import string
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize

# Requires the NLTK 'stopwords' corpus and the Punkt tokenizer data (see nltk.download).
en_stops = set(stopwords.words('english'))
x = 'I am a good boy. I always pay by debts'

# Exclude stop words and single-character punctuation tokens.
exclusion_set = en_stops.union(string.punctuation)
tokenized_sentences = []
for sent in sent_tokenize(x):
    # Lowercase before tokenizing so tokens match the lowercase stop word list.
    tokenized_sentences.append([word for word in word_tokenize(sent.lower()) if word not in exclusion_set])
The tokenized sentences look like this:
[['good', 'boy'], ['always', 'pay', 'debts']]
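If the NLTK tokenizer data is not available, a rough dependency-free approximation is to strip punctuation from each whitespace-separated token. This is a swapped-in technique, not what word_tokenize does: it does not split sentences or contractions. `EN_STOPS` and `remove_stop_words` are hypothetical names, and the small set is again an assumed stand-in for the NLTK stop word list.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
EN_STOPS = {"i", "am", "a", "by"}


def remove_stop_words(text, stops=EN_STOPS):
    # Lowercase first so tokens match the lowercase stop word list,
    # then strip leading and trailing punctuation from each token.
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    # Skip tokens that were pure punctuation (now empty) and stop words.
    return [tok for tok in tokens if tok and tok not in stops]


x = 'I am a good boy. I always pay by debts'
print(remove_stop_words(x))  # ['good', 'boy', 'always', 'pay', 'debts']
```

Unlike the tokenizer version, this returns one flat list for the whole text instead of one list per sentence.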