How should I strip these tweets of words like "the" and "I"?

Question

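I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code, which should strip each tweet of its unwanted characters.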

from nltk.corpus import stopwords
import nltk
import json

with open("/Users/titus/Desktop/trumptweets.json",'r', encoding='utf8') as f:
    data = json.loads(f.readline())

tweets = []
for sentence in data:
    tokens = nltk.wordpunct_tokenize(sentence['text'])

    type(tokens)

    text = nltk.Text(tokens)
    type(text)
    words = [w.lower() for w in text if w.isalpha() and w not in 
                    stopwords.words('english') and w is not 'the']
    s = " "
    useful_sentence = s.join(words)
    tweets.append(useful_sentence)

print(tweets)

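I'm trying to remove words like "I" and "the", but for some reason I can't figure out how. If I look at the tweets after they have gone through the loop, the word "the" still occurs.

Question: How is it possible that "the" and "I" still occur in the tweets? How should I fix this?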

Answer 1

Have you tried lowercasing w in the check?

words = [w.lower() for w in text if w.isalpha() and w.lower() not in 
                    stopwords.words('english') and w.lower() != 'the']

Answer 2

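is (and is not) are (reference) identity checks. They test whether two variable names point to the same object in memory. Typically this is only used to compare with None, or in a few other special cases.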

In your case, use the != operator, or the negation of ==, to compare with the string "the".

See also: Is there a difference between `==` and `is` in Python?
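To make the distinction concrete, here is a minimal sketch. The variable names are made up for illustration and are not from the question's code. It builds the string "the" at runtime and compares it with a literal both ways. The result of the identity check depends on the interpreter's string caching, so it should not be relied on.

```python
# Hypothetical names, for illustration only.
literal = "the"
computed = "".join(["t", "h", "e"])  # same characters, built at runtime

values_equal = (computed == literal)  # compares the contents of the strings
same_object = (computed is literal)   # compares object identity only

print(values_equal)  # True
print(same_object)   # usually False in CPython, but this is an implementation detail
```

Because the identity result can vary, a filter condition written with `is not 'the'` may let "the" through even when the text matches; `!= 'the'` compares the contents instead.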

Answer 3

Mind the processing order.

Here are two test strings for you:

THIS THE REMAINS.

this the is removed

Because "THE" is not "the". You lowercase after filtering, but you should lowercase first and then filter.
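The order "lowercase first, then filter" can be sketched as follows. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny hand-written stand-in for `stopwords.words('english')` so that the snippet runs without the NLTK corpus, `clean_tweet` is a hypothetical helper name, and a simple regular expression replaces `nltk.wordpunct_tokenize` plus the `isalpha()` check.

```python
import re

# Assumption: a small stand-in stop-word set; in the real script this would be
# built once from NLTK, e.g. set(stopwords.words('english')).
STOP_WORDS = frozenset({"the", "i", "is", "this", "a", "and"})

def clean_tweet(text, stop_words=STOP_WORDS):
    """Lowercase every token first, then drop the stop words."""
    tokens = re.findall(r"[A-Za-z]+", text)   # keep alphabetic runs only
    lowered = [t.lower() for t in tokens]     # step 1: normalize case
    kept = [w for w in lowered if w not in stop_words]  # step 2: filter
    return " ".join(kept)

print(clean_tweet("THIS THE REMAINS."))    # remains
print(clean_tweet("this the is removed"))  # removed
```

With this order both test strings lose their stop words regardless of case. Building the stop-word set once, outside the loop, should also avoid calling `stopwords.words('english')` again for every token.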

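Bad news for you: k-means works very badly on short, noisy text such as Twitter data, because it is sensitive to noise and TF-IDF vectors need very long texts to be reliable. So validate your results carefully; they are probably not as good as they seem in the first enthusiasm.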

cluster-analysis

nltk

k-means

python-3.x