Remove punctuation and stop words from a data frame

Question

My data frame looks like this:

State                           text
Delhi                  170 kw for330wp, shipping and billing in delhi...
Gujarat                4kw rooftop setup for home Photovoltaic Solar...
Karnataka              language barrier no requirements 1kw rooftop ...
Madhya Pradesh         Business PartnerDisqualified Mailed questionna...
Maharashtra            Rupdaypur, panskura(r.s) Purba Medinipur 150kw...

I want to remove punctuation and stop words from this data frame. I wrote the code below, but it does not work:

import nltk
nltk.download('stopwords')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
% matplotlib inline
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
import re

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

df['text'] = df['text'].apply(message_cleaning)

AttributeError: 'set' object has no attribute 'words'

Answer 1

Problem: I think you have a name collision on `stopwords`. Somewhere in your notebook there is probably a line where you assigned something like:

stopwords = stopwords.words("english")

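That would explain the error, because the name `stopwords` becomes ambiguous: after the assignment it refers to your variable, not to the NLTK corpus object, so `stopwords.words(...)` no longer exists. Since the error message mentions a `set`, the assignment was probably wrapped in `set(...)`.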
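The shadowing can be reproduced in a few lines. This is only a sketch: to stay runnable without NLTK data, it uses a hypothetical stand-in object in place of `nltk.corpus.stopwords`, and the reassignment line is a guess at what the notebook contains.

```python
from types import SimpleNamespace

# Hypothetical stand-in for nltk.corpus.stopwords (assumption, so no download is needed).
stopwords = SimpleNamespace(words=lambda language: ["the", "and", "in"])

# Suspected notebook line: the name now points at a plain set, not the corpus object.
stopwords = set(stopwords.words("english"))

try:
    stopwords.words("english")  # a set has no .words attribute
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

Re-running the `from nltk.corpus import stopwords` cell restores the original binding, but the error returns as soon as the reassignment cell runs again.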

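Solution: make things unambiguous.

First, assign the stop words to a variable with its own name (incidentally, this is also faster than calling `stopwords.words` for every word):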

from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))

Then use it in your function:

Test_punc_removed_join_clean = [
    word for word in Test_punc_removed_join.split() 
    if word.lower() not in english_stop_words
]
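Put together, the whole function might look like the sketch below. It is not necessarily how the original notebook should be written: the small `english_stop_words` set here is a stand-in (an assumption, so the example runs without downloading NLTK data), and `str.translate` is swapped in for the character-by-character punctuation filter.

```python
import string

# Stand-in stop-word set (assumption); in the notebook this would be
# set(stopwords.words("english")).
english_stop_words = {"for", "and", "in", "no", "the", "a"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def message_cleaning(message, stop_words=english_stop_words):
    """Strip punctuation, then drop stop words; returns a list of tokens."""
    no_punct = message.translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]


# Visible part of the first row from the question.
cleaned = message_cleaning("170 kw for330wp, shipping and billing in delhi")
print(cleaned)
```

On the data frame this would be applied as before, with `df['text'] = df['text'].apply(message_cleaning)`. Note that the function returns a list of tokens per row; wrap the result in `' '.join(...)` if a cleaned string is wanted instead.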

nltk

python-3.x

pandas

scikit-learn