Trying to separate my data points into multiple arrays, instead of having one big array
I'm working on an NLP project dealing with fake news, where one of the inputs is the headline. I have tokenized my headlines into the following format:
[['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump'], ['Linklater', "'s", 'war', 'veteran', 'comedy', 'speaks', 'to', 'modern', 'America', ',', 'says', 'star'], ['Trump', '’', 's', 'Fight', 'With', 'Corker', 'Jeopardizes', 'His', 'Legislative', 'Agenda']]
Right now, each headline is its own array inside a 2D array. However, when I remove the stop words, it becomes:
['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump', 'Linklater', "'s", 'war', 'veteran', 'comedy', 'speaks', 'modern', 'America', ',', 'says', 'star', 'Trump', '’', 'Fight', 'With', 'Corker', 'Jeopardizes', 'His', 'Legislative', 'Agenda']
where every word is its own element in a single 1D array. I want each headline to keep its own array, just like the tokenized version. How do I do this?
Here is my code:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv("/Users/amanpuranik/Desktop/fake-news-detection/data.csv")
data = data[['Headline', "Label"]]

x = np.array(data['Headline'])
y = np.array(data["Label"])

# tokenization of the data here
headline_vector = []
for headline in x:
    headline_vector.append(word_tokenize(headline))
#print(headline_vector)

stopwords = set(stopwords.words('english'))  # note: this reassignment shadows the imported module

# removing stopwords at this part
filtered = []
for sentence in headline_vector:
    for word in sentence:
        if word not in stopwords:
            filtered.append(word)
You are iterating over every word and appending the words to the list one at a time, which is why the result flattens. Instead of appending each word, append the filtered list for each sentence. This may be clearer as a list comprehension (a loop-based version of the same fix is sketched at the end of this answer):
headline_vector = [['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump'], ['Linklater', "'s", 'war', 'veteran', 'comedy', 'speaks', 'to', 'modern', 'America', ',', 'says', 'star'], ['Trump', '’', 's', 'Fight', 'With', 'Corker', 'Jeopardizes', 'His', 'Legislative', 'Agenda']]
stopwords = set(["'s", "to", "His", ","])
filtered = [[word for word in sentence if word not in stopwords]
for sentence in headline_vector]
Result:
[['Four', 'ways', 'Bob', 'Corker', 'skewered', 'Donald', 'Trump'],
 ['Linklater', 'war', 'veteran', ...],
...etc
]
You can get the same effect using filter() (wrapped in list(), since in Python 3 filter() returns a lazy iterator):
filtered = [list(filter(lambda word: word not in stopwords, sentence))
for sentence in headline_vector]
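For reference, here is the same fix applied directly to the original loop; this is a minimal sketch reusing the question's variable names (headline_vector, stopwords). The key change is to build one list per headline and append that whole list, rather than appending individual words:

filtered = []
for sentence in headline_vector:
    filtered_sentence = []  # one list per headline
    for word in sentence:
        if word not in stopwords:
            filtered_sentence.append(word)
    filtered.append(filtered_sentence)  # append the whole headline's list, not each word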