Preprocessing a corpus stored in a DataFrame with NLTK

I am learning NLP, and I am trying to understand how to preprocess a corpus stored in a pandas DataFrame. So suppose I have this:

import pandas as pd

doc1 = """"Whitey on the Moon" is a 1970 spoken word poem by Gil Scott-Heron. It was released as the ninth track on Scott-Heron's debut album Small Talk at 125th and Lenox. It tells of medical debt and poverty experienced during the Apollo Moon landings. The poem critiques the resources spent on the space program while Black Americans were experiencing marginalization. "Whitey on the Moon" was prominently featured in the 2018 biographical film about Neil Armstrong, First Man."""
doc2 = """St Anselm's Church is a Roman Catholic church which is part of the Personal Ordinariate of Our Lady of Walsingham in Pembury, Kent, England. It was originally founded in the 1960s as a chapel-of-ease before becoming its own quasi-parish within the personal ordinariate in 2011, following a conversion of a large number of disaffected Anglicans in Royal Tunbridge Wells."""
doc3 = """Nymphargus grandisonae (common name: giant glass frog, red-spotted glassfrog) is a species of frog in the family Centrolenidae. It is found in Andes of Colombia and Ecuador. Its natural habitats are tropical moist montane forests (cloud forests); larvae develop in streams and still-water pools. Its habitat is threatened by habitat loss, introduced fish, and agricultural pollution, but it is still a common species not considered threatened by the IUCN."""

df = pd.DataFrame({'text': [doc1, doc2, doc3]})

This results in:

+---+---------------------------------------------------+
|   |                                              text |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+

Now I load what I need and tokenize the text:

import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

df['tokenized_text'] = df['text'].apply(word_tokenize)
df

which gives the following output:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, (, common, name, :, ... |
+---+---------------------------------------------------+---------------------------------------------------+

Now, my problem appears when removing stopwords:

df['tokenized_text'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in [stop_words] + list(string.punctuation)])

It looks like nothing happened:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, common, name, giant,... |
+---+---------------------------------------------------+---------------------------------------------------+

Can someone help me understand what is happening and what I should do?

After that, I would like to apply lemmatization, but in its current state it does not work:

lemmatizer = WordNetLemmatizer
df['tokenized_text'] = df['tokenized_text'].apply(lemmatizer.lemmatize)

which yields:

TypeError: lemmatize() missing 1 required positional argument: 'word'

Thanks!

First issue

With stop_words = set(stopwords.words('english')) and the test if word not in [stop_words], you wrapped the set in a list: [stop_words] is a one-element list whose only element is the entire set of stopwords. No individual word is ever equal to that whole set, so no stopword gets removed. It has to be:
stop_words = stopwords.words('english')
df['tokenized_text'].apply(lambda words: [word for word in words if word not in stop_words + list(string.punctuation)])
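To see why the extra brackets matter, here is a minimal sketch with a made-up stopword set and token list (chosen only for illustration):

```python
stop_words = {'a', 'is', 'on', 'the'}
tokens = ['Whitey', 'on', 'the', 'Moon']

# Buggy: [stop_words] is a one-element list whose single element is the
# whole set, so every token is compared against the entire set object.
buggy = [w for w in tokens if w not in [stop_words]]
print(buggy)  # ['Whitey', 'on', 'the', 'Moon'] -- nothing removed

# Fixed: test membership in the set itself.
fixed = [w for w in tokens if w not in stop_words]
print(fixed)  # ['Whitey', 'Moon']
```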

Second issue

With lemmatizer = WordNetLemmatizer you assigned the class itself, but you need to create an instance of that class: lemmatizer = WordNetLemmatizer()
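The resulting TypeError can be reproduced with any class, because calling a method through the class itself leaves self unbound. Here is a toy stand-in (not the real NLTK lemmatizer):

```python
class ToyLemmatizer:
    def lemmatize(self, word):
        return word.rstrip('s')  # crude stand-in for real lemmatization

lem = ToyLemmatizer          # class assigned, not instantiated
try:
    lem.lemmatize('frogs')   # 'frogs' is consumed as self, 'word' is missing
except TypeError as e:
    print(e)                 # ... missing 1 required positional argument: 'word'

lem = ToyLemmatizer()        # an instance: self is bound automatically
print(lem.lemmatize('frogs'))  # frog
```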

Third issue

You cannot lemmatize an entire list in one call; you need to lemmatize word by word: df['tokenized_text'].apply(lambda words: [lemmatizer.lemmatize(word) for word in words])