How do I use the NLTK sent_tokenize function to loop through a data frame column containing text?

I have the following data frame (df), which starts as a .csv file with several columns and is loaded into a Jupyter notebook. I want to use one of its columns as the corpus for an NLP script. When I try to run sent_tokenize (or even word_tokenize), I get an error. Below are my script and the resulting error:

import pandas as pd
my_data = pd.read_csv("my_data.csv")
my_data.head()  # one column in particular, "col5", will have my text data of interest
data = my_data  # assign it to a shorter generic variable
data_corpus = data["col5"]  # creates a separate data frame that I will use as my corpus of interest

TEXT_COLUMN = "col5"
text = data[TEXT_COLUMN]

corpus_tokenized = sent_tokenize(text)  # here is where I run into problems

The resulting error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
----> 1 corpus_tokenized = sent_tokenize(text)

NameError: name 'text' is not defined

# When I apply the sent_tokenize function to a single text instance, there are no problems:
from nltk.tokenize import sent_tokenize
this_sentence = "this sentence is in English"
sent_tokenize(this_sentence)

['this sentence is in English']

apply lets you apply a function row-wise or column-wise. So:

data["tokenized_text"] = data["col5"].apply(sent_tokenize)

should do it.
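For illustration, here is a minimal runnable sketch of the same apply pattern. A small regex-based splitter stands in for sent_tokenize so the snippet runs without downloading the NLTK punkt model; the toy frame and the helper function are assumptions for the example, but the call shape on the last line is identical to the answer above.

```python
import re
import pandas as pd

# Toy frame standing in for my_data; "col5" holds the raw text.
data = pd.DataFrame({"col5": ["First sentence. Second sentence.",
                              "Only one sentence here."]})

def split_sentences(text):
    # Stand-in for nltk.tokenize.sent_tokenize: split on
    # sentence-ending punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

# apply runs the function on each cell of the column, so each row
# of tokenized_text becomes a list of sentences.
data["tokenized_text"] = data["col5"].apply(split_sentences)
print(data["tokenized_text"].tolist())
```

With NLTK installed (and the punkt model downloaded via `nltk.download("punkt")`), the last line would simply be `data["tokenized_text"] = data["col5"].apply(sent_tokenize)`.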