How do I use the NLTK sent_tokenize function to loop through a data frame column containing text?

I have the following data frame (df), which starts as a .csv file with several columns and is loaded into a Jupyter notebook. I want to use one of its columns as the corpus for an NLP script. When I try to run sent_tokenize (or even word_tokenize), I get an error. Below are my script and the resulting error:

import pandas as pd
my_data = pd.read_csv("my_data.csv")
my_data.head()  # one column in particular, "col5", will have my text data of interest
data = my_data  # assign it to a shorter generic variable
data_corpus = data["col5"]  # creates a separate data frame that I will use as my corpus of interest

TEXT_COLUMN = "col5"
text = data[TEXT_COLUMN]

corpus_tokenized = sent_tokenize(text)  # here is where I run into problems

The resulting error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
----> 1 corpus_tokenized = sent_tokenize(text)

NameError: name 'text' is not defined

# When I apply the sent_tokenize function to a single text instance, there are no problems:
from nltk.tokenize import sent_tokenize
this_sentence = "this sentence is in English"
sent_tokenize(this_sentence)

['this sentence is in English']

apply lets you apply a function row-wise or column-wise. So:

data["tokenized_text"] = data["col5"].apply(sent_tokenize)

should do it.
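For illustration, here is a minimal runnable sketch of the same apply pattern. A small regex-based splitter stands in for sent_tokenize so the snippet runs without downloading the NLTK punkt model; the toy frame and the helper function are assumptions for the example, but the call shape on the last line is identical to the answer above.

```python
import re
import pandas as pd

# Toy frame standing in for my_data; "col5" holds the raw text.
data = pd.DataFrame({"col5": ["First sentence. Second sentence.",
                              "Only one sentence here."]})

def split_sentences(text):
    # Stand-in for nltk.tokenize.sent_tokenize: split on
    # sentence-ending punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

# apply runs the function on each cell of the column, so each row
# of tokenized_text becomes a list of sentences.
data["tokenized_text"] = data["col5"].apply(split_sentences)
print(data["tokenized_text"].tolist())
```

With NLTK installed (and the punkt model downloaded via `nltk.download("punkt")`), the last line would simply be `data["tokenized_text"] = data["col5"].apply(sent_tokenize)`.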