How do I use the NLTK sent_tokenize function to loop through a data frame column containing text?
I have the following data frame (df), which started as a .csv with several columns and was loaded into a Jupyter notebook. I want to use one of its columns as the corpus for an NLP script. When I try to run sent_tokenize (or even word_tokenize), I get an error. Below are my script and the resulting error:
import pandas as pd
my_data = pd.read_csv("my_data.csv")
my_data.head() # one column in particular, "col5", will have my text data of interest
data = my_data # to feed it into a shorter generic variable
data_corpus = data["col5"] # creates a separate Series that I will use as my corpus of interest
TEXT_COLUMN = "col5"
text = data[TEXT_COLUMN]
corpus_tokenized = sent_tokenize(text) # here is where I am running problems
The error produced:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
----> 1 corpus_tokenized = sent_tokenize(text)
NameError: name 'text' is not defined
# When I apply the sent_tokenize function to a single text instance, there are no problems:
this_sentence = "this sentence is in English"
sent_tokenize(this_sentence)
['this sentence is in English']
apply lets you apply a function row-wise or column-wise. So:
data["tokenized_text"] = data["col5"].apply(sent_tokenize)
should work.