How to specify a text column when reading a csv file?
I read in a csv file, and here is str() of the result:
$ an_id : int 4840 41981 40482 37473 33278 29083 30940 29374 24023 23922 ...
The an_id column seems to be int, so I convert it to chr with:
df$an_id <- paste0("doc_", df$an_id)
However, when I run this command, I get this error:
toks <- corpus(df, docid_field = "an_id") %>%
  tokens()
Error in corpus.data.frame(df, docid_field = "an_id") :
column name text not found
Is there a different way to read the file, or to pass the column in as text?
If I save this data to a csv file, read the file back in, and run the commands, they work fine:
dtext <- data.frame(id = c(1, 2, 3, 4),
                    text = c("here",
                             "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.",
                             "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.",
                             "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),
                    stringsAsFactors = FALSE)
As mentioned in @Nathalie's comment, if the data is in a data.frame, the following does the trick: docid_field should refer to the document ID column, and text_field should refer to the column that contains the text.
toks <- corpus(df,
               docid_field = "an_id",
               text_field = "text") %>%
  tokens()
str(toks)
List of 4
$ doc_1: chr "here"
$ doc_2: chr [1:39] "This" "dataset" "contains" "movie" ...
$ doc_3: chr [1:36] "The" "core" "dataset" "contains" ...
$ doc_4: chr [1:105] "There" "are" "two" "top-level" ...
- attr(*, "types")= chr [1:102] "here" "This" "dataset" "contains" ...
- attr(*, "padding")= logi FALSE
- attr(*, "class")= chr "tokens"
- attr(*, "what")= chr "word"
- attr(*, "ngrams")= int 1
- attr(*, "skip")= int 0
- attr(*, "concatenator")= chr "_"
- attr(*, "docvars")='data.frame': 4 obs. of 0 variables
Data:
df <- structure(list(an_id = c("doc_1", "doc_2", "doc_3", "doc_4"),
text = c("here", "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.",
"The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.",
"There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."
)), row.names = c(NA, -4L), class = "data.frame")
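Alternatively, the type issue can be avoided at read time by forcing the id column to character when the csv is loaded, so the paste0() conversion is not needed at all. A minimal sketch assuming the readr package; the file name my_file.csv is hypothetical:

library(readr)
library(quanteda)

# read an_id (and text) as character directly instead of letting the types be guessed
df <- read_csv("my_file.csv",
               col_types = cols(an_id = col_character(),
                                text = col_character()))

toks <- corpus(df, docid_field = "an_id", text_field = "text") %>%
  tokens()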