read.csv resulted in more rows than are really there

Question

When I try to read.csv this dataset in R, the output has more rows than the dataset actually contains:

setwd("D:/yelp_dataset")
data1=read.csv("star3650000c.csv",sep=",",header=TRUE,fill=TRUE,quote=" 
",na.strings=c("NA","?"),dec=".",comment.char=" 
",stringsAsFactors=FALSE)

What should I do?

Answer 1

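I think the main reason read.table does not work here is that your definitions of the quote and comment characters include a line break (at least as far as the things you can control go; if your data is corrupted, you are usually out of luck). You can set them to sensible values as shown below. Note that I have set header = FALSE to make it easier to inspect the final output.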

character_with_line_break = " 
"
# note that the line break is actually included in your character as "\n"
character_with_line_break
# [1] " \n"
# try read with different values for quote and comment characters
df <- read.csv("yelp.csv",
               sep = ",",
               header = FALSE,
               fill = TRUE,
               quote = "\"",
               na.strings = c("NA", "?"),
               dec = ".",
               comment.char = "",
               stringsAsFactors = FALSE)
# there is still something wrong with the last line, 
# would have to investigate this further (probably missing EOL marker)
# but the final output looks good (see further down)
# Warning message:
#   In read.table(file = file, header = header, sep = sep, quote = quote,  :
#                   incomplete final line found by readTableHeader on 'yelp.csv'
dim(df)
# [1]  4 10
data.frame(lapply(df, function(x) substr(x, 1, 10)))
# V1         V2 V3       V4 V5         V6 V7         V8 V9        V10
# 1  0 uQJ5RNygSe  2 8/4/2011  1 afEfPToTLj  5 I took my   2 uiZMpQSqJ4
# 2  1 VcGyezSNtk  4 1/4/2011  1 lGLLA08Ql4  5 Delicious!  5 uiZMpQSqJ4
# 3  2 39YKi45Pet  1 8/9/2013  0     #NAME?  5 After many  1 uiZMpQSqJ4
# 4  3 UTTTKI61dC  4 3/9/2012  1 Ly5ky2bAoJ  5 Love this  10 uiZMpQSqJ4
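
To see why the quote setting changes the row count, here is a minimal, self-contained sketch. It does not use your file: the sample records, the temporary file, and the names tmp, bad and good are made up for illustration, and quote = "" is only a simple stand-in for a quote setting that does not match the quote character actually used in the file. I am assuming your review texts contain line breaks inside quoted fields, which is common for this kind of data.

```r
# Hypothetical sample data: 3 records, where the text of record 2
# contains a line break inside a quoted field.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'id,stars,text',
  '1,5,"Delicious!"',
  '2,1,"After many visits',
  'I gave up."',
  '3,4,"Love this place"'
), tmp)

# Quoting disabled (stand-in for a wrong quote setting): the line break
# inside record 2 is treated as the end of a row, and fill = TRUE pads
# the leftover fragment into an extra row.
bad <- read.csv(tmp, header = TRUE, fill = TRUE, quote = "",
                comment.char = "", stringsAsFactors = FALSE)

# Standard double-quote character: the line break stays inside the field.
good <- read.csv(tmp, header = TRUE, fill = TRUE, quote = "\"",
                 comment.char = "", stringsAsFactors = FALSE)

nrow(bad)   # one row more than there are records
nrow(good)  # matches the number of records

# Diagnostic: number of fields per physical line when quotes are ignored.
# Lines with fewer fields than the header are candidates for broken records.
count.fields(tmp, sep = ",", quote = "")
```

If count.fields reports lines with fewer fields than the header on your real file, that is a hint that records are being split across lines. Note that fill = TRUE hides this problem by silently padding short rows, so it can be worth reading once with fill = FALSE while you debug.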
