Error reading a field with double quotes and commas using fread
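I have a large csv file with 19 columns of mixed character/numeric data.
After running fread, I get an error message saying that one of my numeric columns is being converted to character because a field has the value "". I then opened the data in a text editor and found the source of the problem. In one row, a character column appears as: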
"""PARENTS"", ""Y.M."", AND ""EXPECTING"""
which corresponds to the string:
"PARENTS", "Y.M.", AND "EXPECTING"
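That is:
- The first quote is the opening string delimiter
- Each doubled quote ("") that follows stands for a single literal quote
- The last quote is the closing string delimiter.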
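From what I have seen before, fread reads this by converting "" to \". The problem in this case is that the string also contains commas. These are interpreted as delimiters, which throws off my column order and pushes the later character columns into my numeric fields.
Is there a way to prevent this, or should I use another package?
Note: I have looked around for solutions and get the sense that "" + fread is a common source of frustration, but I have not seen an example with the added complication of commas.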
Reproducible example:
Put the following in a text file:
"A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S"
"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"
Read the data:
DT <- fread("myfile.csv",
            colClasses = c(rep("character", 5),
                           rep("numeric", 2),
                           rep("character", 12)),
            sep = ",")
With the recent fixes to fread() in the current development version, v1.9.5, this is what I get:
require(data.table) #v1.9.5+
fread('A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S
"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"')
#            A                                             B                        C
# 1: 168263291 Gruner & Jahr Printing and Publishing Company Parents Ym and Expecting
#                                           D E F G        H         I
# 1: ""PARENTS"", ""Y.M."", AND ""EXPECTING"" 0 0 3 73130201 055302756
#                     J         K  L         M         N            O     P
# 1: Quad/Graphics Inc. 013034588 02 093671063 000000000 Unclassified 94133
#                Q          R             S
# 1: San Francisco CALIFORNIA UNITED STATES
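fread() now handles embedded quotes more robustly, strips whitespace by default (new strip.white argument, default=TRUE), and has also gained an encoding argument. See the README on the project page for the latest news.
If your issue is still not resolved, please let us know (here or on the project page) with a reproducible example.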
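If upgrading data.table is not an option, base R's read.csv may serve as a fallback, since it treats a doubled quote inside a quoted field as one literal quote and keeps embedded commas inside the field. A minimal sketch, using a trimmed-down version of the sample row (the column names ID, TITLE, COUNT and CODE are made up for illustration and are not from the original file):

```r
# Trimmed-down version of the sample row; column names are illustrative only.
csv_text <- paste(
  'ID,TITLE,COUNT,CODE',
  '"168263291","""PARENTS"", ""Y.M."", AND ""EXPECTING""",3,"055302756"',
  sep = "\n"
)

# Read every column as character so leading zeros survive,
# then convert the numeric column explicitly.
df <- read.csv(text = csv_text, colClasses = "character")
df$COUNT <- as.numeric(df$COUNT)

print(ncol(df))   # number of columns parsed from the row
print(df$TITLE)   # the embedded commas stay inside this one field
```

Unlike the fread output above, read.csv collapses each "" into a single ", so TITLE comes back as "PARENTS", "Y.M.", AND "EXPECTING". On a large file it is likely to be noticeably slower than fread.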