Importing from CSV from a specified range of values

Question

I am trying to read a CSV file, but I get the following error when I run my code.

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1097 did not have 5 elements

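On inspecting the CSV file further, I found that there is a break around line 1097, after which a new header starts for the annualized data (for now I am only interested in the monthly data).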

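# Download the zip archive to a temporary file and extract the CSV
temp <- tempfile()
download.file("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip", temp, mode = "wb")
unzip(temp, "F-F_Research_Data_Factors.CSV")
# Skip the 3 lines before the header and read only the first 100 data rows
French <- read.table("F-F_Research_Data_Factors.CSV", sep = ",", skip = 3, header = TRUE, nrows = 100)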

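The code above downloads the zip file and imports the first 100 rows of the CSV file into R, which works fine. However, the first 100 rows (used here for illustration) are data points from the 1920s and 1930s, which is not what I am particularly interested in.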

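My question is: how can I import the data based on the values in the first comma-separated field of the CSV file, i.e. from 192607 (1926-07) up to 195007 (1950-07)? I am able to import the most recent values by changing to nrows = 1095, but that is not what I am trying to achieve.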

Snapshot of the data:

,Mkt-RF,SMB,HML,RF
192607,    2.96,   -2.30,   -2.87,    0.22
192608,    2.64,   -1.40,    4.19,    0.25
192609,    0.36,   -1.32,    0.01,    0.23

... line 1100

 Annual Factors: January-December 
,Mkt-RF,SMB,HML,RF
  1927,   29.47,   -2.46,   -3.75,    3.12
  1928,   35.39,    4.20,   -6.15,    3.56

Answer 1

I used read.csv instead of read.table

French <- read.csv("F-F_Research_Data_Factors.CSV", sep = ",", skip = 3, header = TRUE)

and got 1188 observations. I think you can subset the dataset from here.
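A minimal sketch of what that subsetting could look like, run on a small inline sample instead of the real file. The names sample_text, ym, keep and monthly are made up for illustration, and the sample only reuses the rows shown in the question. Because the whole file is read in one go, the first column arrives as text (it also holds the annual section's title and header), so it is converted to numbers before filtering:

```r
# Inline stand-in for the file contents after skip = 3 (assumed layout,
# rows copied from the snapshot in the question)
sample_text <- c(
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "192609,    0.36,   -1.32,    0.01,    0.23",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12",
  "  1928,   35.39,    4.20,   -6.15,    3.56"
)
French <- read.csv(text = sample_text, stringsAsFactors = FALSE)

# Non-numeric entries (section title, repeated header) become NA
ym <- suppressWarnings(as.numeric(French$X))

# Keep only rows whose YYYYMM value lies in the requested range;
# the annual rows (1927, 1928) fall outside it and are dropped too
keep <- !is.na(ym) & ym >= 192607 & ym <= 195007
monthly <- French[keep, ]

# The remaining columns were read as text, so convert them back to numbers
monthly[-1] <- lapply(monthly[-1], as.numeric)
monthly$X <- ym[keep]
```

With the real file, the same steps should apply to the French data frame returned by the read.csv call above, assuming its first column is named X as it is here.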

Answer 2

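The first table in the file lies between the first two zero-length lines, so this reads it in without the junk before and after it, and then subsets it to the indicated dates: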

# read first table in file
Lines <- readLines("F-F_Research_Data_Factors.CSV")
ix <- which(Lines == "")
DF0 <- read.csv(text = Lines[ix[1]:ix[2]])  # all rows in first table

# subset it to indicated dates
DF <- subset(DF0, X >= 192607 & X <= 195007)
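To see the mechanics on something small, here is a rough sketch that wraps the same idea in a helper and runs it on an inline sample. read_first_table is a hypothetical name, and the two "preamble" lines are stand-ins for whatever text precedes the first blank line in the real file:

```r
# Toy stand-in for readLines() output (assumed layout: preamble, blank line,
# monthly table, blank line, annual table)
Lines <- c(
  "preamble line 1",
  "preamble line 2",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "192609,    0.36,   -1.32,    0.01,    0.23",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12"
)

# Hypothetical helper: read the table between the first two blank lines
# and keep the rows whose first column lies in [from, to]
read_first_table <- function(lines, from, to) {
  ix <- which(lines == "")
  stopifnot(length(ix) >= 2)                   # need both delimiters
  tab <- read.csv(text = lines[ix[1]:ix[2]])   # blank lines are skipped
  tab[tab$X >= from & tab$X <= to, ]
}

DF <- read_first_table(Lines, 192608, 192609)
```

Plain indexing is used inside the helper rather than subset() only to keep the variable lookup explicit; the filtering is the same as in the answer.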

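Note: If we want all the tables, it looks like each table starts with a line beginning with a comma and ends with a blank line (except that the first blank line comes before any of the tables). So, using Lines from above, this gives a list L whose i-th component is the i-th table in the file.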

st <- grep("^,", Lines)  # starting line numbers
en <- which(Lines == "")[-1]  # ending line numbers
L <- Map(function(st, en) read.csv(text = Lines[st:en]), st, en)
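One caveat: pairing starts and ends positionally relies on every table being followed by a blank line. A hedged variant, in case the last table runs to the end of the file, is to look up the first blank line after each start and fall back to the last line otherwise. This is only a sketch on a toy sample (the preamble lines are stand-ins, and blank, st, en are illustrative names), not something verified against the real file:

```r
# Toy stand-in for readLines() output; note there is no trailing blank line
Lines <- c(
  "preamble line 1",
  "preamble line 2",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25",
  "192609,    0.36,   -1.32,    0.01,    0.23",
  "",
  " Annual Factors: January-December ",
  ",Mkt-RF,SMB,HML,RF",
  "  1927,   29.47,   -2.46,   -3.75,    3.12",
  "  1928,   35.39,    4.20,   -6.15,    3.56"
)

blank <- which(Lines == "")   # positions of zero-length lines
st <- grep("^,", Lines)       # each table starts at a line beginning with a comma

# For each start, end at the first blank line after it,
# or at the last line of the file if no blank line follows
en <- vapply(st, function(s) {
  nxt <- blank[blank > s]
  if (length(nxt) > 0) nxt[1] else length(Lines)
}, integer(1))

# One data frame per table, in file order
L <- Map(function(s, e) read.csv(text = Lines[s:e]), st, en)
```

Here L[[1]] is the monthly table and L[[2]] the annual one.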

csv

r

data-manipulation